Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019
Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, Tuomas Virtanen
Abstract—Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions on an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked, and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a re-evaluation of submissions using these new metrics. The analysis reveals submissions with balanced performance, classifying sounds correctly close to their original location, and systems being strong on one or both of the two tasks, but not jointly.
Index Terms—Sound event localization and detection, sound source localization, acoustic scene analysis, microphone arrays
I. INTRODUCTION
Recognition of the classes of sound events in an audio recording and identification of their occurrences in time is a currently active topic of research, popularized as sound event detection (SED), with a wide range of applications [1]. While SED can reveal a lot about the recording environment, the spatial locations of events can bring valuable information for many applications. On the other hand, sound source localization is a classic multichannel signal processing task, based on sound propagation properties and signal relationships between channels, without considering the type of sound characterizing the sound source. A sound event localization and detection (SELD) system aims at a more complete spatiotemporal characterization of the acoustic scene by bringing SED and source localization together. The spatial dimension makes SELD suitable for a wide range of machine listening tasks, such as inference on the type of environment, self-localization, navigation without visual input or with occluded targets, tracking of sound sources of interest, and audio surveillance.
This work received funding from the European Research Council under the ERC Grant Agreement 637422 EVERYSOUND. A. Politis, A. Mesaros, S. Adavanne, T. Heittola and T. Virtanen are with the Faculty of Information Technology and Communication Sciences, Tampere University, Finland, e-mail: {archontis.politis, annamaria.mesaros, tuomas.virtanen}@tuni.fi

Additionally, it can aid human-machine interaction, in scene-information visualization systems, scene-based deployment of services, and assisted-hearing devices, among others.

The SELD task was included for the first time in the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge of 2019. In addition to the related studies that aim at detecting and localizing multiple speakers (see e.g. [2]), only a handful of approaches could be found in the literature up to that point [3]–[9]. Earlier studies treated the two problems of detection and localization separately, without trying to associate source positions and events. In those works, Gaussian mixture models (GMMs) [3], hidden Markov models (HMMs) [4], or support vector machines [6] were used for detection, while localization relied on classic array processing approaches such as time-difference-of-arrival (TDOA) estimation [3], steered-response power [4], or acoustic intensity vector analysis [6]. An early attempt at joining estimates from the two problems was presented in [5], where beamforming outputs from distributed arrays along with an HMM-GMM classifier are used to build a maximum-a-posteriori criterion on the most probable position in a room of a certain class. During the last decade, deep neural networks (DNNs) have become the most established method for SED, offering ample modeling flexibility and surpassing traditional machine learning methods when trained with adequate data [10]. Recently, DNNs have also been explored for learning-based source localization [11]–[13] with promising results.
Hence, DNNs seem like a good candidate for joint modeling of localization and detection in the SELD task. The first works we are aware of taking this approach are [8] and [9]. Hirvonen [8] proposed to set joint modeling as a multilabel-multiclass classification problem, mapping two event classes to eight discrete angles in azimuth. A convolutional neural network (CNN) was trained to infer probabilities of each sound class at each position, after which a predefined threshold was used to decide the final class presence and location. Adavanne et al. [9] proposed as an alternative a regression-based localization approach. Modeling was performed by a convolutional and recurrent neural network (CRNN) with two output branches, one performing SED and the other localization. In the localization branch, one regressor per class returned a continuous azimuth-elevation angle. Binary thresholding was used in the detection branch to indicate the temporal activity of each class, and that output was used to gate the respective direction-of-arrival (DoA) output, joining them together during inference. The proposed system, named SELDnet, was extensively compared against other architectures, for a variety of simulated and real data, and for different array configurations. Note that both DNN-based proposals were using simple generic input features, such as multichannel power spectrograms in [8], and magnitude and phase spectrograms in [9].

Due to its relevance in the aforementioned applications, the SELD task was introduced for the first time in the DCASE 2019 Challenge (http://dcase.community/challenge2019/) and received a remarkable number of submissions for a novel topic. A new dataset of spatialized sound events was generated for the task [14], and a SELDnet implementation was provided by the authors as a baseline for the challenge participants.
Beyond the works associated with the challenge [15]–[36], multiple works have followed, aiming to address the SELD task in a new way or to improve on the limitations of the challenge submissions [37]–[40].

This paper serves three major aims. Firstly, it presents an overview of the first SELD-related challenge. Secondly, it presents common considerations of SELD systems and discusses how these were addressed by the participants, highlighting novel solutions and common elements of the challenge submissions. Thirdly, the performance of the systems is analyzed by addressing the issue of evaluating joint detection and localization. Following the ranking of the systems in the challenge, we calculate confidence intervals for the challenge evaluation metrics and analyze submissions with respect to their performance in detection and localization separately. Additionally, we re-evaluate the systems using novel metrics proposed for joint evaluation of localization and detection [41], and investigate correlations between the different metrics and the ranking of the systems.

The paper is organized as follows: Section II presents the task description, dataset, baseline system, and evaluation, as defined in the challenge. Section III introduces and formulates the joint metrics for evaluation of localization and detection. Section IV presents the analysis of submitted systems, including the challenge results and detailed system characteristics. In Section V we re-evaluate the submissions with the new joint metrics, and analyze the results with a rank correlation analysis of the different metrics. Finally, Section VI presents the concluding remarks on the challenge task organization.

II. SOUND EVENT DETECTION AND LOCALIZATION IN DCASE 2019 CHALLENGE
The goal of the SELD task, given a multichannel recording, can be summarized as identifying individual sound events from a set of given classes, their temporal onset and offset times in the recording, and their spatial trajectories while they are active. In the 2019 challenge, the spatial parameter was the direction-of-arrival (DoA) in azimuth and elevation, and only static scenes were considered, meaning that each individual sound event instance in the provided recordings was spatially stationary with a fixed location during its entire duration. An example of such a system is shown in Fig. 1. (Baseline implementation: https://github.com/sharathadavanne/seld-dcase2019)

Fig. 1. A SELD system example and the baseline of the challenge (SELDnet).

A. Dataset
Creating a dataset for a SELD task presents some challenges, reflecting the high complexity of the problem. Ideally, a large range of sound events representative of each sound class should be reproduced at different times and temporal overlaps, at an enormous range of different positions in azimuth, elevation, and possibly distance from the microphones, covering the localization domain of interest. Furthermore, if the system is to be robust to varying acoustic conditions and different spaces, all the previous dimensions should be varied across different rooms. Staging real recordings with this degree of variability is not practical. Acoustic simulation of spatial room impulse responses (RIRs) for various room shapes and positions, with subsequent convolution of the sound event samples with them, is a viable alternative, explored for example in [9]. However, such simulators, with their simplifications of room geometry and acoustic scattering behavior, can deviate significantly from real spatial RIRs. Additionally, the non-directional ambient noise characteristic of the function of each space is present in reality, adding another component the SELD system should be robust to.

For DCASE 2019, we opted for a hybrid recording-simulation strategy that allowed us to control the detection, localization, and acoustical variability we needed. Real-life impulse responses were recorded at 5 indoor locations on the Hervanta campus of Tampere University, at 504 unique combinations of azimuth, elevation, and distance around the recording position. The measurements covered a domain of 360° in azimuth, −40°∼40° in elevation, and 1∼2 m in distance. Additionally, realistic ambient noise was recorded on-site with the recording setup unchanged.

Each spatial sound recording was synthesized as a one-minute multichannel mixture of spatialized sound events convolved with RIRs from the same space, with randomized onsets and source positions, and with up to two simultaneous events allowed. The RIRs were convolved with samples from the isolated sound events dataset provided with DCASE 2016 Task 2, Sound event detection in synthetic audio, containing 20 event samples for each of the 11 event classes. Finally, the recorded natural ambient noise from the same space was added to the synthesized mixture, at a 30 dB signal-to-noise ratio relative to the average power of the sound-event mixture at the array channels. Each mixture was provided in two different 4-channel recording formats, extracted from the same 32-channel recording equipment. The first was a tetrahedral microphone array of capsules mounted on a hard spherical body, while the second was the first-order Ambisonics spatial audio format. The two recording formats offer different possibilities in exploiting the spatial information captured between the channels. A development set was available during the challenge, and for the evaluation set only the audio without labels was released. The development and evaluation sets consist of 400 and 100 recordings respectively. A detailed description of the generation of the dataset is given in [14].

B. Baseline system
The SELDnet architecture of [9] was provided as the baseline architecture of the challenge. The rationale behind this choice was its conceptual and implementation simplicity, and its generality with respect to input features. Furthermore, even though SELDnet was very recent and had the best results among the tested methods in its publication, it still left a significant margin for improvement with realistic data, both in localization and detection accuracy. The architecture of the system is depicted in Fig. 1. It consists of three convolutional layers modeling spatial interchannel and sound event intrachannel time-frequency representations, followed by two bi-directional recurrent layers with gated recurrent units (GRUs) capturing longer temporal dependencies in the data. The following two output branches of fully-connected layers correspond to the individual tasks of SED and DoA estimation. The SED output is optimized with a cross-entropy loss, while the DoA output is optimized using the mean squared error of angular distances between reference and predicted DoAs. Contrary to the original SELDnet in [9], which output Cartesian DoA vectors, the implementation for the challenge returns azimuth and elevation angles directly. The network takes as input multichannel magnitude and phase spectrograms, stacked along the channel dimension. Reference SED outputs are expressed with one-hot encoding and reference DoAs with azimuth and elevation angles in radians. The network is trained using the Adam optimizer with a weighted combination of the two output losses, with more weight given to the localization loss. More details on the SELDnet challenge implementation can be found in [14]. (Sound event samples: https://archive.org/details/dcase2016_task2_train_dev, http://dcase.community/challenge2016/task-sound-event-detection-in-synthetic-audio; development set: https://zenodo.org/record/2580091; evaluation set: https://zenodo.org/record/3066124)

C. Evaluation and ranking
In this first implementation of the challenge the submitted systems were evaluated with respect to their detection and localization performance individually. For SED, the detection metrics were the F-score (F) and error rate (ER) computed in non-overlapping one-second segments [42]. For DoA estimation, two additional frame-wise metrics were used. The first is a conventional directional error (DE) expressing the angular distance between reference and predicted DoAs. Since multiple simultaneous estimates are possible, references and predictions need to be associated before errors can be computed. The Hungarian algorithm [43] was used for that purpose, and the final DE was computed as the minimum-cost association, divided by the number of associated DoAs. Since DE does not reflect how successfully a system detects localizable events, a second recall-type metric was introduced, termed frame recall (FR). Due to a more general introduction and reformulation of the metrics, DE is renamed in this work as localization error (LE), while FR is renamed as event count recall (ECR).

For a detailed picture of the overall performance, the submissions were ranked individually for each of the four metrics (F, ER, LE, ECR). A total ranking, aiming to indicate systems achieving good performance in all metrics, or exceptional performance in most of them, was obtained by summing the individual ranks and sorting the results in increasing order.

III. JOINT MEASUREMENT OF LOCALIZATION AND DETECTION PERFORMANCE
Sound localization and sound event detection are traditionally two different areas of research, but recent research addresses joint modeling and prediction of the two, motivating a joint evaluation. An example case illustrating the main drawback of employing separate evaluations for detection and localization (similar to Subsection II-C) is visualized in Fig. 2. Both participating systems have detected the two sound events correctly; however, in the output of the second system the spatial positions of the two classes are swapped. A standalone detection metric will only evaluate whether the system has correctly predicted the sound events, and similarly, a standalone localization metric will only evaluate the spatial errors between the closest sound pairs (ignoring the underlying sound classes), resulting in a perfect score for both systems in both aspects, despite the obvious error.
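This drawback can be reproduced numerically. The sketch below (plain Python; the class names, one-dimensional azimuths, and two-event frame are made-up stand-ins for the Fig. 2 scenario) scores the swapped output with a class-blind localization error and a location-blind F-score, both of which look perfect, and then with a class-aware association that exposes the swap:

```python
from itertools import permutations

def ang_dist(a, b):
    # Absolute azimuth difference in degrees: a 1-D toy stand-in
    # for the angular distance used in the paper.
    return abs(a - b)

# Hypothetical frame from Fig. 2: two references and two predictions.
ref  = [("dog", 30.0), ("cat", 150.0)]
pred = [("cat", 30.0), ("dog", 150.0)]   # classes swapped in space

def class_blind_le(ref, pred):
    """Localization error with minimum-cost assignment over positions only."""
    costs = []
    for perm in permutations(range(len(pred))):
        costs.append(sum(ang_dist(pred[i][1], ref[j][1])
                         for j, i in enumerate(perm)))
    return min(costs) / len(ref)

def location_blind_f1(ref, pred):
    """Detection F-score comparing only the sets of class labels."""
    tp = len({c for c, _ in ref} & {c for c, _ in pred})
    return 2 * tp / (len(ref) + len(pred))

def class_aware_le(ref, pred):
    """Localization error measured only between same-class pairs."""
    errs = [ang_dist(px, rx) for rc, rx in ref for pc, px in pred if rc == pc]
    return sum(errs) / len(errs)

print(class_blind_le(ref, pred))     # 0.0   -> looks perfect
print(location_blind_f1(ref, pred))  # 1.0   -> looks perfect
print(class_aware_le(ref, pred))     # 120.0 -> reveals the swap
```

The separate metrics award the erroneous output a perfect score in both aspects, while coupling class identity with location immediately reveals the 120° error.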
A. Metrics formulation
Since a spatial event is not distinguished only by its class, but also by its location, measurement ideally happens at the event level. Let us consider a SELD system that at a given temporal step predicts a set of M events P = {p_1, ..., p_i, ..., p_M}, where each event prediction is associated with a class label index b̃_i and a positional vector x̃_i, such that p_i = {b̃_i, x̃_i}. At the same time, N reference events exist as R = {r_1, ..., r_j, ..., r_N}, with each reference event being of class index b_j at position x_j, denoted as r_j = {b_j, x_j}. We assume a total of C possible class labels that are ordered, such that b ∈ [1, ..., C]. Note that, contrary to traditional SED, where predictions and references are class-based, it is possible that more than one event in P or R is of the same class.

Fig. 2. Example reference and predicted sound events and locations. Circles denote reference sounds, rectangles system output. Two systems evaluated separately for detection and localization performance; based on the measured performance, they both have a perfect score. (System 1 outputs the correct class at each reference location; System 2 outputs the two classes with their locations swapped.)

We begin by considering localization-only metrics, neglecting classification. Every combination of prediction x̃_i and reference x_j is associated spatially with an appropriate distance metric d(x̃_i, x_j), such as angular distance in the case of DoAs, or Euclidean distance in the case of Cartesian positions. Such distances can be expressed with an M × N distance matrix D, where each element is given by [D]_ij = d(x̃_i, x_j). Before measuring a mean LE across events, references and predictions should be associated using, for example, a minimum-cost assignment algorithm such as the Hungarian algorithm, A = H(D). The M × N binary association matrix A can have at most one unity entry in each column and row, meaning that only K = min(M, N) = ||A|| predictions and references are associated and contribute to the LE

  LE = (1/K) Σ_{i,j} a_ij d_ij = ||A ⊙ D|| / ||A||,   (1)

where ||·|| is the L_{1,1} entrywise matrix norm, and ⊙ the entrywise matrix product.

The above localization precision gives a partial performance picture because it does not take into account misses or false alarms of localized sounds.
To that purpose, we introduce a simple metric termed localization recall (LR), expressed as

  LR = Σ_l min(M^(l), N^(l)) / Σ_l N^(l) = Σ_l ||A^(l)|| / Σ_l N^(l),   (2)

where summation happens across temporal frame outputs, or some other preferred averaged segmental representation. Finally, a related but more concentrated metric of interest may be the number of frames or segments for which the system detects the correct number of references, M = N. We name this metric event count recall (ECR). ECR corresponds to

  ECR = (1/L) Σ_l 1(M^(l) = N^(l)),   (3)

where L is the total number of segments, and 1(·) is the indicator function, returning one if its argument is true, and zero otherwise. Note that ECR was termed frame recall in the challenge evaluation, and in [9], [11], but we opted here for a more descriptive name of its counting objective.

Often, a localization method needs to be evaluated only under a certain level of spatial precision, usually expressed through an application-dependent threshold Θ. Such a threshold on the above metrics can be applied by constructing an M × N binary matrix T with unity entries only on the associated reference-prediction pairs that are closer than the threshold, [T]_ij = 1([D]_ij ≤ Θ). The number of associated predictions that pass the threshold is then given by K_≤Θ = ||T ⊙ A||. The thresholded metrics are

  LE_≤Θ = (1/K_≤Θ) Σ_{i,j} t_ij a_ij d_ij = ||T ⊙ A ⊙ D|| / ||T ⊙ A||,   (4)
  LR_≤Θ = Σ_l K^(l)_≤Θ / Σ_l N^(l) = Σ_l ||T^(l) ⊙ A^(l)|| / Σ_l N^(l),   (5)
  ECR_≤Θ = (1/L) Σ_l 1(K^(l)_≤Θ = N^(l)).   (6)

Considering the fact that events have a class label in SELD, it is more informative to measure localization performance only between events that are correctly classified (class-aware localization). Similarly, we may want to impose a spatial constraint on correct classifications, such that events classified correctly, but very far from their spatial reference, are considered invalid (location-aware detection). For both modes, we:

1) Find subsets P_c = {p_i | b̃_i = c} of predicted and R_c = {r_j | b_j = c} of reference events classified in class c ∈ [1, ..., C]. The resulting class-specific number of predictions is M_c and of references N_c.
2) Compute a class-dependent M_c × N_c distance matrix D_c between predictions P_c and references R_c, and compute the respective association matrix A_c = H(D_c).
3) Determine a suitable application-specific spatial threshold Θ for location-aware detection. Construct the thresholding binary matrix T_c from D_c, and determine the number of associated predictions K_c = ||A_c|| = min(M_c, N_c), and the number of associated predictions which pass the threshold K_{c,≤Θ} = ||T_c ⊙ A_c||.
4) After association, count true positives TP, false negatives FN, and false positives FP as follows:
  TP_{c,≤Θ} = K_{c,≤Θ}   (7)
  FP_{c,≤Θ} = max(0, M_c − N_c) + min(M_c, N_c) − K_{c,≤Θ}   (8)
  FN_c = max(0, N_c − M_c).   (9)

A simple example is illustrated in Fig. 3, where the reference annotation contains three sound events: dog, car horn, and child, while the system output contains two: dog and cat, at their respective positions. The joint evaluation will compare both the labels and the locations for correctness; therefore it will characterize the localization error in the "dog"–"dog" pair, and consider the other events as errors (false positives and false negatives). Note that with the above setup false negatives do not depend on the threshold, while false positives include both the extraneous predictions, and associated predictions that did not pass the threshold. Based on the above, we are able to measure location-aware detection metrics such as precision, recall, F1-score, or error rates.

Fig. 3. Example reference and predicted sound events and locations around the microphone. Circles denote reference sounds, rectangles system output.

Regarding class-aware localization, we compute the localization error (LE_c) and localization recall (LR_c) of Eq. (1–2) only between predictions and references of class c:

  LE_c = ||A_c ⊙ D_c|| / ||A_c||   (10)
  LR_c = Σ_l ||A_c^(l)|| / Σ_l N_c^(l).   (11)

The overall class-dependent LE_CD and LR_CD are computed as the class means of Eq. (10–11):

  LE_CD = (1/(C·L)) Σ_c Σ_l LE_c^(l)   (12)
  LR_CD = (1/C) Σ_c LR_c.   (13)

In some applications it may be of interest to have both class-dependent and thresholded localization metrics, similar to Eq. (4–6). In the joint measurement results of this study we use the non-thresholded versions of Eq. (10–11). It is also worth noting that different thresholds per class, Θ_c, may be accommodated in the above framework, to reflect different spatial tolerances for certain classes depending on the application.

B. Segment-based measurement
Segment-based metrics are commonly used in sound event detection. Segment-based detection metrics generalize the frame-based binary activity of sound events to the corresponding activity at segment level. In [42], this generalization is done by considering an event to be active at segment level if it is active in at least one frame within the segment. A similar generalization of the localization metrics to a different time scale can be formulated through a spherical mean DoA vector or Cartesian mean positional vector x̂ of all predictions x̃^(l) of the corresponding event within the segment, before localization errors are measured. Alternatively, the average localization error within a segment can be computed based on the frame-based pairs of reference and predicted events. Both approaches are introduced and compared in [41], with comparable results.

IV. CHALLENGE RESULTS
Even though the SELD task was introduced in DCASE 2019 for the first time, it attracted a lot of interest and received the second highest number of submissions among the challenge tasks. In total 58 systems were submitted by 22 teams comprising 65 members. The participants were affiliated with 16 universities and 8 companies.
A. Overall challenge results
The overall results of the challenge are presented in Table I. Only the best system of each team is presented, and the systems are ordered by their official challenge rank as described in Section II-C. In addition to the results displayed on the challenge webpage, this table includes the 95% confidence intervals for each separate metric, estimated using the jackknife procedure presented in [1]. The method is a resampling technique that estimates a parameter from a random sample of data for a population using partial estimates. Confidence intervals by jackknifing are coarse approximations, but applicable in cases where the underlying distribution of the parameter to be estimated is unknown. In our case the parameters are metrics that depend on individual combinations of active sounds at each time, and the jackknife method allows estimating the confidence intervals without making any assumption on their distribution. The partial estimates for all metrics were calculated in a leave-one-out manner, excluding, in turns, one audio file from the evaluation set.

Among the 22 submitted systems, 17 ranked higher than the baseline system using the official ranking method. In terms of the individual metrics, 17 systems had better ER and F-scores than the baseline, with the best ER and F-scores being 0.06 [17], [18] and 96.7% [18] respectively. Similarly, 18 systems had better LE and 14 systems had higher ECR, with the best LE of 2.7° [22] and ECR of 96.8% [15].

The top-10 systems of Table I are illustrated with respect to detection metrics in Fig. 4a and localization metrics in Fig. 4b. The best system in both plots is in the corresponding top-left corner. We observe that the ranking order of the submitted systems is different for detection and localization metrics. For instance, the best system according to detection metrics, He_THU [18] (Fig. 4a top-left corner), fares poorly in DoA estimation compared to the other top-10 systems, and hence achieves an overall rank of four. Similarly, although Chang_HYU [22] achieved the best LE among the top-10 systems, its detection performance was among the poorest of the top-10 systems, and hence it achieved a rank of eight. In general, the ER and F-scores of event detection are correlated, and hence all the submitted systems are observed along the diagonal. This diagonal behavior is not observed with the localization metrics, as LE and ECR are not directly, or only weakly, correlated.

All systems had at least one deep learning component in their approach. Specifically, apart from [33] and [35], which employed a CNN architecture with no recurrent layers, the remaining 20 systems employed different versions of the baseline CRNN architecture as one of their components. Three of the submitted systems employed a parametric DoA estimation approach [20], [29], [32] along with CRNN-based classification. The best parametric DoA approach [20] achieved the 6th position. Among the DNN-based SELD methods, nine of them employed multi-task learning [44] for joint
TABLE I. Challenge results of submitted systems (Rank, System, ER, F1, LE, ECR, with 95% confidence intervals). The rank is based on the cumulative rank over the four calculated metrics. Best system per team according to the official challenge ranking. Best score indicated for the separate metrics.

TABLE II. Summary of submitted systems. The rank is based on the cumulative rank over the four calculated metrics. Best system per team according to the official challenge ranking.

System | Audio | Features | Classifier | Multi-task
1 Kapka_SRPOL_2 [15] | AMB | Phase and magnitude spectra | CRNN | ×
2–9 | …
10 Park_ETRI_1 [24] | Both | Log-mel and intensity vectors | CRNN, TrellisNet | ✓
11 Leung_DBS_2 [25] | AMB | Log-magnitude, phase, and cross spectra | CRNN ensemble | ✓
12 Grondin_MIT_1 [26] | MIC | Phase and magnitude spectra, GCC and TDOA | CRNN ensemble | ×
13 ZhaoLu_UESTC_1 [27] | MIC | Log-mel spectra | CRNN | ✓
14 Rough_EMED_2 [28] | MIC | Phase and magnitude spectra | CRNN | ×
15 Tan_NTU_1 [29] | MIC | Log-mel spectra and GCC | ResNet RNN, parametric DoA | ×
16 Cordourier_IL_2 [30] | MIC | Phase and magnitude spectra, and GCC | CRNN ensemble | ✓
17 Krause_AGH_4 [31] | AMB | Phase and magnitude spectra | CRNN ensemble | ✓
18 Adavanne_TAU_FOA [14] | AMB | Phase and magnitude spectra | CRNN | ✓
19 Perezlopez_UPF_1 [32] | AMB | Log-mel spectra | CRNN, parametric DoA | ×
20 Chytas_UTH_1 [33] | MIC | Raw audio and power spectra | CNN ensemble | ×
21 Anemueller_UOL_3 [34] | AMB | Group-delay and magnitude spectra | CRNN | ✓
22 Kong_SURREY_1 [35] | AMB | Magnitude spectra | CNN | ✓
23 Lin_YYZN_1 [36] | AMB | Phase and magnitude spectra | CRNN | ✓
SED and DoA estimation. The remaining systems, including the top-ranked system [15], employed separate networks for SED and DoA estimation, and performed engineered data association of their respective outputs. Finally, there was no significant improvement in SELD performance with the choice of either of the two audio formats in the dataset. Among the top-10 ranked systems, four used the microphone array format, three used the Ambisonic format, and the rest used both formats as input.
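As a toy sketch of such joining of outputs, the snippet below gates per-class DoA regressions with the SED activity, in the spirit of the SELDnet inference described in Section I; the class list, probabilities, and the 0.5 threshold are illustrative assumptions, not any particular submission's settings:

```python
# Toy frame-wise gating of per-class DoA regressions by SED activity.
# sed[t][c]: activity probability of class c at frame t;
# doa[t][c]: (azimuth, elevation) regressed for class c at frame t.
SED_THRESHOLD = 0.5   # illustrative value

def gate_outputs(sed, doa, classes, threshold=SED_THRESHOLD):
    """Return per-frame lists of (class, (az, el)) events: the DoA
    output of a class is kept only when its detection is active."""
    events = []
    for sed_t, doa_t in zip(sed, doa):
        frame = [(classes[c], doa_t[c])
                 for c, p in enumerate(sed_t) if p >= threshold]
        events.append(frame)
    return events

classes = ["speech", "dog"]
sed = [[0.9, 0.2],   # frame 0: only "speech" active
       [0.7, 0.8]]   # frame 1: both classes active
doa = [[(30.0, 0.0), (120.0, -10.0)],
       [(32.0, 0.0), (118.0, -10.0)]]

for frame in gate_outputs(sed, doa, classes):
    print(frame)
# frame 0 -> [('speech', (30.0, 0.0))]
# frame 1 -> [('speech', (32.0, 0.0)), ('dog', (118.0, -10.0))]
```

Submissions with separate SED and DoA networks face the same association step, but typically with heuristics that also have to match event counts when the two networks disagree.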
B. Analysis of individual systems
The system characteristics of all the submissions are summarized in Table II. A more detailed analysis of some of the systems follows, along with a summary of the most prominent architectural, input feature, and training characteristics.

Kapka & Lewandowski (Kapka_SRPOL) [15] was the top performing system of the challenge, with very high performance in both localization and detection. There was minimal feature engineering, and the pure magnitude and phase spectrograms of the FOA format were used as input. However,

Fig. 4. Top-10 systems: (a) detection metrics (error rate vs. F1-score); (b) localization metrics.