Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019
Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, Tuomas Virtanen
Abstract—Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions on an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked, and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a re-evaluation of submissions using these new metrics. The analysis reveals submissions with balanced performance, classifying sounds correctly close to their original location, and systems being strong on one or both of the two tasks, but not jointly.
Index Terms—Sound event localization and detection, sound source localization, acoustic scene analysis, microphone arrays
I. INTRODUCTION
Recognition of the classes of sound events in an audio recording and identification of their occurrences in time is a currently active topic of research, popularized as sound event detection (SED), with a wide range of applications [1]. While SED can reveal a lot about the recording environment, the spatial locations of events can bring valuable information for many applications. On the other hand, sound source localization is a classic multichannel signal processing task, based on sound propagation properties and signal relationships between channels, without considering the type of sound characterizing the sound source. A sound event localization and detection (SELD) system aims at a more complete spatiotemporal characterization of the acoustic scene by bringing SED and source localization together. The spatial dimension makes SELD suitable for a wide range of machine listening tasks, such as inference on the type of environment, self-localization, navigation without visual input or with occluded targets, tracking of sound sources of interest, and audio surveillance.
This work received funding from the European Research Council under the ERC Grant Agreement 637422 EVERYSOUND. A. Politis, A. Mesaros, S. Adavanne, T. Heittola and T. Virtanen are with the Faculty of Information Technology and Communication Sciences, Tampere University, Finland, e-mail: {archontis.politis, annamaria.mesaros, tuomas.virtanen}@tuni.fi

Additionally, it can aid human-machine interaction, in scene-information visualization systems, scene-based deployment of services, and assisted-hearing devices, among others.

The SELD task was included for the first time in the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge of 2019. In addition to the related studies that aim at detecting and localizing multiple speakers (see e.g. [2]), only a handful of approaches could be found in the literature up to that point [3]–[9]. Earlier studies treated the two problems of detection and localization separately, without trying to associate source positions and events. In those works, Gaussian mixture models (GMMs) [3], hidden Markov models (HMMs) [4], or support vector machines [6] were used for detection, while localization relied on classic array processing approaches such as time-difference-of-arrival (TDOA) estimation [3], steered-response power [4], or acoustic intensity vector analysis [6]. An early attempt at joining estimates from the two problems was presented in [5], where beamforming outputs from distributed arrays along with an HMM-GMM classifier are used to build a maximum-a-posteriori criterion on the most probable position in a room of a certain class. During the last decade, deep neural networks (DNNs) have become the most established method for SED, offering ample modeling flexibility and surpassing traditional machine learning methods when trained with adequate data [10]. Recently, DNNs have also been explored for learning-based source localization [11]–[13] with promising results.
Hence, DNNs seem like a good candidate for joint modeling of localization and detection in the SELD task. The first works we are aware of taking this approach are [8] and [9]. Hirvonen [8] proposed to set joint modeling as a multilabel-multiclass classification problem, mapping two event classes to eight discrete angles in azimuth. A convolutional neural network (CNN) was trained to infer probabilities of each sound class at each position, after which a predefined threshold was used to decide the final class presence and location. Adavanne et al. [9] proposed as an alternative a regression-based localization approach. Modeling was performed by a convolutional and recurrent neural network (CRNN) with two output branches, one performing SED and the other localization. In the localization branch, one regressor per class returned a continuous azimuth-elevation angle. Binary thresholding was used in the detection branch to indicate the temporal activity of each class, and that output was used to gate the respective direction-of-arrival (DoA) output, joining them together during inference. The proposed system, named SELDnet, was extensively compared against other architectures, for a variety of simulated and real data, and for different array configurations. Note that both DNN-based proposals were using simple generic input features, such as multichannel power spectrograms in [8], and magnitude and phase spectrograms in [9].

Due to its relevance in the aforementioned applications, the SELD task was introduced for the first time in the DCASE 2019 Challenge (http://dcase.community/challenge2019/) and received a remarkable number of submissions for a novel topic. A new dataset of spatialized sound events was generated for the task [14], and a SELDnet implementation was provided by the authors as a baseline for the challenge participants.
Beyond the works associated with the challenge [15]–[36], multiple works have followed, aiming to address the SELD task in a new way or to improve on the limitations of the challenge submissions [37]–[40].

This paper serves three major aims. Firstly, it presents an overview of the first SELD-related challenge. Secondly, it presents common considerations of SELD systems and discusses how these were addressed by the participants, highlighting novel solutions and common elements of the challenge submissions. Thirdly, the performance of the systems is analyzed by addressing the issue of evaluating joint detection and localization. Following the ranking of the systems in the challenge, we calculate confidence intervals for the challenge evaluation metrics and analyze submissions with respect to their performance in detection and localization separately. Additionally, we re-evaluate the systems using novel metrics proposed for joint evaluation of localization and detection [41], and investigate correlations between the different metrics and the ranking of the systems.

The paper is organized as follows: Section II presents the task description, dataset, baseline system, and evaluation, as defined in the challenge. Section III introduces and formulates the joint metrics for evaluation of localization and detection. Section IV presents the analysis of submitted systems, including the challenge results and detailed system characteristics. In Section V we re-evaluate the submissions with the new joint metrics, and analyze the results with a rank correlation analysis of the different metrics. Finally, Section VI presents the concluding remarks on the challenge task organization.

II. SOUND EVENT DETECTION AND LOCALIZATION IN DCASE 2019 CHALLENGE
The goal of the SELD task, given a multichannel recording, can be summarized as identifying individual sound events from a set of given classes, their temporal onset and offset times in the recording, and their spatial trajectories while they are active. In the 2019 challenge, the spatial parameter was the direction-of-arrival (DoA) in azimuth and elevation, and only static scenes were considered, meaning that each individual sound event instance in the provided recordings was spatially stationary with a fixed location during its entire duration. An example of such a system is shown in Fig. 1. (Baseline implementation: https://github.com/sharathadavanne/seld-dcase2019)

Fig. 1. A SELD system example and the baseline of the challenge (SELDnet).

A. Dataset
Creating a dataset for a SELD task presents some challenges, reflecting the high complexity of the problem. Ideally, a large range of sound events representative of each sound class should be reproduced at different times and temporal overlaps, at an enormous range of different positions in azimuth, elevation, and possibly distance from the microphones, covering the localization domain of interest. Furthermore, if the system is to be robust to varying acoustic conditions and different spaces, all the previous dimensions should be varied across different rooms. Staging real recordings with this degree of variability is not practical. Acoustic simulation of spatial room impulse responses (RIRs) for various room shapes and positions, with subsequent convolution of the sound event samples with them, is a viable alternative, explored for example in [9]. However, such simulators, with their simplifications of room geometry and acoustic scattering behavior, can deviate significantly from real spatial RIRs. Additionally, the non-directional ambient noise characteristic of the function of each space is present in reality, adding another component the SELD system should be robust to.

For DCASE 2019, we opted for a hybrid recording-simulation strategy that allowed us to control the detection, localization, and acoustical variability we needed. Real-life impulse responses were recorded at 5 indoor locations on the Hervanta campus of Tampere University, at 504 unique combinations of azimuth, elevation, and distance around the recording position. The measurements covered a domain of 360° in azimuth, −40°∼40° in elevation, and 1∼2 m in distance. Additionally, realistic ambient noise was recorded on-site with the recording setup unchanged.

Each spatial sound recording was synthesized as a one-minute multichannel mixture of spatialized sound events convolved with RIRs from the same space, with randomized onsets and source positions, and with up to two simultaneous events allowed. The RIRs were convolved with samples from the isolated sound events dataset provided with DCASE 2016 Task 2, Sound event detection in synthetic audio, containing 20 event samples for each of the 11 event classes. Finally, the recorded natural ambient noise from the same space was added to the synthesized mixture, at a 30 dB signal-to-noise ratio relative to the average power of the sound-event mixture at the array channels. Each mixture was provided in two different 4-channel recording formats, extracted from the same 32-channel recording equipment. The first was a tetrahedral microphone array of capsules mounted on a hard spherical body, while the second was the first-order Ambisonics spatial audio format. The two recording formats offer different possibilities in exploiting the spatial information captured between the channels. A development set was available during the challenge, and for the evaluation set only the audio without labels was released. The development and evaluation sets consist of 400 and 100 recordings respectively. A detailed description of the generation of the dataset is given in [14].

B. Baseline system
The SELDnet architecture of [9] was provided as the baseline architecture of the challenge. The rationale behind this choice was its conceptual and implementation simplicity, and its generality with respect to input features. Furthermore, even though SELDnet was very recent and had the best results among the tested methods in its publication, it still left a significant margin for improvement with realistic data, both in localization and detection accuracy. The architecture of the system is depicted in Fig. 1. It consists of three convolutional layers modeling spatial interchannel and sound event intrachannel time-frequency representations, followed by two bi-directional recurrent layers with gated recurrent units (GRUs) capturing longer temporal dependencies in the data. The following two output branches of fully-connected layers correspond to the individual tasks of SED and DoA estimation. The SED output is optimized with a cross-entropy loss, while the DoA output is optimized using the mean squared error of angular distances between reference and predicted DoAs. Contrary to the original SELDnet in [9], which output Cartesian DoA vectors, the implementation for the challenge returns azimuth and elevation angles directly. The network takes as input multichannel magnitude and phase spectrograms, stacked along the channel dimension. Reference SED outputs are expressed with one-hot encoding and reference DoAs with azimuth and elevation angles in radians. The network is trained using the Adam optimizer with a weighted combination of the two output losses, with more weight given to the localization loss. More details on the SELDnet challenge implementation can be found in [14]. (Sound event samples: https://archive.org/details/dcase2016_task2_train_dev, http://dcase.community/challenge2016/task-sound-event-detection-in-synthetic-audio; development set: https://zenodo.org/record/2580091; evaluation set: https://zenodo.org/record/3066124)

C. Evaluation and ranking
In this first implementation of the challenge the submitted systems were evaluated with respect to their detection and localization performance individually. For SED, the detection metrics were the F-score (F) and error rate (ER) computed in non-overlapping one-second segments [42]. For DoA estimation, two additional frame-wise metrics were used. The first is a conventional directional error (DE) expressing the angular distance between reference and predicted DoAs. Since multiple simultaneous estimates are possible, references and predictions need to be associated before errors can be computed. The Hungarian algorithm [43] was used for that purpose, and the final DE was computed as the minimum-cost association, divided by the number of associated DoAs. Since DE does not reflect how successfully a system detects localizable events, a second recall-type metric was introduced, termed frame recall (FR). Due to a more general introduction and reformulation of the metrics, DE is renamed in this work as localization error (LE), while FR is renamed as event count recall (ECR).

For a detailed picture of the overall performance, the submissions were ranked individually for each of the four metrics (F, ER, LE, ECR). A total ranking, aiming to indicate systems achieving good performance in all metrics, or exceptional performance in most of them, was obtained by summing the individual ranks and sorting the results in increasing order.

III. JOINT MEASUREMENT OF LOCALIZATION AND DETECTION PERFORMANCE
Sound localization and sound event detection are traditionally two different areas of research, but recent research addresses joint modeling and prediction of the two, motivating a joint evaluation. An example case illustrating the main drawback of employing separate evaluations for detection and localization (similar to Subsection II-C) is visualized in Fig. 2. Both participating systems have detected the two sound events correctly; however, in the output of the second system the spatial positions of the two classes are swapped. A standalone detection metric will only evaluate whether the system has correctly predicted the sound events, and similarly, a standalone localization metric will only evaluate the spatial errors between the closest sound pairs (ignoring the underlying sound classes), resulting in a perfect score for both systems in both aspects, despite the obvious error.
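This drawback can be reproduced numerically. The sketch below (plain Python; the class names, one-dimensional azimuths, and two-event frame are made-up stand-ins for the Fig. 2 scenario) scores the swapped output with a class-blind localization error and a location-blind F-score, both of which look perfect, and then with a class-aware association that exposes the swap:

```python
from itertools import permutations

def ang_dist(a, b):
    # Absolute azimuth difference in degrees: a 1-D toy stand-in
    # for the angular distance used in the paper.
    return abs(a - b)

# Hypothetical frame from Fig. 2: two references and two predictions.
ref  = [("dog", 30.0), ("cat", 150.0)]
pred = [("cat", 30.0), ("dog", 150.0)]   # classes swapped in space

def class_blind_le(ref, pred):
    """Localization error with minimum-cost assignment over positions only."""
    costs = []
    for perm in permutations(range(len(pred))):
        costs.append(sum(ang_dist(pred[i][1], ref[j][1])
                         for j, i in enumerate(perm)))
    return min(costs) / len(ref)

def location_blind_f1(ref, pred):
    """Detection F-score comparing only the sets of class labels."""
    tp = len({c for c, _ in ref} & {c for c, _ in pred})
    return 2 * tp / (len(ref) + len(pred))

def class_aware_le(ref, pred):
    """Localization error measured only between same-class pairs."""
    errs = [ang_dist(px, rx) for rc, rx in ref for pc, px in pred if rc == pc]
    return sum(errs) / len(errs)

print(class_blind_le(ref, pred))     # 0.0   -> looks perfect
print(location_blind_f1(ref, pred))  # 1.0   -> looks perfect
print(class_aware_le(ref, pred))     # 120.0 -> reveals the swap
```

The separate metrics award the erroneous output a perfect score in both aspects, while coupling class identity with location immediately reveals the 120° error.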
A. Metrics formulation
Since a spatial event is not distinguished only by its class, but also by its location, measurement ideally happens at the event level. Let us consider a SELD system that at a given temporal step predicts a set of M events P = {p_1, ..., p_i, ..., p_M}, where each event prediction is associated with a class label index b̃_i and a positional vector x̃_i, such that p_i = {b̃_i, x̃_i}. At the same time, N reference events exist as R = {r_1, ..., r_j, ..., r_N}, with each reference event being of class index b_j at position x_j, denoted as r_j = {b_j, x_j}. We assume a total of C possible class labels that are ordered, such that b ∈ [1, ..., C]. Note that, contrary to traditional SED, where predictions and references are class-based, it is possible that more than one event in P or R is of the same class.

Fig. 2. Example reference and predicted sound events and locations. Circles denote reference sounds, rectangles system output. Two systems evaluated separately for detection and localization performance; based on the measured performance, they both have a perfect score. (System 1 outputs the correct class at each reference location; System 2 outputs the two classes with their locations swapped.)

We begin by considering localization-only metrics, neglecting classification. Every combination of prediction x̃_i and reference x_j is associated spatially with an appropriate distance metric d(x̃_i, x_j), such as angular distance in the case of DoAs, or Euclidean distance in the case of Cartesian positions. Such distances can be expressed with an M × N distance matrix D, where each element is given by [D]_ij = d(x̃_i, x_j). Before measuring a mean LE across events, references and predictions should be associated using, for example, a minimum-cost assignment algorithm such as the Hungarian algorithm, A = H(D). The M × N binary association matrix A can have at most one unity entry in each column and row, meaning that only K = min(M, N) = ||A|| predictions and references are associated and contribute to the LE

  LE = (1/K) Σ_{i,j} a_ij d_ij = ||A ⊙ D|| / ||A||,   (1)

where ||·|| is the L_{1,1} entrywise matrix norm, and ⊙ the entrywise matrix product.

The above localization precision gives a partial performance picture because it does not take into account misses or false alarms of localized sounds.
To that purpose, we introduce a simple metric termed localization recall (LR), expressed as

  LR = Σ_l min(M^(l), N^(l)) / Σ_l N^(l) = Σ_l ||A^(l)|| / Σ_l N^(l),   (2)

where summation happens across temporal frame outputs, or some other preferred averaged segmental representation. Finally, a related but more concentrated metric of interest may be the number of frames or segments for which the system detects the correct number of references, M = N. We name this metric event count recall (ECR). ECR corresponds to

  ECR = (1/L) Σ_l 1(M^(l) = N^(l)),   (3)

where L is the total number of segments, and 1(·) is the indicator function, returning one if its argument is true, and zero otherwise. Note that ECR was termed frame recall in the challenge evaluation, and in [9], [11], but we opted here for a more descriptive name of its counting objective.

Often, a localization method needs to be evaluated only under a certain level of spatial precision, usually expressed through an application-dependent threshold Θ. Such a threshold on the above metrics can be applied by constructing an M × N binary matrix T with unity entries only on the associated reference-prediction pairs that are closer than the threshold, [T]_ij = 1([D]_ij ≤ Θ). The number of associated predictions that pass the threshold is then given by K_≤Θ = ||T ⊙ A||. The thresholded metrics are

  LE_≤Θ = (1/K_≤Θ) Σ_{i,j} t_ij a_ij d_ij = ||T ⊙ A ⊙ D|| / ||T ⊙ A||,   (4)
  LR_≤Θ = Σ_l K^(l)_≤Θ / Σ_l N^(l) = Σ_l ||T^(l) ⊙ A^(l)|| / Σ_l N^(l),   (5)
  ECR_≤Θ = (1/L) Σ_l 1(K^(l)_≤Θ = N^(l)).   (6)

Considering the fact that events have a class label in SELD, it is more informative to measure localization performance only between events that are correctly classified (class-aware localization). Similarly, we may want to impose a spatial constraint on correct classifications, such that events classified correctly, but very far from their spatial reference, are considered invalid (location-aware detection). For both modes, we:

1) Find subsets P_c = {p_i | b̃_i = c} of predicted and R_c = {r_j | b_j = c} of reference events classified in class c ∈ [1, ..., C]. The resulting class-specific number of predictions is M_c and of references N_c.
2) Compute a class-dependent M_c × N_c distance matrix D_c between predictions P_c and references R_c, and compute the respective association matrix A_c = H(D_c).
3) Determine a suitable application-specific spatial threshold Θ for location-aware detection. Construct the thresholding binary matrix T_c from D_c, and determine the number of associated predictions K_c = ||A_c|| = min(M_c, N_c), and the number of associated predictions which pass the threshold K_{c,≤Θ} = ||T_c ⊙ A_c||.
4) After association, count true positives TP, false negatives FN, and false positives FP as follows:
  TP_{c,≤Θ} = K_{c,≤Θ}   (7)
  FP_{c,≤Θ} = max(0, M_c − N_c) + min(M_c, N_c) − K_{c,≤Θ}   (8)
  FN_c = max(0, N_c − M_c).   (9)

A simple example is illustrated in Fig. 3, where the reference annotation contains three sound events: dog, car horn, and child, while the system output contains two: dog and cat, at their respective positions. The joint evaluation will compare both the labels and the locations for correctness; therefore it will characterize the localization error in the "dog"–"dog" pair, and consider the other events as errors (false positives and false negatives). Note that with the above setup false negatives do not depend on the threshold, while false positives include both the extraneous predictions, and associated predictions that did not pass the threshold. Based on the above, we are able to measure location-aware detection metrics such as precision, recall, F1-score, or error rates.

Fig. 3. Example reference and predicted sound events and locations around the microphone. Circles denote reference sounds, rectangles system output.

Regarding class-aware localization, we compute the localization error (LE_c) and localization recall (LR_c) of Eq. (1–2) only between predictions and references of class c:

  LE_c = ||A_c ⊙ D_c|| / ||A_c||   (10)
  LR_c = Σ_l ||A_c^(l)|| / Σ_l N_c^(l).   (11)

The overall class-dependent LE_CD and LR_CD are computed as the class means of Eq. (10–11):

  LE_CD = (1/(C·L)) Σ_c Σ_l LE_c^(l)   (12)
  LR_CD = (1/C) Σ_c LR_c.   (13)

In some applications it may be of interest to have both class-dependent and thresholded localization metrics, similar to Eq. (4–6). In the joint measurement results of this study we use the non-thresholded versions of Eq. (10–11). It is also worth noting that different thresholds per class, Θ_c, may be accommodated in the above framework, to reflect different spatial tolerances for certain classes depending on the application.

B. Segment-based measurement
Segment-based metrics are commonly used in sound event detection. Segment-based detection metrics generalize the frame-based binary activity of sound events to the corresponding activity at segment level. In [42], this generalization is done by considering an event to be active at segment level if it is active in at least one frame within the segment. A similar generalization of the localization metrics to a different time scale can be formulated through a spherical mean DoA vector or Cartesian mean positional vector x̂ of all predictions x̃^(l) of the corresponding event within the segment, before localization errors are measured. Alternatively, the average localization error within a segment can be computed based on the frame-based pairs of reference and predicted events. Both approaches are introduced and compared in [41], with comparable results.

IV. CHALLENGE RESULTS
Even though the SELD task was introduced in DCASE 2019 for the first time, it attracted a lot of interest and received the second highest number of submissions among the challenge tasks. In total 58 systems were submitted by 22 teams comprising 65 members. The participants were affiliated with 16 universities and 8 companies.
A. Overall challenge results
The overall results of the challenge are presented in Table I. Only the best system of each team is presented, and the systems are ordered by their official challenge rank as described in Section II-C. In addition to the results displayed on the challenge webpage, this table includes the 95% confidence intervals for each separate metric, estimated using the jackknife procedure presented in [1]. The method is a resampling technique that estimates a parameter from a random sample of data for a population using partial estimates. Confidence intervals by jackknifing are coarse approximations, but applicable in cases where the underlying distribution of the parameter to be estimated is unknown. In our case the parameters are metrics that depend on individual combinations of active sounds at each time, and the jackknife method allows estimating the confidence intervals without making any assumption on their distribution. The partial estimates for all metrics were calculated in a leave-one-out manner, excluding, in turns, one audio file from the evaluation set.

Among the 22 submitted systems, 17 ranked higher than the baseline system using the official ranking method. In terms of the individual metrics, 17 systems had better ER and F-scores than the baseline, with the best ER and F-scores being 0.06 [17], [18] and 96.7% [18] respectively. Similarly, 18 systems had better LE and 14 systems had higher ECR, with the best LE of 2.7° [22] and ECR of 96.8% [15].

The top-10 systems of Table I are illustrated with respect to detection metrics in Fig. 4a and localization metrics in Fig. 4b. The best system in both plots is in the corresponding top-left corner. We observe that the ranking order of the submitted systems is different for detection and localization metrics. For instance, the best system according to detection metrics, He_THU [18] (Fig. 4a top-left corner), fares poorly in DoA estimation compared to the other top-10 systems, and hence achieves an overall rank of four. Similarly, although Chang_HYU [22] achieved the best LE among the top-10 systems, its detection performance was among the poorest of the top-10 systems, and hence it achieved a rank of eight. In general, the ER and F-scores of event detection are correlated, and hence all the submitted systems are observed along the diagonal. This diagonal behavior is not observed with the localization metrics, as LE and ECR are not directly, or only weakly, correlated.

All systems had at least one deep learning component in their approach. Specifically, apart from [33] and [35], which employed a CNN architecture with no recurrent layers, the remaining 20 systems employed different versions of the baseline CRNN architecture as one of their components. Three of the submitted systems employed a parametric DoA estimation approach [20], [29], [32] along with CRNN-based classification. The best parametric DoA approach [20] achieved the 6th position. Among the DNN-based SELD methods, nine of them employed multi-task learning [44] for joint
TABLE I. Challenge results of submitted systems (Rank, System, ER, F1, LE, ECR, with 95% confidence intervals). The rank is based on the cumulative rank over the four calculated metrics. Best system per team according to the official challenge ranking. Best score indicated for the separate metrics.

TABLE II. Summary of submitted systems. The rank is based on the cumulative rank over the four calculated metrics. Best system per team according to the official challenge ranking.

System | Audio | Features | Classifier | Multi-task
1 Kapka_SRPOL_2 [15] | AMB | Phase and magnitude spectra | CRNN | ×
2–9 | …
10 Park_ETRI_1 [24] | Both | Log-mel and intensity vectors | CRNN, TrellisNet | ✓
11 Leung_DBS_2 [25] | AMB | Log-magnitude, phase, and cross spectra | CRNN ensemble | ✓
12 Grondin_MIT_1 [26] | MIC | Phase and magnitude spectra, GCC and TDOA | CRNN ensemble | ×
13 ZhaoLu_UESTC_1 [27] | MIC | Log-mel spectra | CRNN | ✓
14 Rough_EMED_2 [28] | MIC | Phase and magnitude spectra | CRNN | ×
15 Tan_NTU_1 [29] | MIC | Log-mel spectra and GCC | ResNet RNN, parametric DoA | ×
16 Cordourier_IL_2 [30] | MIC | Phase and magnitude spectra, and GCC | CRNN ensemble | ✓
17 Krause_AGH_4 [31] | AMB | Phase and magnitude spectra | CRNN ensemble | ✓
18 Adavanne_TAU_FOA [14] | AMB | Phase and magnitude spectra | CRNN | ✓
19 Perezlopez_UPF_1 [32] | AMB | Log-mel spectra | CRNN, parametric DoA | ×
20 Chytas_UTH_1 [33] | MIC | Raw audio and power spectra | CNN ensemble | ×
21 Anemueller_UOL_3 [34] | AMB | Group-delay and magnitude spectra | CRNN | ✓
22 Kong_SURREY_1 [35] | AMB | Magnitude spectra | CNN | ✓
23 Lin_YYZN_1 [36] | AMB | Phase and magnitude spectra | CRNN | ✓
SED and DoA estimation. The remaining systems, including the top-ranked system [15], employed separate networks for SED and DoA estimation, and performed engineered data association of their respective outputs. Finally, there was no significant improvement in SELD performance with the choice of either of the two audio formats in the dataset. Among the top-10 ranked systems, four used the microphone array format, three used the Ambisonic format, and the rest used both formats as input.
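As a toy sketch of such joining of outputs, the snippet below gates per-class DoA regressions with the SED activity, in the spirit of the SELDnet inference described in Section I; the class list, probabilities, and the 0.5 threshold are illustrative assumptions, not any particular submission's settings:

```python
# Toy frame-wise gating of per-class DoA regressions by SED activity.
# sed[t][c]: activity probability of class c at frame t;
# doa[t][c]: (azimuth, elevation) regressed for class c at frame t.
SED_THRESHOLD = 0.5   # illustrative value

def gate_outputs(sed, doa, classes, threshold=SED_THRESHOLD):
    """Return per-frame lists of (class, (az, el)) events: the DoA
    output of a class is kept only when its detection is active."""
    events = []
    for sed_t, doa_t in zip(sed, doa):
        frame = [(classes[c], doa_t[c])
                 for c, p in enumerate(sed_t) if p >= threshold]
        events.append(frame)
    return events

classes = ["speech", "dog"]
sed = [[0.9, 0.2],   # frame 0: only "speech" active
       [0.7, 0.8]]   # frame 1: both classes active
doa = [[(30.0, 0.0), (120.0, -10.0)],
       [(32.0, 0.0), (118.0, -10.0)]]

for frame in gate_outputs(sed, doa, classes):
    print(frame)
# frame 0 -> [('speech', (30.0, 0.0))]
# frame 1 -> [('speech', (32.0, 0.0)), ('dog', (118.0, -10.0))]
```

Submissions with separate SED and DoA networks face the same association step, but typically with heuristics that also have to match event counts when the two networks disagree.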
B. Analysis of individual systems
The system characteristics of all the submissions are summarized in Table II. A more detailed analysis of some of the systems follows, along with a summary of the most prominent architectural, input feature, and training characteristics.

Kapka & Lewandowski (Kapka_SRPOL) [15] was the top performing system of the challenge, with very high performance in both localization and detection. There was minimal feature engineering, and the pure magnitude and phase spectrograms of the FOA format were used as input. However,

Fig. 4. Top-10 systems: (a) detection metrics (error rate vs. F1-score); (b) localization metrics.