Event-Independent Network for Polyphonic Sound Event Localization and Detection
Yin Cao, Turab Iqbal, Qiuqiang Kong, Yue Zhong, Wenwu Wang, Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
{yin.cao, t.iqbal, y.zhong, w.wang, m.plumbley}@surrey.ac.uk
ByteDance, Shanghai, [email protected]
ABSTRACT
Polyphonic sound event localization and detection involves not only detecting what sound events are happening but also localizing the corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduces additional challenges of moving sound sources and overlapping-event cases, which include two events of the same type with two different direction-of-arrival (DoA) angles. In this paper, a novel event-independent network for polyphonic sound event localization and detection is proposed. Unlike the two-stage method we proposed for DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are fed into a 1-D convolutional layer to extract acoustic features. The network is then split into two parallel branches: the first branch is for sound event detection (SED), and the second branch is for DoA estimation. The network produces three types of predictions: SED predictions, DoA predictions, and event activity detection (EAD) predictions, the latter combining the SED and DoA features for onset and offset estimation. All of these predictions have the format of two tracks, indicating that there are at most two overlapping events; within each track, there can be at most one event happening. This architecture introduces a problem of track permutation. To address this problem, a frame-level permutation invariant training method is used. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. Its performance on the Task 3 dataset is greatly increased compared with that of the baseline method.
Index Terms — Sound event localization and detection, direction of arrival, event-independent, permutation invariant training.
1. INTRODUCTION
Sound event localization and detection (SELD) has become an increasingly popular research topic since DCASE 2019. It detects the types of sound events and localizes the corresponding sound sources. This year, DCASE 2020 Task 3 [1–4] introduces additional challenges of moving sources and polyphonic cases that include the same class of event but with different directions of arrival (DoAs).

For DCASE 2019 Task 3, we introduced a two-stage method for polyphonic SELD [5]. Although we obtained a good ranking, the method was not designed as a true polyphonic localization method, since it lacks the ability to detect sound events of the same type but with different DoAs. Besides, it is not an elegant end-to-end system, as it needs to be trained in two steps.

In this paper, we propose a redesigned event-independent end-to-end system for polyphonic SELD. It is designed for overlapping-event cases, including the presence of the same type of event with different DoAs. It is also convenient to extend the system to the case of more than two overlapping events. The source code is released on GitHub (https://github.com/yinkalario/EIN-SELD). Our contributions are three-fold. 1) The proposed system predicts overlapping events using track-wise outputs, that is, it predicts an event and the corresponding DoA for each track; within each track there may be at most one event and its corresponding DoA. 2) Frame-level permutation invariant training is adopted to solve the track permutation problem. 3) An event activity detection (EAD) prediction is added, which combines the sound event detection (SED) and DoA estimation feature embeddings to predict the onset and offset times more accurately.
Figure 1: Illustration of the track permutation problem. Numbers denote different groups of labels.

We adopt track-wise prediction to tackle overlapping-event scenarios. The number of tracks needs to be pre-determined according to the maximum number of overlapping events. These tracks are event-independent, which means the prediction on each track can be of any type of event. They can even be the same type of event, which indicates that two same-type events with different DoAs are predicted. It is therefore reasonable to treat these tracks as event-independent. Consider the polyphonic prediction case illustrated in Fig. 1. The network has one prediction for each track, within which there can be at most one event and its corresponding DoA. There are three groups of labels, all of which are potentially two-event overlapping cases. Assume that, for the first group, the "speech" label and the "car" label are tied to tracks 1 and 2, respectively. For the second group, it is reasonable to still assign the "speech" label to track 1 and the new "dog bark" label to track 2. However, for the third group of labels, it is hard to decide to which track the "dog bark" or the "car" label should be assigned. In other words, track permutation problems emerge if track-wise predictions are used.

To address the track permutation problem, frame-level permutation invariant training (denoted as tPIT) is used. Frame-level permutation invariant training was first proposed for speaker-independent source separation [6, 7]. For our SELD problem, tPIT is implemented by examining all possible track permutations in each frame during training; the lowest frame-level loss among these permutations is then selected for back-propagation to train the model. In this way, the optimal local assignment of track-event pairs can be reached, leading to excellent frame-wise SED and DoA prediction performance.

In order to estimate the frame-level information more accurately, the features from both the SED and DoA branches are combined to predict event activities. The aim of this EAD is to constrain the detection of the existence of events (or DoAs) using features from both the SED and DoA branches rather than from a single branch. That means SED and DoA predictions have a mutual dependence rather than a one-way dependence. With the proposed system, experimental results show that the performance is greatly increased compared with that of the baseline system.

The rest of the paper is arranged as follows. In Section 3, the proposed learning method is described in detail, including features, network architecture and permutation invariant training. Experimental results and discussions are given in Section 4. Finally, conclusions are summarized in Section 5.
2. RELATED WORKS

2.1. Sound Event Localization and Detection
Sound event localization and detection is a novel topic with wide applications [8]. In DCASE 2019 Task 3 [2, 9], the TAU Spatial Sound Events 2019 dataset was released [10], and several innovative papers were based on that dataset. Mazzon et al. proposed a spatial-augmentation method which rotates and reflects sources so that unseen DoA data and corresponding labels are produced [11]. Grondin et al. used a CRNN on pairs of microphones to localize and detect sound events [12]. Sundar et al. proposed an encoding scheme to represent the spatial coordinates of multiple sources [13]. Although the two-stage method we proposed last year [5] achieved the second-best performance in DCASE 2019 Task 3, it is not an elegant end-to-end system and is not designed for overlapping events of the same type with different DoAs. A similar idea of track-wise prediction was later proposed by Nguyen et al. [14]. However, their system does not establish a strong bond between the SED and DoA predictions: they assume that the track prediction with the highest SED probability corresponds to the track with the highest DoA probability. It will be shown in the following sections that the system proposed here is more complete in this sense.
2.2. Permutation Invariant Training

Permutation invariant training (PIT) was first proposed to tackle the problem of speaker-independent multi-talker speech separation [6], commonly known as the cocktail-party problem. PIT combines label assignment and loss minimization, and can be implemented inside the network structure. It first assigns the best prediction-target pairs according to which assignment gives the smallest total loss, and then minimizes the loss given that assignment. PIT was later extended to frame-level and utterance-level PIT [7, 15–17], which have been utilized in a range of speaker-independent speech separation studies [18–22]. In our proposed method, a track-wise output format is used. Tracks are event-independent, that is, tracks can predict any type of event. This generates a problem of track permutation. It will be shown that, by adopting an idea similar to frame-level PIT, the track permutation problem can be solved well.
3. THE METHOD
The proposed method is described in this section. The features used are the log-mel spectrogram and the intensity vector, which are calculated inside the network using a 1-D convolutional layer. The network architecture and permutation invariant training are then introduced in detail.
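As an illustration of computing such features with a 1-D convolutional layer rather than off-line, the sketch below (PyTorch) builds a Conv1d whose fixed kernels are windowed Fourier bases, so that a forward pass yields the real and imaginary STFT channels. The window and hop sizes are placeholders, not the values used in the paper, and this is a sketch of the general idea rather than the authors' implementation.

```python
import numpy as np
import torch

def stft_as_conv1d(n_fft=1024, hop=512):
    """A Conv1d with fixed, windowed Fourier-basis kernels: its output channels are the
    real and imaginary parts of the STFT, computed inside the network.
    n_fft and hop are placeholder values for illustration only."""
    window = np.hanning(n_fft)
    freqs = np.arange(n_fft // 2 + 1)                      # one-sided spectrum
    n = np.arange(n_fft)
    phase = 2.0 * np.pi * np.outer(freqs, n) / n_fft       # (n_bins, n_fft)
    kernels = np.concatenate([np.cos(phase), -np.sin(phase)]) * window
    conv = torch.nn.Conv1d(1, 2 * len(freqs), kernel_size=n_fft, stride=hop, bias=False)
    conv.weight.data = torch.tensor(kernels[:, None, :], dtype=torch.float32)
    conv.weight.requires_grad = False                       # fixed, non-trainable feature extractor
    return conv

# Example usage with a mono waveform tensor `wav` of shape (batch, 1, samples):
#   out = stft_as_conv1d()(wav)          # (batch, 2 * n_bins, frames)
#   real, imag = out.chunk(2, dim=1)
#   power = real ** 2 + imag ** 2        # could then be projected onto mel bands
```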
3.1. Features

In this paper, a log-mel spectrogram feature is used for SED, while an intensity vector from first-order Ambisonics (FOA), expressed in the log-mel space, is used for DoA estimation. These features are calculated directly with a 1-D convolutional layer instead of being pre-computed off-line.

FOA, also known as B-format, includes four channels of signals, $w$, $x$, $y$ and $z$. These four channels correspond to the omni-directional, $x$-directional, $y$-directional and $z$-directional components, respectively. The instantaneous sound intensity vector can be expressed as $\mathbf{I} = p\,\mathbf{v}$, where $p$ is the sound pressure, which can be obtained from $w$, and $\mathbf{v} = (v_x, v_y, v_z)^T$ is the particle velocity vector, which can be estimated using $x$, $y$ and $z$. An intensity vector carries the information of the acoustic energy direction of a sound wave; its inverse direction can be interpreted as the DoA, hence the FOA-based intensity vector can be directly utilized for DoA estimation [23–26]. In order to make the intensity vector have the same size as the log-mel spectrogram, it is calculated in the STFT domain and projected onto the mel space as

$$\mathbf{I}(f, t) = \frac{1}{\rho c}\,\Re\left\{W^{*}(f, t)\cdot\begin{bmatrix} X(f, t) \\ Y(f, t) \\ Z(f, t)\end{bmatrix}\right\}, \qquad \mathbf{I}^{\mathrm{norm}}_{\mathrm{mel}}(k, t) = -\sum_{f} H_{\mathrm{mel}}(k, f)\,\frac{\mathbf{I}(f, t)}{\left\|\mathbf{I}(f, t)\right\|_2}, \tag{1}$$

where $\rho$ and $c$ are the density of air and the speed of sound, $W$, $X$, $Y$, $Z$ are the STFTs of $w$, $x$, $y$, $z$, respectively, $\Re\{\cdot\}$ denotes the real part, $*$ denotes the complex conjugate, $\|\cdot\|_2$ is the $\ell_2$ norm of a vector, $k$ is the index of the mel bins, and $H_{\mathrm{mel}}$ is the mel filter bank. In this paper, the three components of the intensity vector are taken as three additional input channels for the network.
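A minimal NumPy sketch of Eq. (1) is given below. It assumes the four FOA STFTs `W`, `X`, `Y`, `Z` are already available as complex arrays of shape (freq, time) and that `H_mel` is a mel filter-bank matrix of shape (mel, freq); names, default constants and the small epsilon are illustrative and not taken from the released code.

```python
import numpy as np

def foa_intensity_vector(W, X, Y, Z, H_mel, rho=1.225, c=343.0):
    """Mel-space, normalized acoustic intensity vector from FOA STFTs, following Eq. (1).
    W, X, Y, Z: complex STFTs of shape (freq, time); H_mel: mel filter bank (mel, freq)."""
    # Instantaneous intensity in the STFT domain, shape (3, freq, time)
    I = np.real(np.conj(W)[None, ...] * np.stack([X, Y, Z])) / (rho * c)
    # Keep only the direction: normalize each time-frequency bin to unit length
    I_unit = I / (np.linalg.norm(I, axis=0, keepdims=True) + 1e-8)
    # Project onto mel bands; the minus sign flips the energy-flow direction to the DoA
    return -np.einsum('kf,cft->ckt', H_mel, I_unit)   # shape (3, mel, time)
```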
3.2. Network Architecture

The proposed event-independent network uses track-wise outputs. For track-wise predictions, the number of tracks needs to be pre-determined according to the maximum number of overlapping events. These tracks are event-independent, which means the prediction on each track can be of any type of event.

The network has two branches of feature embeddings, the SED branch and the DoA branch. Its architecture is shown in Fig. 2.

Figure 2: Network architecture. $N_{\mathrm{frames}}$ is the number of frames and $N_{\mathrm{cla}}$ is the number of event classes. In the SED branch, there is one additional class of event, which is silence. The EAD mask block is described in Eq. (4).

FOA time-domain signals are used as inputs and are first fed into the two branches. In both the SED and the DoA branches, a 1-D convolutional layer is first used to extract log-mel spectrograms and intensity vectors, which are then normalized by batch-normalization layers. For the SED feature embedding, four groups of convolutional blocks are used. Each convolutional block contains two 2-D convolutional layers with a kernel size of 3×3, a batch-normalization layer, and an average-pooling layer. For the DoA feature embedding, a revised ResNet18 with two 3×3 2-D convolutional layers as the stem layer is used. The size of the feature maps is 512 in the last layer for both the SED and DoA embeddings.
These two branches are then used to generate three predictions: SED predictions, EAD predictions, and DoA predictions. For SED and DoA predictions, the respective feature embeddings are fed into a two-layer bidirectional GRU followed by a fully-connected layer. For EAD predictions, the SED and DoA embeddings are combined and fed into a similar GRU and fully-connected layer. Outputs of the network have a track-wise format: each track has at most one SED, one EAD, and one DoA prediction.

For SED, predictions have $N_{cla} + 1$ types of events, where $N_{cla}$ is the number of event classes; the additional class indicates silence (no event is happening). A softmax activation function is used on each track for SED. The corresponding loss for SED predictions can be written as

$$\ell^{SED}_t(n_{track}) = -\log\left[\frac{\exp\left(y^{SED}_t[n_{track}, \mathrm{class}_t]\right)}{\sum_{j \in J}\exp\left(y^{SED}_t[n_{track}, j]\right)}\right], \qquad L^{SED} = \sum_{n \in N_{track},\, t}\ell^{SED}_t(n), \tag{2}$$

where $\ell^{SED}_t$ is the track-wise loss and $L^{SED}$ is the total SED loss for updating the model. $y^{SED}$ denotes the output logits of the SED fully-connected layer, $t$ indicates the frame, $n_{track}$ is the track index, $\mathrm{class}_t$ is the ground-truth target at frame $t$, $N_{track}$ is the number of tracks, and $J$ is the class set. It is a multi-class single-label problem for each track and a multi-class multi-label problem over all of the tracks.

EAD predictions combine the SED and DoA embeddings, and binary cross-entropy is used as the loss. The EAD predictions are important for three reasons. First, they use the SED and DoA feature embeddings together to predict onset and offset information. In this way, the frame-level information does not depend solely on a single branch; label information from both branches can be utilized, and the EAD loss can be back-propagated to affect both branches of feature embeddings. Second, they constrain the SED and DoA feature embeddings to a unified track binding so that tracks do not permute across branches; that is, track 1 in the SED prediction can always be tied to track 1 in the EAD and DoA predictions. Third, EAD predictions are used to mask out DoA predictions.

For DoA, predictions for each track contain azimuth and elevation angles, and a linear activation function is used. Under these circumstances, when there is no event happening, the ground-truth DoAs are not meaningful, so it is reasonable to use a mask to shield those invalid frames. During training, the ground-truth EAD labels are used as the mask to retain only those frames with actual events happening, whereas during testing, the intersection of the SED and EAD masks is used. The loss for DoA predictions can be written as

$$\ell^{DoA}_t(n_{track}) = \frac{1}{2}\sum_{azi,\,elev}\left\{\left\|y^{DoA}_t - \hat{y}^{DoA}_t\right\|_p \cdot M^{EAD}_t(n_{track})\right\}, \qquad L^{DoA} = \frac{1}{\sum_{n \in N_{track},\, t} M^{EAD}_t(n)}\sum_{n \in N_{track},\, t}\ell^{DoA}_t(n), \tag{3}$$

where

$$M^{EAD}_t = \begin{cases}\hat{y}^{EAD}_t & \text{for training},\\ \left[y^{EAD}_t > \tau_{EAD}\right] \cap \left[y^{SED}_t = \max_J y^{SED}_t\right]\left[0 : N_{cla}\right] & \text{for testing},\end{cases} \tag{4}$$

is the mask for DoA predictions. $\ell^{DoA}_t$ and $L^{DoA}$ are defined analogously to the SED losses. $\hat{y}^{DoA}_t$ is the DoA ground truth, $\|\cdot\|_p$ is the $p$-norm, and $\hat{y}^{EAD}_t$ is the EAD ground truth.
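A minimal PyTorch sketch of Eqs. (2)–(4) is given below. Tensor shapes and names are assumptions made for illustration: `sed_logits` is (batch, frames, tracks, N_cla+1), `doa_pred`/`doa_gt` are (batch, frames, tracks, 2) azimuth/elevation values, `ead_prob`/`ead_mask` are (batch, frames, tracks), the silence class is taken to be the last SED index, `p` is fixed to 1 as one possible choice, and `tau_ead` is a placeholder threshold. This is not the released implementation.

```python
import torch
import torch.nn.functional as F

def sed_loss(sed_logits, sed_target):
    """Eq. (2): softmax cross-entropy per track over N_cla + 1 classes (the extra class is silence)."""
    n_cls = sed_logits.shape[-1]
    return F.cross_entropy(sed_logits.reshape(-1, n_cls), sed_target.reshape(-1))

def doa_loss(doa_pred, doa_gt, ead_mask):
    """Eq. (3): regression loss on (azimuth, elevation), masked by event activity.
    The p-norm of Eq. (3) is taken here with p = 1 (mean absolute error) as one possible choice."""
    err = 0.5 * torch.abs(doa_pred - doa_gt).sum(dim=-1)        # (batch, frames, tracks)
    return (err * ead_mask).sum() / ead_mask.sum().clamp(min=1.0)

def test_time_mask(ead_prob, sed_logits, tau_ead=0.5):
    """Eq. (4): at test time a track is active only if its EAD probability exceeds a threshold
    AND its SED argmax is not the silence class. tau_ead is a placeholder value."""
    not_silence = sed_logits.argmax(dim=-1) < (sed_logits.shape[-1] - 1)
    return (ead_prob > tau_ead) & not_silence
```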
$[C]$ is the binarization function, where $[C]$ is 1 when $C$ is true and 0 when $C$ is false. $\tau_{EAD}$ is the threshold for EAD, $J$ is the class set, and $[0 : N_{cla}]$ means taking the first $N_{cla}$ items.

3.3. Permutation Invariant Training

After the track-wise predictions are obtained, frame-level permutation invariant training (tPIT) is adopted to tackle the track permutation problem. In Fig. 2, the tPIT contains the Pairwise-Loss-Alignment block, which assigns labels to different tracks to constitute all possible combinations of prediction-label pairs. The total loss for each combination is then calculated, and the lowest total loss is selected as the actual loss for back-propagation. Assume that all possible combinations of prediction-label pairs constitute a permutation set $P$, and that $\alpha(t) \in P$ is one of the possible permutation pairs at frame $t$. The tPIT loss can be written as

$$L^{tPIT}_t = \min_{\alpha(t) \in P}\sum_{N_{track}}\left\{\ell^{SED}_{t,\alpha(t)} + \ell^{EAD}_{t,\alpha(t)} + \ell^{DoA}_{t,\alpha(t)}\right\}, \tag{5}$$

where $\ell^{SED}_{t,\alpha(t)}$ and $\ell^{DoA}_{t,\alpha(t)}$ are defined in Eq. (2) and Eq. (3), respectively, and $\ell^{EAD}_{t,\alpha(t)}$ is the binary cross-entropy loss.

The process of tPIT therefore not only performs the classification or regression training but also pairs the most probable predictions and labels inside the network. In this way, the event-independent track permutation problem can be elegantly solved.
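Eq. (5) can be sketched as follows (PyTorch). Here `per_frame_loss` is assumed to return a per-frame loss for a single track that already sums the SED, EAD and DoA terms, and `preds`/`labels` are per-track lists; names are illustrative. Iterating over `itertools.permutations` keeps the sketch valid for more than two tracks, matching the claim that the system extends beyond two overlapping events.

```python
import itertools
import torch

def tpit_loss(per_frame_loss, preds, labels, n_track=2):
    """Eq. (5): frame-level permutation invariant training loss.
    per_frame_loss(pred, label) -> (batch, frames) loss for a single track."""
    perm_losses = []
    for perm in itertools.permutations(range(n_track)):
        # Total per-frame loss of this particular assignment of labels to tracks
        perm_losses.append(sum(per_frame_loss(preds[i], labels[j]) for i, j in enumerate(perm)))
    # For every frame independently, keep the permutation with the lowest total loss
    return torch.stack(perm_losses, dim=0).min(dim=0).values.mean()
```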
4. EXPERIMENTS
In this section, the experimental results using the proposed method on the TAU-NIGENS Spatial Sound Events 2020 dataset are described.
The dataset contains 700 sound event samples spread over 14 classes. Within the dataset, realistic spatialization and reverberation are obtained through RIRs collected in 15 different enclosures, with roughly 1500 to 3500 possible RIR positions across the different rooms. The dataset contains both static reverberant and moving reverberant sound events. More information can be found in [1, 4].

The evaluation metrics used consider the joint nature of localization and detection [3]. There are two metrics for SED, the F-score ($F_{\leq T^{\circ}}$) and the Error Rate ($ER_{\leq T^{\circ}}$), which count true positives predicted under a distance threshold $T = 20^{\circ}$ from the reference. For the localization part, there are two further metrics which are classification-dependent. The first is a localization error $LE_{CD}$, expressing the average angular distance between predictions and references of the same class. The second is a localization recall $LR_{CD}$, expressing the true positive rate of how many of the localization estimates were detected in a class, out of the total class instances.
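Both the 20° threshold behind $F_{\leq T^{\circ}}$ / $ER_{\leq T^{\circ}}$ and the localization error $LE_{CD}$ rest on the angular distance between a predicted and a reference DoA. The sketch below gives that great-circle distance from azimuth/elevation pairs in degrees; it is a generic formula for illustration, not the official evaluation code of [3].

```python
import numpy as np

def angular_distance_deg(azi_pred, ele_pred, azi_ref, ele_ref):
    """Great-circle angle (degrees) between two directions given as azimuth/elevation in degrees."""
    a1, e1, a2, e2 = np.radians([azi_pred, ele_pred, azi_ref, ele_ref])
    # Spherical law of cosines between the two unit direction vectors
    cos_angle = np.sin(e1) * np.sin(e2) + np.cos(e1) * np.cos(e2) * np.cos(a1 - a2)
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# e.g. a detection only counts as a true positive for F<=20deg / ER<=20deg
# if this distance between prediction and reference is below 20 degrees.
```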
To generate the weights of the 1-D convolutional layer for feature extraction, a -point Hanning window with a hop size of points is used. Audio clips are segmented to have a fixed length of seconds with 80% overlap for training. The learning rate is set to for the first epochs and is adjusted to for each epoch that follows. The final results are calculated after epochs. A threshold of is used to binarize EAD predictions.

In order to assess the performance of the proposed system, an ablation study is performed. Several systems are compared, including:
• Baseline-FOA: the baseline method using Ambisonics.
• Baseline-Mics: the baseline method using the microphone array.
• Track-Wise 1: track-wise output without EAD and tPIT.
• Track-Wise 2: track-wise output with EAD but without tPIT. SED predictions are used as the mask to filter out active tracks.
• Track-Wise 3: track-wise output with EAD but without tPIT. SED and EAD predictions are used together as the mask to filter out active tracks according to Eq. (4).
• Event-Ind: the proposed event-independent system with EAD and tPIT. SED and EAD predictions are used together as the mask.

All of the experimental results were evaluated on fold and were averaged over 5 different trials. Results for comparison are shown in Fig. 3. It can be seen that all of the proposed methods are better than the baselines and the two-stage method [5]. A simple comparison for the ablation study shows that, without EAD and tPIT, the "Track-Wise 1" method is the worst among the proposed methods. There seems to be an exception for $LE_{CD}$, where the "Track-Wise 1" method is even lower than "Track-Wise 2" and "Track-Wise 3". This is because, during the experiments, it was found that there is a trade-off between $LR_{CD}$ and $LE_{CD}$: when $LR_{CD}$ increases, $LE_{CD}$ gets worse. This is probably because when more DoAs are detected, a higher number of wrong frame-wise DoA angles are predicted on average, hence $LE_{CD}$ gets worse. "Track-Wise 3" is slightly better than "Track-Wise 2", which indicates that the mask using SED and EAD predictions together is more effective than using SED predictions alone. This additionally demonstrates the significance of EAD.

Figure 3: Comparison of different methods.

The "Event-Ind" method achieves the best performance, which means the additional EAD and tPIT features both contribute to increasing the performance. The additional EAD prediction can constrain and unify predictions from the SED and DoA branches, both in terms of temporal information and track binding. The tPIT can rectify incorrect label assignments and greatly increases the performance.
5. CONCLUSION
We proposed a new end-to-end event-independent network for polyphonic sound event localization and detection. The network treats polyphonic cases as multi-track problems, with each track having at most one event and its corresponding direction of arrival. In order to solve the problem of track permutation, a frame-level permutation invariant training strategy is employed. The network outputs three predictions: sound event detection, event activity detection, and direction of arrival. Event activity detection combines the feature embedding information from both SED and DoA, and is hence able to predict onset and offset times of events more accurately. The proposed system is easy to extend to cases with more than two overlapping events. Experimental results show that the proposed system outperforms the baseline methods by a large margin.
6. ACKNOWLEDGMENT
This work was supported in part by EPSRC grants EP/P022529/1, EP/N014111/1 "Making Sense of Sounds", EP/T019751/1 "AI for Sound", National Natural Science Foundation of China (Grant No. 11804365), EPSRC grant EP/N509772/1 "DTP 2016-2017 University of Surrey", and the China Scholarship Council (No. 201906470002).

7. REFERENCES

[1] A. Politis, S. Adavanne, and T. Virtanen, "A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection," arXiv e-prints: 2006.01919, 2020.
[2] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 34–48, 2018.
[3] A. Mesaros, S. Adavanne, A. Politis, T. Heittola, and T. Virtanen, "Joint measurement of localization and detection of sound events," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), NY, 2019.
[4] "DCASE 2020 task 3: Sound event localization and detection," 2020. [Online]. Available: http://dcase.community/challenge2020/task-sound-event-localization-and-detection
[5] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, "Polyphonic sound event detection and localization using a two-stage strategy," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), NY, 2019, pp. 30–34.
[6] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in . IEEE, 2017, pp. 241–245.
[7] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
[8] T. Virtanen, M. D. Plumbley, and D. Ellis, Computational Analysis of Sound Scenes and Events. Springer, 2018, pp. 3–12.
[9] S. Adavanne, A. Politis, and T. Virtanen, "A multi-room reverberant dataset for sound event localization and detection," in Submitted to Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
[10] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[11] L. Mazzon, Y. Koizumi, M. Yasuda, and N. Harada, "First order ambisonics domain spatial augmentation for DNN-based direction of arrival estimation," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), NY, 2019, pp. 154–158.
[12] F. Grondin, I. Sobieraj, M. D. Plumbley, and J. Glass, "Sound event localization and detection using CRNN on pairs of microphones," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), NY, October 2019, pp. 84–88.
[13] H. Sundar, W. Wang, M. Sun, and C. Wang, "Raw waveform based end-to-end deep convolutional network for spatial localization of multiple acoustic sources," in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 4642–4646.
[14] T. N. T. Nguyen, D. L. Jones, and W.-S. Gan, "A sequence matching network for polyphonic sound event localization and detection," in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 71–75.
[15] Z. Wang, J. Le Roux, and J. R. Hershey, "Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation," in . IEEE, 2018, pp. 1–5.
[16] C. Xu, W. Rao, X. Xiao, E. S. Chng, and H. Li, "Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM," in . IEEE, 2018, pp. 6–10.
[17] C. Fan, B. Liu, J. Tao, Z. Wen, J. Yi, and Y. Bai, "Utterance-level permutation invariant training with discriminative learning for single channel speech separation," in . IEEE, 2018, pp. 26–30.
[18] Y. Liu and D. Wang, "Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2092–2102, 2019.
[19] Y. Luo, Z. Chen, and N. Mesgarani, "Speaker-independent speech separation with deep attractor network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, 2018.
[20] Y. Liu, M. Delfarah, and D. Wang, "Deep CASA for talker-independent monaural speech separation," in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6354–6358.
[21] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," arXiv preprint arXiv:1804.03619, 2018.
[22] Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in . IEEE, 2017, pp. 246–250.
[23] J. Kotus, K. Lopatka, and A. Czyzewski, "Detection and localization of selected acoustic events in acoustic field for smart surveillance applications," Multimedia Tools and Applications, vol. 68, no. 1, pp. 5–21, 2014.
[24] J. Ahonen, V. Pulkki, and T. Lokki, "Teleconference application and B-format microphone array for directional audio coding," in Audio Engineering Society Conference: 30th International Conference: Intelligent Audio Environments. Audio Engineering Society, 2007.
[25] Y. Cao, T. Iqbal, Q. Kong, M. Galindo, W. Wang, and M. D. Plumbley, "Two-stage sound event localization and detection using intensity vector and generalized cross-correlation," DCASE2019 Challenge, Tech. Rep., 2019.
[26] L. Perotin, R. Serizel, E. Vincent, and A. Guérin, "CRNN-based multiple DoA estimation using acoustic intensity features for Ambisonics recordings,"