FCN Approach for Dynamically Locating Multiple Speakers

Hodaya Hammer, Shlomo E. Chazan, Jacob Goldberger, Sharon Gannot
Department of Electrical Engineering, Bar-Ilan University, Ramat-Gan 5290002, Israel
Abstract
In this paper, we present a deep neural network-based online multi-speaker localization algorithm. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is dominated by a single speaker and, hence, by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. This high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An extensive experimental study, using both simulated and real-life recordings in static and dynamic scenarios, confirms that the proposed algorithm outperforms both classic and recent deep-learning-based algorithms.
1 Introduction

Locating multiple sound sources recorded with a microphone array in an acoustic environment is an essential component in various applications, such as source separation and scene analysis. The relative location of a sound source with respect to a microphone array is generally given in terms of the DOA of the sound wave originating from that location. DOA estimation and tracking have recently been generating interest, due to the need for far-field enhancement and recognition in smart-home devices. In real-life environments, sound sources are captured by the microphones together with acoustic reverberation. While propagating in an acoustic enclosure, the sound wave undergoes reflections from the room facets and from various objects. These reflections deteriorate speech quality and, in extreme cases, its intelligibility. Furthermore, reverberation increases the time dependency between speech frames, making source DOA estimation a very challenging task.

A plethora of classic signal processing-based approaches have been proposed throughout the years for the task of broadband DOA estimation. The multiple signal classification (MUSIC) algorithm [19] applies a subspace method that was later adapted to the challenges of speech processing in [7]. The steered response power with phase transform (SRP-PHAT) algorithm [2] uses generalizations of cross-correlation methods for DOA estimation. These methods are still widely in use; however, in highly reverberant enclosures, their performance is not satisfactory.
Supervised learning methods have an advantage for this task, since they are data-driven. Deep-learning methods can be trained to find the DOA under different acoustic conditions. Moreover, if a network is trained using rooms with different acoustic conditions and multiple noise types, it can be made robust against noise and reverberation, even for rooms that were not in the training set. Deep learning methods have recently been proposed for sound source localization. In [26, 23], simple feed-forward deep neural networks (DNNs) were trained using generalized cross-correlation (GCC)-based audio features, demonstrating improved performance compared with classic approaches. Yet, these methods are mainly designed to deal with a single sound source at a time. In [21], the authors trained a DNN for multi-speaker DOA estimation; in high reverberation conditions, however, its performance is not satisfactory. In [16, 22], time-domain features were used and shown to improve performance in highly reverberant enclosures. In [3], a convolutional neural network (CNN)-based classification method was applied in the short-time Fourier transform (STFT) domain for broadband DOA estimation, assuming that only a single speaker is active per time frame. The phase component of the STFT coefficients of the input signal was directly provided as input to the CNN. This work was extended in [4] to estimate the DOAs of multiple speakers, and has shown high DOA classification performance. In this approach, the DOA is estimated for each frame independently. The main drawback of most DNN-based approaches, however, is that they only use low-resolution supervision, namely time-frame or even utterance-based labels. In speech signals, however, each time-frequency bin is dominated by a single speaker, a property referred to as W-disjoint orthogonality (WDO) [17]. Adopting this model results in higher resolution, which might be beneficial for the task at hand. This model was also utilized in [5] for speech separation, where the authors recast the separation problem as a DOA classification in the TF domain. A fully convolutional network (FCN) was trained using spatial features to infer the DOA at every TF bin. Although the DOA resolution was relatively low, it was sufficient for the separation task in low reverberation conditions. When applying this method in high-reverberation enclosures, or to separate adjacent speakers, a performance degradation was observed.

In this work, we present a multi-speaker DOA estimation algorithm. According to the WDO property of speech signals [17, 27], each TF bin is dominated by (at most) a single speaker. Each TF bin can therefore be associated with a single DOA. We use instantaneous spatial cues from the microphone signals as features to train an FCN that infers the DOA of each TF bin. The FCN is trained to address various reverberation conditions. The TF-based classification facilitates the tracking of multiple moving speakers. In addition, unlike many other supervised domains, the DOA domain lacks a standard benchmark. The LOCATA dataset [9] was recorded in a single room with a relatively low reverberation level. Furthermore, a training dataset with high-resolution TF labels is not publicly available. We therefore generated training and test datasets simulating various real-life scenarios.
We tested the proposed method on simulated data, using publicly available room impulse responses (RIRs) recorded in a real room [11], as well as in real-life experiments. We show that the proposed algorithm significantly outperforms state-of-the-art competing methods. The main contribution of this paper is a high-resolution TF-based approach that improves DOA estimation performance with respect to (w.r.t.) the state-of-the-art (SOTA) frame-based approaches, and enables the simultaneous tracking of multiple moving speakers.

2 Problem Formulation

Consider an array with $M$ microphones acquiring a mixture of $N$ speech sources in a reverberant environment. The $i$-th speech signal $s_i(t)$ propagates through the acoustic channel before being acquired by the $m$-th microphone:
$$z_m(t) = \sum_{i=1}^{N} s_i(t) * h_{im}(t), \quad m = 1, \ldots, M, \quad (1)$$
where $h_{im}$ is the RIR relating the $i$-th speaker and the $m$-th microphone. In the STFT domain, (1) can be written as (provided that the frame length is sufficiently large w.r.t. the filter length):
$$z_m(l,k) = \sum_{i=1}^{N} s_i(l,k)\, h_{im}(l,k), \quad (2)$$
where $l$ and $k$ are the time-frame and frequency indices, respectively.

2.1 Feature extraction

The STFT (2) is complex-valued and hence comprises both spectral and phase information. The spectral information alone is clearly insufficient for DOA estimation. It is therefore common practice to use the phase of the TF representation of the received microphone signals, or their respective phase differences, as they are directly related to the DOA in non-reverberant environments. We decided to use an alternative feature, which is generally independent of the speech signal and is mainly determined by the spatial information. For that, we selected the relative transfer function (RTF) [10] as our feature, since it is known to encapsulate the spatial fingerprint of each sound source. Specifically, we use the instantaneous relative transfer function (iRTF), which is the bin-wise ratio between the $m$-th microphone signal and the reference microphone signal $z_{\mathrm{ref}}(l,k)$:
$$\mathrm{iRTF}(m,l,k) = \frac{z_m(l,k)}{z_{\mathrm{ref}}(l,k)}. \quad (3)$$
Note that the reference microphone is arbitrarily chosen. Reference microphone selection is beyond the scope of this paper (see [20] for a reference microphone selection method). The input feature set extracted from the recorded signal is thus a 3D tensor $\mathcal{R}$:
$$\mathcal{R}(l,k,m) = [\mathrm{Re}(\mathrm{iRTF}(m,l,k)),\, \mathrm{Im}(\mathrm{iRTF}(m,l,k))]. \quad (4)$$
The tensor $\mathcal{R}$ is constructed from $L \times K$ bins, where $L$ is the number of time frames and $K$ is the number of frequencies. Since the iRTFs are normalized by the reference microphone, it is excluded from the features. For each TF bin $(l,k)$ there are thus $P = 2(M-1)$ channels, where the multiplication by 2 is due to the real and imaginary parts of the complex-valued feature. For each TF bin, the spatial features were normalized to have zero mean and unit variance.
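To make the feature construction concrete, the following is a minimal NumPy sketch of (3)-(4). It is illustrative rather than the paper's actual code: all names are ours, the three-frame RTF smoothing used in our implementation is omitted for brevity, and the per-bin normalization reflects one plausible reading of the normalization described above.

```python
import numpy as np

def extract_irtf_features(Z, ref=0, eps=1e-8):
    """Build the iRTF feature tensor R of (4) from multichannel STFT data.

    Z   : complex array of shape (M, L, K) -- M microphones, L frames, K freqs.
    ref : index of the (arbitrarily chosen) reference microphone.
    Returns a real array of shape (L, K, P) with P = 2*(M-1) channels.
    """
    M, L, K = Z.shape
    # Bin-wise ratio between each microphone and the reference, eq. (3).
    irtf = Z / (Z[ref] + eps)                    # shape (M, L, K)
    irtf = np.delete(irtf, ref, axis=0)          # the reference channel is excluded
    # Stack real and imaginary parts as separate channels, eq. (4)
    # (channel ordering is an implementation detail).
    R = np.concatenate([irtf.real, irtf.imag], axis=0)   # (2(M-1), L, K)
    R = np.transpose(R, (1, 2, 0))               # (L, K, P)
    # Per-bin normalization over the channel axis: zero mean, unit variance
    # (one plausible reading of the normalization described in the text).
    mu = R.mean(axis=-1, keepdims=True)
    sigma = R.std(axis=-1, keepdims=True) + eps
    return (R - mu) / sigma
```

For a four-microphone array with $L = K = 256$, `Z` has shape `(4, 256, 256)` and the returned tensor has $P = 6$ channels.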
Recall that the WDO assumption [17] implies that each TF bin $(l,k)$ is dominated by a single speaker. Consequently, as the speakers are spatially separated, i.e., located at different DOAs, each TF bin is dominated by a single DOA. Our goal in this work is to accurately estimate the speaker direction at every TF bin from the given mixed recorded signal.

2.2 FCN-based DOA estimation

We formulated the DOA estimation as a classification task by discretizing the DOA range. The resolution was set to 5°, such that the DOA candidates are in the set $\Theta = \{0°, 5°, 10°, \ldots, 180°\}$. Let $D_{l,k}$ be a random variable (r.v.) representing the active dominant direction at bin $(l,k)$. Our task boils down to deducing the conditional distribution over the discrete set of DOAs in $\Theta$ for each TF bin, given the recorded mixed signal:
$$p_{l,k}(\theta) = p(D_{l,k} = \theta \mid \mathcal{R}), \quad \theta \in \Theta. \quad (5)$$
For this task, we use a DNN. The network output is an $L \times K \times |\Theta|$ tensor, where $|\Theta|$ is the cardinality of the set $\Theta$. Under this construction of the feature tensor and the output probability tensor, a pixel-to-pixel approach for mapping a 3D input "image" $\mathcal{R}$ to a 3D output "image" $p_{l,k}(\theta)$ can be utilized. An FCN is used to compute (5) for each TF bin. The pixel-to-pixel method is beneficial in two ways. First, for each TF bin in the input image, the network estimates the DOA distribution separately. Second, the TF supervision is carried out with the spectra of the different speakers. The FCN hence takes advantage of the spectral structure and the continuity of the sound sources in both the time and frequency axes. These structures contribute to the pixel-wise classification task, and prevent discontinuities in the DOA decisions over time. In our implementation, we used a U-net architecture, similar to the one described in [18]. We dub our algorithm time-frequency direction-of-arrival net (TF-DOAnet).

The input to the network is the feature tensor $\mathcal{R}$ (4). In our U-net architecture, the input shape is $(L, K, P)$, where $K = 256$ is the number of frequency bins, $L = 256$ is the number of frames, and $P = 2(M-1)$, where $M$ is the number of microphones. A high overlap between successive STFT frames is used. This improves the estimation accuracy of the RTFs, by averaging three consecutive frames in both the numerator and the denominator of (3), without sacrificing the instantaneous nature of the RTF.

[Figure 1: Block diagram of the TF-DOAnet algorithm. The microphone signals $z_1(t), \ldots, z_M(t)$ are transformed to the STFT domain; the iRTFs $\mathrm{iRTF}(1,l,k), \ldots, \mathrm{iRTF}(M-1,l,k)$ are split into real and imaginary parts and concatenated into $\mathcal{R}$; the FCN then outputs $p_{l,k}(\theta)$ for $\theta = 0°, 5°, \ldots, 180°$. The dashed envelope describes the feature extraction step.]

TF bins in which there is no active speech are non-informative. Therefore, the estimation is carried out only on speech-active TF bins. As we assume that the acquired signals are noiseless, we define a TF-based voice activity detector (VAD) as follows:
$$\mathrm{VAD}(l,k) = \begin{cases} 1 & |z_{\mathrm{ref}}(l,k)| \geq \epsilon \\ 0 & \text{otherwise}, \end{cases} \quad (6)$$
where $\epsilon$ is a threshold value. In noisy scenarios, a robust speech presence probability (SPP) estimator can be used instead of the VAD [24].

The task of DOA estimation only requires time-frame estimates. Hence, we aggregate over all active frequencies at a given time frame to obtain a frame-wise probability:
$$p_l(\theta) = \frac{1}{K'} \sum_{k=1}^{K} p_{l,k}(\theta)\, \mathrm{VAD}(l,k), \quad (7)$$
where $K'$ is the number of active frequency bands at the $l$-th time frame. We thus obtain, for each time frame, a posterior distribution over all possible DOAs. If the number of speakers is known in advance, we can choose the directions corresponding to the highest posterior probabilities. If an estimate of the number of speakers is also required, it can be determined by applying a suitable threshold. Figure 1 summarizes the TF-DOAnet in a block diagram.
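The following hedged NumPy sketch illustrates the post-processing of (6)-(7), from the network's TF posteriors to frame-wise DOA picks; the threshold value and all names are ours, not from the paper's code.

```python
import numpy as np

def frame_doa_posterior(p_tf, z_ref, eps_thresh=1e-3):
    """Aggregate TF-bin posteriors into frame-wise DOA posteriors, eqs. (6)-(7).

    p_tf  : array (L, K, D) -- network output p_{l,k}(theta) over D candidate DOAs.
    z_ref : complex array (L, K) -- reference-microphone STFT, used for the VAD.
    """
    vad = (np.abs(z_ref) >= eps_thresh).astype(p_tf.dtype)    # eq. (6)
    K_active = np.maximum(vad.sum(axis=1, keepdims=True), 1)  # K' per frame
    # eq. (7): average posteriors over the speech-active frequency bins only.
    p_frame = (p_tf * vad[..., None]).sum(axis=1) / K_active  # (L, D)
    return p_frame

def pick_doas(p_frame, n_speakers=2, grid=np.arange(0, 185, 5)):
    """If the number of speakers is known, take the top-posterior directions."""
    idx = np.argsort(p_frame, axis=-1)[:, -n_speakers:]
    return grid[idx]                                          # (L, n_speakers)
```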
2.3 Training

The supervision in the training phase is based on the WDO assumption, in which each TF bin is dominated by (at most) a single speaker. The training is based on simulated data generated by a publicly available RIR generator software (available online at github.com/ehabets/RIR-Generator), efficiently implementing the image method [1]. A four-microphone linear array was simulated with (8, 8, 8) cm inter-microphone distances; similar microphone spacings were used in the test phase. For each training sample, the acoustic conditions were randomly drawn from one of five simulated rooms of different sizes and different reverberation levels (RT60), as described in Table 1. The microphone array was randomly placed in the room in one out of six arbitrary positions.

For each scenario, two clean signals were randomly drawn from the Wall Street Journal 1 (WSJ1) database [15] and convolved with RIRs corresponding to two different DOAs in the range $\Theta = \{0°, 5°, \ldots, 180°\}$. The sampling rate of all signals and RIRs was 16 kHz. The speakers were positioned at a fixed radius from the center of the microphone array; to enrich the training diversity, the radius was perturbed by Gaussian noise. The DOA of each speaker was calculated w.r.t. the center of the microphone array. The contributions of the two sources were then summed with a random signal-to-interference ratio (SIR) to obtain the received microphone signals. Next, we calculated the STFT of both the mixture and the separate signals, with a frame length of 512 samples and a high overlap between two successive frames.

Table 1: Configuration of training data generation (simulated training data)
  Rooms                   Room 1 - Room 5, of different sizes and RT60 levels
  Signal                  noiseless signals from the WSJ1 training database
  Array position in room  6 arbitrary positions in each room
  Source-array distance   fixed radius, perturbed by Gaussian noise

Table 2: Configuration of test data generation (simulated test data)
  Rooms                   Room 1 and Room 2, of different sizes and RT60 levels
  Signal                  noiseless signals from the WSJ1 test database
  Array position in room  4 arbitrary positions in each room
  Source-array distance   1 m and 2 m

We then constructed the audio feature tensor $\mathcal{R}$ as described in Sec. 2.1. In the training phase, both the location and a clean recording of each speaker were known, and hence could be used to generate the labels. For each TF bin $(l,k)$, the dominant speaker was determined by:
$$\text{dominant speaker} \leftarrow \arg\max_i |s_i(l,k)\, h_{i,\mathrm{ref}}(l,k)|. \quad (8)$$
The ground-truth label $D_{l,k}$ is the DOA of the dominant speaker. The training set comprised four hours of recordings with 30,000 different scenarios of two-speaker mixtures. It is worth noting that, as the length of each speaker recording was different, the utterances could also include non-speech or single-speaker frames. The network was trained to minimize the cross-entropy between the correct and the estimated DOA, summed over all the images in the training set. The network was implemented in TensorFlow with the Adam optimizer [12]. The number of epochs was set to 100, and the training stopped if the validation loss increased for 3 successive epochs. The mini-batches were composed of whole $L \times K$ images.
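A sketch of the label generation in (8), assuming the clean source images at the reference microphone are available (as they are during training); the helper names are illustrative.

```python
import numpy as np

def wdo_labels(S_ref, doas, grid=np.arange(0, 185, 5)):
    """Per-bin DOA class labels under the WDO assumption, eq. (8).

    S_ref : complex array (N, L, K) -- STFTs of the N clean source images
            at the reference microphone, i.e. s_i(l,k) * h_{i,ref}(l,k).
    doas  : length-N array of ground-truth DOAs (degrees) for the sources.
    Returns integer class indices of shape (L, K).
    """
    # Index of the dominant (loudest) source per TF bin, eq. (8).
    dominant = np.argmax(np.abs(S_ref), axis=0)
    # Map each source's DOA to its nearest class on the 5-degree grid.
    classes = np.array([np.argmin(np.abs(grid - d)) for d in doas])
    return classes[dominant]   # (L, K) label image
```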
3 Experimental Study

In this section, we evaluate the TF-DOAnet and compare its performance with classic and DNN-based algorithms. To objectively evaluate the TF-DOAnet, we first simulated two unfamiliar test rooms. Then, we tested the TF-DOAnet with real RIR recordings in different rooms. Finally, a real-life scenario with fast-moving speakers was recorded and tested.

For each test scenario, we selected two speakers from the test set of the WSJ1 database [15], placed them at two different angles between 0° and 180° relative to the microphone array, at a distance of either 1 m or 2 m. The signals were generated by convolving the speech with RIRs corresponding to the source positions, in either simulated or recorded acoustic scenarios.
Performance measures Two different measures were used to objectively evaluate the results: the mean absolute error (MAE) and the localization accuracy (Acc.). The MAE, computed between the true and estimated DOAs for each evaluated acoustic condition, is given by
$$\mathrm{MAE}(°) = \frac{1}{N \cdot C} \sum_{c=1}^{C} \min_{\pi \in S_N} \sum_{n=1}^{N} |\theta_n^c - \hat{\theta}_{\pi(n)}^c|, \quad (9)$$
where $N$ is the number of simultaneously active speakers, $C$ is the total number of speech-mixture segments considered for evaluation for a specific acoustic condition, and $S_N$ is the set of permutations of the $N$ speakers. In our experiments, $N = 2$. The true and estimated DOAs of the $n$-th speaker in the $c$-th mixture are denoted by $\theta_n^c$ and $\hat{\theta}_n^c$, respectively.

The localization accuracy is given by
$$\mathrm{Acc.}(\%) = \frac{\hat{C}_{\mathrm{acc.}}}{C} \times 100, \quad (10)$$
where $\hat{C}_{\mathrm{acc.}}$ denotes the number of speech mixtures for which the localization of the speakers is accurate. We considered the localization of the speakers in a speech frame to be accurate if the distance between the true and the estimated DOA for all the speakers was at most 5°, i.e., one step of the DOA grid.
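Both measures can be computed as in the following sketch; for simplicity we assume the accuracy count reuses the MAE-optimal speaker permutation, and all names are illustrative.

```python
import numpy as np
from itertools import permutations

def mae_and_accuracy(true_doas, est_doas, tol=5.0):
    """MAE with optimal speaker permutation, eq. (9), and accuracy, eq. (10).

    true_doas, est_doas : arrays of shape (C, N) in degrees,
                          C mixtures with N speakers each.
    """
    C, N = true_doas.shape
    total_err, n_accurate = 0.0, 0
    for c in range(C):
        # Resolve the speaker-label ambiguity with the best permutation.
        errs = [np.abs(true_doas[c] - est_doas[c, list(p)])
                for p in permutations(range(N))]
        best = min(errs, key=np.sum)
        total_err += best.sum()
        # A mixture counts as accurate if every speaker is within tol degrees.
        n_accurate += int(np.all(best <= tol))
    mae = total_err / (N * C)       # eq. (9)
    acc = 100.0 * n_accurate / C    # eq. (10)
    return mae, acc
```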
Compared algorithms We compared the performance of the TF-DOAnet with two frequently used baseline methods, namely the MUSIC and SRP-PHAT algorithms. In addition, we compared its performance with the CNN multi-speaker DOA (CMS-DOA) estimator [4] (the trained model is available at https://github.com/Soumitro-Chakrabarty/Single-speaker-localization). To facilitate the comparison, the MUSIC pseudo-spectrum was computed for each frequency sub-band and for each STFT time frame, with an angular resolution of 5° over the entire DOA domain. It was then averaged over all frequency sub-bands to obtain a broadband pseudo-spectrum, followed by averaging over all $L$ time frames. Finally, the two DOAs with the highest values were selected as the final DOA estimates. Similar post-processing was applied to the SRP-PHAT pseudo-likelihood computed for each time frame.
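For reference, a compact NumPy sketch of this broadband-MUSIC post-processing. It is a simplified reading of the baseline, not its exact implementation: the per-frame averaging is folded into a single covariance estimate per frequency over all frames, and far-field ULA steering is assumed; all names are ours.

```python
import numpy as np

def broadband_music_doas(Z, mic_pos, freqs, grid_deg=np.arange(0, 185, 5),
                         n_src=2, c=343.0):
    """Broadband MUSIC baseline: average narrowband pseudo-spectra over
    frequency (time averaging folded into the covariance), pick top-2 DOAs.

    Z       : complex array (M, L, K) of multichannel STFT coefficients.
    mic_pos : (M,) microphone positions along the linear array [m].
    freqs   : (K,) frequency of each STFT bin [Hz].
    """
    M, L, K = Z.shape
    theta = np.deg2rad(grid_deg)
    pseudo = np.zeros(len(grid_deg))
    for k in range(K):
        X = Z[:, :, k]                              # (M, L) snapshots of band k
        Rxx = X @ X.conj().T / L                    # spatial covariance estimate
        w, V = np.linalg.eigh(Rxx)                  # ascending eigenvalues
        En = V[:, :M - n_src]                       # noise subspace
        # Far-field steering vectors for the ULA at this frequency.
        tau = np.outer(mic_pos, np.cos(theta)) / c  # (M, D) propagation delays
        A = np.exp(-2j * np.pi * freqs[k] * tau)
        # Narrowband pseudo-spectrum, accumulated over frequency bands.
        pseudo += 1.0 / np.maximum(
            np.linalg.norm(En.conj().T @ A, axis=0) ** 2, 1e-12)
    top = np.argsort(pseudo)[-n_src:]               # two largest peaks
    return grid_deg[np.sort(top)]
```

For the four-microphone sub-array used here, `mic_pos` would be `np.array([0.0, 0.08, 0.16, 0.24])`.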
Static simulated scenario We first generated a test dataset with simulated RIRs. Two different rooms were used, as described in Table 2. For each scenario, two speakers (male or female) were randomly drawn from the WSJ1 test database and placed at two different DOAs within the range $\{0°, 5°, \ldots, 180°\}$ relative to the microphone array. The microphone array was identical to the one used in the training phase. Using the RIR generator, we generated the RIRs for the given scenario and convolved them with the speakers' signals.

The results for the TF-DOAnet and the competing methods are depicted in Table 3. The table shows that the deep-learning approaches outperform the classic approaches. The TF-DOAnet achieved very high scores and outperformed the DNN-based CMS-DOA algorithm in terms of both MAE and accuracy.

Table 3: Results for two different test rooms with simulated RIRs
                    Room 1          Room 2
                    MAE    Acc.     MAE    Acc.
  MUSIC [7]         26.2   28.4     31.5   16.9
  SRP-PHAT [2]      25.1   26.7     35.0   15.6
  CMS-DOA [4]       13.1   71.1     24.0   38.1
  TF-DOAnet

Static real recordings scenario The best way to evaluate the capabilities of the TF-DOAnet is to test it in real-life scenarios. For this purpose, we first carried out experiments with real measured RIRs from a multichannel impulse response database [11]. The database comprises RIRs measured in an acoustics lab for three different reverberation times, RT60 = 0.16, 0.36, and 0.61 s. The lab dimensions are 6 × 6 × 2.4 m. The recordings were carried out at different DOA positions in the range [0°, 180°], in steps of 15°. The sources were positioned at distances of 1 m and 2 m from the center of the microphone array. The recordings were carried out with a linear array of eight microphones, with three different microphone-spacing setups. For our experiment, we chose the [8, 8, 8, 8, 8, 8, 8] cm setup. In order to construct an array identical to the one used in the training phase, we selected a sub-array of the four center microphones out of the eight microphones in the original setup. Consequently, we used a uniform linear array (ULA) with $M = 4$ elements and an inter-microphone distance of 8 cm.

The results for the TF-DOAnet compared with the competing methods are depicted in Table 4. Again, the TF-DOAnet outperforms all competing methods, including the CMS-DOA algorithm. Interestingly, for the 1 m case, the best results for the TF-DOAnet were obtained for the highest reverberation level, namely RT60 = 610 ms, and for the 2 m case, for RT60 = 360 ms. While surprising at first glance, this can be explained as follows. There is accumulated evidence that reverberation, if properly addressed, can be beneficial in speech processing, specifically for multi-microphone speech enhancement and source extraction [10, 14, 8] and for speaker localization [6, 13]. In reverberant environments, the intricate acoustic propagation pattern constitutes a specific "fingerprint" characterizing the location of the speaker(s). When the reverberation level increases, this fingerprint becomes more pronounced and is actually more informative than its anechoic counterpart. An inference methodology that is capable of extracting the essential driving parameters of the RIR will therefore improve when the reverberation is higher. If the acoustic propagation becomes even more complex, as in the case of high reverberation and a remote speaker, a slight performance degradation may occur; but, as evident from the localization results, for sources located 2 m from the array, the performance for RT60 = 610 ms was still better than for RT60 = 160 ms.

[Table 4: Results for three different reverberation levels at distances of 1 m and 2 m with measured RIRs.]
Real-life dynamic scenario To further evaluate the capabilities of the TF-DOAnet, we also carried out experiments with real dynamic scenarios, in a room with an adjustable reverberation level, which we set to two values, RT60 = 390 ms and RT60 = 720 ms. The microphone array consisted of four microphones with an inter-microphone spacing of 8 cm, matching the training setup. The speakers walked naturally along an arc at an approximately constant distance from the center of the microphone array. For each RT60, two experiments were recorded: the two speakers started at two distinct angles and walked along the arc until reaching the far ends of their trajectories, turned around, and walked back to their starting points; this was repeated several times throughout the recording. Figure 2a depicts the real-life experiment setup and Fig. 2b depicts a schematic diagram of the speakers' trajectories. The ground-truth labels for this experiment were measured with the Marvelmind indoor 3D tracking set (https://marvelmind.com/product/starter-set-ia-02-3d/).

[Figure 2: Real-life experiment setup: (a) room view; (b) speakers' trajectory.]

Figures 3 and 4 depict the results of the two experiments. It is clear that the TF-DOAnet outperformed the CMS-DOA algorithm, especially in the high-RT60 conditions. Whereas the CMS-DOA estimates fluctuated rapidly, the TF-DOAnet output trajectory was smooth and noiseless.

[Figure 3: Real-life recording of two moving speakers with RT60 = 390 ms: (a) ground truth; (b) CMS-DOA; (c) TF-DOAnet.]

[Figure 4: Real-life recording of two moving speakers with RT60 = 720 ms: (a) ground truth; (b) CMS-DOA; (c) TF-DOAnet.]

Ablation study In our implementation, we used the real and imaginary parts of the RTF (4). Other choices might be beneficial; for example, in [5], the cosine and sine of the phase of the RTF were used, and in other approaches the spectrum was added to the spatial features [25]. In this section, the different features are tested with the same model. We compared three feature sets: first, the proposed features as described in (4); second, a variant of our approach with the spectrum added ('TF-DOAnet with Spec.'); and third, the cosine and sine features as presented in [5] ('Cos-Sin'). All features were constructed from the same training data described in Sec. 2.3, and tested under the test conditions described in Table 2.

First, it is clear that all the features, combined with our high-resolution TF model, outperformed the frame-based CMS-DOA algorithm as reported in Table 3. This confirms that the TF supervision is beneficial for the task at hand. Second, the proposed features proved better than the Cos-Sin features. Finally, it is interesting to note that adding the spectrum features slightly deteriorated the results for this task.

Table 5: Ablation study results with different features
                            Room 1          Room 2
                            MAE    Acc.     MAE    Acc.
  Cos-Sin                   1.2    96.1     2.8    91.3
  TF-DOAnet with Spec.      0.6    98.4     3.3    86.7
  TF-DOAnet
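The three feature variants of Table 5 differ only in how the iRTF is encoded; a hedged sketch (our naming; shapes follow Sec. 2.1):

```python
import numpy as np

def feature_variants(irtf, spec=None, mode="re-im"):
    """The three feature sets compared in the ablation study (illustrative).

    irtf : complex array (L, K, M-1) of instantaneous RTFs.
    spec : optional (L, K) log-spectrum of the reference microphone.
    """
    if mode == "re-im":        # proposed: real and imaginary parts, eq. (4)
        feats = [irtf.real, irtf.imag]
    elif mode == "cos-sin":    # variant of [5]: phase-only encoding
        phase = np.angle(irtf)
        feats = [np.cos(phase), np.sin(phase)]
    else:
        raise ValueError(mode)
    F = np.concatenate(feats, axis=-1)
    if spec is not None:       # 'TF-DOAnet with Spec.' appends the spectrum
        F = np.concatenate([F, spec[..., None]], axis=-1)
    return F
```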
4 Conclusion

An FCN approach for the DOA estimation task was presented in this paper. Instantaneous RTF features were used to train the model. The high TF resolution facilitates the simultaneous tracking of multiple moving speakers. A comprehensive experimental study was carried out with simulated and real-life recordings. The proposed approach outperformed both the classic and the CNN-based SOTA algorithms in all experiments. The training and test datasets, which represent different real-life scenarios, were constructed as a DOA benchmark and will become available after publication.
Broader impact
Several modern technologies can benefit from the proposed localization algorithm. We already mentioned the emerging technology of smart speakers in the Introduction. These devices are equipped with multiple microphones and implement location-specific tasks, e.g., the extraction of a speaker of interest. Of particular interest are socially assistive robots (SARs), as they are likely to play an important role in healthcare and psychological well-being, in particular during the non-medical phases inherent to any hospital process.

The algorithm uses neither the content nor the identity of the speakers and hence does not violate the privacy of the users. Moreover, since speech signals normally cannot propagate over long distances, the application of the algorithm is limited to small enclosures.
References

[1] Jont B. Allen and David A. Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943-950, 1979.
[2] Michael S. Brandstein and Harvey F. Silverman. A robust method for speech signal time-delay estimation in reverberant rooms. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997.
[3] Soumitro Chakrabarty and Emanuël A. P. Habets. Broadband DOA estimation using convolutional neural networks trained with noise signals. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
[4] Soumitro Chakrabarty and Emanuël A. P. Habets. Multi-speaker DOA estimation using deep convolutional networks trained with noise signals. IEEE Journal of Selected Topics in Signal Processing, 13(1):8-21, 2019.
[5] Shlomo E. Chazan, Hodaya Hammer, Gershon Hazan, Jacob Goldberger, and Sharon Gannot. Multi-microphone speaker separation based on deep DOA estimation. In European Signal Processing Conference (EUSIPCO), 2019.
[6] Antoine Deleforge, Florence Forbes, and Radu Horaud. Acoustic space learning for sound-source separation and localization on binaural manifolds. International Journal of Neural Systems, 25(01):1440003, 2015.
[7] Jacek P. Dmochowski, Jacob Benesty, and Sofiène Affes. Broadband MUSIC: Opportunities and challenges for multiple source localization. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2007.
[8] Ivan Dokmanić, Robin Scheibler, and Martin Vetterli. Raking the cocktail party. IEEE Journal of Selected Topics in Signal Processing, 9(5):825-836, 2015.
[9] Christine Evers, Heinrich Loellmann, Heinrich Mellmann, Alexander Schmidt, Hendrik Barfuss, Patrick Naylor, and Walter Kellermann. The LOCATA challenge: Acoustic source localization and tracking. arXiv preprint arXiv:1909.01008, 2019.
[10] Sharon Gannot, David Burshtein, and Ehud Weinstein. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Transactions on Signal Processing, 49(8):1614-1626, 2001.
[11] Elior Hadad, Florian Heese, Peter Vary, and Sharon Gannot. Multichannel audio database in various acoustic environments. In International Workshop on Acoustic Signal Enhancement (IWAENC), 2014.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] Bracha Laufer-Goldshtein, Ronen Talmon, and Sharon Gannot. Semi-supervised sound source localization based on manifold regularization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(8):1393-1407, 2016.
[14] Shmulik Markovich-Golan, Sharon Gannot, and Israel Cohen. Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 17(6):1071-1086, 2009.
[15] Douglas B. Paul and Janet M. Baker. The design for the Wall Street Journal-based CSR corpus. In Workshop on Speech and Natural Language, 1992.
[16] Hadrien Pujol, Eric Bavu, and Alexandre Garcia. Source localization in reverberant rooms using deep learning and microphone arrays. In International Congress on Acoustics (ICA), 2019.
[17] Scott Rickard and Özgür Yılmaz. On the approximate W-disjoint orthogonality of speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.
[18] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
[19] Ralph Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276-280, 1986.
[20] Sebastian Stenzel, Jürgen Freudenberger, and Gerhard Schmidt. A minimum variance beamformer for spatially distributed microphones using a soft reference selection. In Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014.
[21] Ryu Takeda and Kazunori Komatani. Discriminative multiple sound source localization based on deep neural networks using independent location model. In IEEE Spoken Language Technology Workshop (SLT), 2016.
[22] Juan Manuel Vera-Diaz, Daniel Pizarro, and Javier Macias-Guarasa. Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates. Sensors, 18(10):3418, 2018.
[23] Fabio Vesperini, Paolo Vecchiotti, Emanuele Principi, Stefano Squartini, and Francesco Piazza. A neural network based algorithm for speaker localization in a multi-room environment. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
[24] DeLiang Wang and Jitong Chen. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1702-1726, 2018.
[25] Zhong-Qiu Wang, Jonathan Le Roux, and John R. Hershey. Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[26] Xiong Xiao, Shengkui Zhao, Xionghu Zhong, Douglas L. Jones, Eng Siong Chng, and Haizhou Li. A learning-based approach to direction of arrival estimation in noisy and reverberant environments. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[27] Özgür Yılmaz and Scott Rickard. Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 52(7):1830-1847, 2004.