Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition
Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu
Aswin Shanmugam Subramanian (a,*), Chao Weng (c), Shinji Watanabe (b,a), Meng Yu (d), Dong Yu (d)
(a) Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
(b) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
(c) Tencent AI Lab, Shenzhen, China
(d) Tencent AI Lab, Bellevue, WA, USA
Abstract
Multi-source localization is an important and challenging technique for multi-talker conversation analysis. This paper proposes a novel supervised learning method using deep neural networks to estimate the direction of arrival (DOA) of all the speakers simultaneously from the audio mixture. At the heart of the proposal is a source splitting mechanism that creates source-specific intermediate representations inside the network. This allows our model to give source-specific posteriors as the output, unlike the traditional multi-label classification approach. Existing deep learning methods perform a frame-level prediction, whereas our approach performs an utterance-level prediction by incorporating temporal selection and averaging inside the network to avoid post-processing. We also experiment with various loss functions and show that a variant of the earth mover distance (EMD) is very effective in classifying DOA at a very high resolution by modeling inter-class relationships. In addition to using the prediction error as a metric for evaluating our localization model, we also establish its potency as a frontend with automatic speech recognition (ASR) as the downstream task. We convert the estimated DOAs into a feature suitable for ASR and pass it as an additional input feature to a strong multi-channel, multi-talker speech recognition baseline. This added input feature drastically improves the ASR performance, giving a word error rate (WER) of 6.3% on the evaluation data of our simulated noisy two-speaker mixtures, while the baseline, which does not use explicit localization input, has a WER of 11.5%. We also perform ASR evaluation on real recordings with the overlapped set of the MC-WSJ-AV corpus in addition to simulated mixtures.
Keywords: source localization, multi-talker speech recognition
1. Introduction
Source localization, also known as direction of arrival (DOA) estimation, is the task of estimating the direction of the sound sources with respect to the microphone array. This localization knowledge can aid many downstream applications. For example, it is used in robot audition systems [1, 2] to facilitate interaction with humans. Source localization is pivotal in making human-robot communication more natural by rotating the robot's head. The localization component in the robot audition pipeline also aids in source separation and recognition. Moreover, there is growing interest in using far-field systems to process multi-talker conversations, as in meeting transcription [3]. Incorporating a source localization functionality can enrich such systems by monitoring the location of speakers and can also potentially help improve the performance of the downstream automatic speech recognition (ASR) task. DOA estimation can also be used in smart speaker scenarios [4, 5] and potentially aid better audio-visual fusion.

* Corresponding Author. A part of the work was carried out by this author during an internship at Tencent AI Lab, Bellevue, USA.
Email addresses: [email protected] (Aswin Shanmugam Subramanian), [email protected] (Chao Weng), [email protected] (Shinji Watanabe), [email protected] (Meng Yu), [email protected] (Dong Yu)
Preprint submitted to Computer Speech & Language, February 17, 2021

Some of the earlier DOA estimation techniques are based on narrowband subspace methods [6, 7]. A simple wideband approximation can be achieved by using the incoherent signal subspace method (ISSM), which applies techniques such as multiple signal classification (MUSIC) independently on narrowband signals of different frequencies and averages their results to obtain the final DOA estimate. Broadband methods which better utilize the correlation between different frequencies, such as weighted average of signal subspaces (WAVES) [8] and test of orthogonality of projected subspaces (TOPS) [9], have shown promising improvements. Alternative cross-correlation based methods like steered response power with phase transform (SRP-PHAT) [10] have also been proposed. All these previously mentioned methods are based on signal processing, and they are not robust to reverberation as they were developed under a free-field propagation model [11]. Recently, there has been a shift in interest to supervised deep learning methods to make the estimation robust to challenging acoustic conditions.

The initial deep learning methods were developed for single-source scenarios [12, 13, 14, 15, 16]. In [12] and [13], features are extracted from MUSIC and GCC-PHAT, respectively. The learning is made more robust in [14, 15, 16] by encapsulating the feature extraction inside the neural network. While [13] treats DOA estimation as a regression problem, all the other methods formulate it as a classification problem by discretizing the possible DOA angles. The modeling resolution, which is determined by the output dimension of the network, is quite low in [12], [14], and [16] at 5°, 45°, and 5°, respectively. Although the modeling resolution is high in [15], with classes separated by just 1°, the evaluation was performed only with a block size of 5°, and the method is also shown to be not robust to noise.

The deep learning models were extended to handle multiple-source cases by treating localization as a multi-label classification problem in [17, 11, 18]. This still gives only a single vector as output, with each dimension treated as a separate classification problem. This procedure is not good at capturing the intricate inter-class relationship in our problem, so increasing the number of classes to perform classification at a very high resolution will fail. During inference, the output vector is treated like a spatial spectrum and DOAs are estimated based on its peaks. In [18, 17], the number of peaks is determined based on a threshold. In [11], the number of sources is assumed to be known, based on which the peaks are chosen at the output. In [17] and [11], the source locations are restricted to lie on a spatial grid with a step size of 10° and 15°, respectively. To make the models work in realistic conditions, it is crucial to handle all possible source locations. Moreover, these models perform a frame-level prediction in spite of the sources being assumed to be stationary inside an utterance. Hence, post-processing is required to get the utterance-level DOA estimate.

In this work, we propose a source splitting mechanism, which involves treating the neural network as two distinct and well defined components. The first component disentangles the input mixture by implicitly splitting it into source-specific hidden representations. The second component then maps these features to as many DOA posteriors as the number of sources in the mixture.
This makes the number of classification problems equal only to the number of sources and not to the number of DOA classes as in the existing methods. With the added help of loss functions that can handle the inter-class relationship better, our methods can classify DOA reliably at a high resolution. Like [11], we assume knowledge of the number of sources in the utterance, but we incorporate it directly in the architecture of the model instead of using it during post-processing.

We evaluate our proposed localization method both as a standalone task and also based on its effectiveness as a frontend to aid speech recognition. Assuming the DOA to be known, the effectiveness of using features extracted from the ground-truth location for target speech recognition was shown in [19, 20] as a proof of concept. The importance of localization for multi-talker speech recognition was also shown in our previous work [21]. In [21], we proposed a model named directional ASR (D-ASR), which can learn to predict DOA without explicit supervision by joint optimization with speech recognition. D-ASR replaces the typical separation subnetwork used in multi-talker speech recognition with a localization subnetwork. Although D-ASR works well for clean speech mixtures, it is not robust to additive stationary noise. In the second part of this paper, we devise a more robust approach than D-ASR for improving multi-talker speech recognition. We use a strong multi-talker speech recognition system called MIMO-Speech [22, 23] as a baseline. The DOAs estimated from our proposed models in the first part are converted to angle features [19]. These features are given as additional inputs to the MIMO-Speech model and tested for their effectiveness in improving speech recognition. It is hard to obtain DOA labels for real datasets. We perform DOA estimation for real speech mixtures from the multichannel wall street journal audio-visual (MC-WSJ-AV) corpus and evaluate the quality of localization solely based on the downstream ASR metric of word error rate (WER).

Figure 1: Multi-label Classification Model for 2-source Scenario
Figure 2: Proposed Model with Source Splitting Mechanism for 2-source Scenario
2. Deep Learning based Multi-Source Localization
Given a multi-channel input signal with multiple sources, a deep neural network can be trained to predict all the azimuth angles corresponding to each of the sources. Typically it is treated as a multi-label classification problem [11, 17], as shown in Figure 1. Here, the network outputs only a single vector which is supposed to encode the DOA of all the sources. It is possible to have spurious peaks in angle classes that are in close proximity to any of the target angle classes, especially when the number of classes is increased. As more than one peak needs to be selected from the vector during inference, it is possible that a spurious peak will be chosen at the expense of a different source.

We propose an alternative modeling approach for multi-source localization. In this model, the network is divided into two well-defined components with specific functionalities, as shown in Figure 2. The first component is the source splitter, which dissects the input mixed signal to extract source-specific spatial features. These features are passed to the second component, named source dependent predictors, which gives individual posterior vectors as output. As multiple outputs are obtained, each one has a well defined target, with just one target angle corresponding to a specific source.

As we will have N output predictions corresponding to the N sources in our proposed model, there will be source permutation ambiguity in how we assign the predictions to the targets. We handle it in two ways. First, we can use permutation invariant training (PIT) [24], which is popular for source separation models. In PIT, all possible prediction-target assignment permutations are considered a valid solution and the one with the minimum loss is chosen during training. Alternatively, we fix the target angles to be in ascending order and force only this permutation.

Although the sources are stationary in existing models like [11], a frame-level prediction is performed and then time averaging is used as a post-processing step during inference. In tasks like speaker identification, which involve converting a sequence of features into a vector, it is common to incorporate average pooling inside the network [25, 26]. We follow this approach in all our models to avoid post-processing.

We now describe the different deep learning architectures used in this work. First we describe an architecture based on multi-label classification. Then we introduce two different architectures based on our proposed source splitting mechanism. Let $Y^1, Y^2, \cdots, Y^M$ be the $M$-channel input signal in the short-time Fourier transform (STFT) domain, with $Y^m \in \mathbb{C}^{T \times F}$, where $T$ is the number of frames and $F$ is the number of frequency components. The input signal is reverberated, consisting of $N$ speech sources. There can be additive stationary noise, but we assume that there are no point source noises. We also assume that $N$ is known. In this work the array geometry is considered to be a uniform circular array (UCA) and only the azimuth angle is estimated.

2.1. CNN-BLSTM Model with Multi-label Classification (CNN-BLSTM-MLC)

This model is designed to take the raw multi-channel phase as the feature.
Let the phase spectrum of the multichannel input signal be represented as $\mathbf{P} \in [-\pi, \pi]^{T \times M \times F}$. This raw phase $\mathbf{P}$ is passed through the first component of the localization network, based on a convolutional neural network (CNN) and given by $\text{LocNet-CNN}(\cdot)$, to extract a phase feature $Z$ by pooling the channels as follows:

$$Z = \text{LocNet-CNN}(\mathbf{P}) \in \mathbb{R}^{T \times Q}, \qquad (1)$$

where $Q$ is the feature dimension. The LocNet-CNN(·) architecture is inspired from [11], which uses convolutional filters across the microphone-channel dimension to learn phase-difference-like features.

The phase feature $Z$ from Eq. (1) is passed through a bidirectional long short-term memory (BLSTM) component $\text{LocNet-BLSTM-MLC}(\cdot)$ as follows:

$$W = \text{LocNet-BLSTM-MLC}(Z), \qquad (2)$$

where $W \in \mathbb{R}^{T \times Q}$ is an intermediate feature. We consider that the sources are stationary within an utterance, so in the next step a simple time average is performed to obtain a summary vector:

$$\xi(q) = \frac{1}{T} \sum_{t=1}^{T} w(t, q), \qquad (3)$$

where $\xi(q)$ is the summary vector at dimension $q$, and $w(t, q)$ is the intermediate feature at time $t$ and feature dimension $q$. The summary vector, represented in vector form as $\xi \in \mathbb{R}^{Q}$, is passed through a learnable $\text{AffineLayer}(\cdot)$ to convert its dimension to $\lfloor 360/\gamma \rfloor$, where $\gamma$ is the angle resolution in degrees used to discretize the DOA angle:

$$\kappa = \sigma(\text{AffineLayer}(\xi)), \qquad (4)$$

where $\sigma(\cdot)$ is the sigmoid activation and $\kappa \in (0, 1)^{\lfloor 360/\gamma \rfloor}$ is the multi-label classification vector. The DOAs can be estimated by finding the indices in $\kappa$ corresponding to the $N$ largest peaks.

2.2. CNN-BLSTM Model with Source Splitting Mechanism (CNN-BLSTM-SS)

This architecture is inspired from the localization subnetwork used in D-ASR [21]. This model also uses raw phase as the input, and the phase feature $Z$ is extracted in the same way as in Section 2.1 using Eq. (1). $Z$ will have DOA information about all the sources in the input signal. It is processed by the next component, $\text{LocNet-Mask}(\cdot)$, which consists of BLSTM layers. Source splitting is achieved through this component, which extracts source-specific binary masks as follows:

$$[W_n]_{n=1}^{N} = \sigma(\text{LocNet-Mask}(Z)), \qquad (6)$$

where $W_n \in [0, 1]^{T \times Q}$ is the feature mask for source $n$ and $\sigma(\cdot)$ is the sigmoid activation. This mask segments $Z$ into regions that correspond to each source. Note that LocNet-Mask(·) can also implicitly perform voice activity detection as it outputs binary masks.

The extracted phase mask from Eq. (6) is used to perform a weighted averaging of the phase feature from Eq. (1) to get source-specific summary vectors. The summary vector will encode the DOA information specific to a source, as the masks are used as weights to summarize information only from the corresponding source regions, in the following way:

$$\xi_n(q) = \frac{\sum_{t=1}^{T} w_n(t, q) \, z(t, q)}{\sum_{t=1}^{T} w_n(t, q)}, \qquad (7)$$

where $\xi_n(q)$ is the summary vector for source $n$ at dimension $q$, and $w_n(t, q) \in [0, 1]$ and $z(t, q) \in \mathbb{R}$ are the extracted feature mask (for source $n$) and the phase feature, respectively, at time $t$ and feature dimension $q$.
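For concreteness, the following minimal numpy sketch shows the pooling at the heart of the source splitting mechanism: the mask-weighted temporal average of Eq. (7), of which the plain time average of Eq. (3) (and Eq. (13) below) is the all-ones-mask special case. The shapes and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def masked_average_pool(z, w, eps=1e-8):
    """Mask-weighted temporal pooling of Eq. (7).

    z: phase feature of shape (T, Q), as produced by LocNet-CNN.
    w: per-source feature masks in [0, 1], shape (N, T, Q).
    Returns the N summary vectors xi, shape (N, Q).
    """
    num = (w * z[None]).sum(axis=1)   # sum_t w_n(t, q) z(t, q)
    den = w.sum(axis=1) + eps         # sum_t w_n(t, q)
    return num / den

# An all-ones mask recovers the simple time average of Eq. (3):
T, Q = 100, 72
z = np.random.randn(T, Q)
xi_plain = masked_average_pool(z, np.ones((1, T, Q)))   # shape (1, Q)
```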
The summary vector, represented in vector form as $\xi_n \in \mathbb{R}^{Q}$, is passed through a learnable source-specific $\text{AffineLayer}_n(\cdot)$, which acts as the predictor and converts the summary vector from dimension $Q$ to dimension $\lfloor 360/\gamma \rfloor$, where $\gamma$ is the angle resolution in degrees used to discretize the DOA angle. Based on this discretization, we can predict the DOA angle as a classifier with the softmax operation. From this, we can get the source-specific posterior probability for the possible angle classes as follows:

$$[\Pr(\theta_n = \alpha_i \mid \mathbf{P})]_{i=1}^{\lfloor 360/\gamma \rfloor} = \text{Softmax}(\text{AffineLayer}_n(\xi_n)), \qquad (8)$$

$$\alpha_i = \left( \gamma i - \frac{\gamma - 1}{2} \right) \frac{\pi}{180}, \qquad (9)$$

The estimated DOA $\hat{\theta}_n$ for source $n$ is determined by finding the peak of the corresponding posterior in Eq. (8) as follows:

$$\hat{\theta}_n = \underset{\alpha_i}{\operatorname{argmax}} \; \Pr(\theta_n = \alpha_i \mid \mathbf{P}), \qquad (10)$$

2.3. BLSTM Model with Source Splitting Mechanism (BLSTM-SS)

In this model, the inter-microphone phase difference (IPD) is computed and passed as a feature directly to the model. There are no CNN layers in this model, as pre-defined features are used. The binary masking is also removed in this model to estimate the summary vectors in a more direct way. The IPD features are calculated as

$$p_i(t, f) = \cos \angle\!\left( \frac{y_{i_1}(t, f)}{y_{i_2}(t, f)} \right) + j \sin \angle\!\left( \frac{y_{i_1}(t, f)}{y_{i_2}(t, f)} \right), \quad i = 1 : I, \qquad (11)$$

where $y_m(t, f)$ is the input signal at channel $m$, time $t$ and frequency $f$; $i$ represents an entry in a microphone pair list defined for calculating the IPD; and $i_1$ and $i_2$ are the indices of the microphones in each pair (a numerical sketch of this computation is given at the end of this subsection). We calculate IPD features for $I$ pairs and then concatenate their real and imaginary parts together. The concatenated IPD feature is represented as $\bar{\mathbf{P}} \in \mathbb{R}^{T \times 2I \times F}$. The magnitude of the input signal at channel 1, i.e. $|Y^1|$, is also added as a feature to give the final input feature $\bar{Z} \in \mathbb{R}^{T \times (2IF + F)}$. This feature is passed to a BLSTM component $\text{LocNet-BLSTM}(\cdot)$, which is the source splitter, as follows:

$$[W_n]_{n=1}^{N} = \sigma(\text{LocNet-BLSTM}(\bar{Z})), \qquad (12)$$

where $W_n \in \mathbb{R}^{T \times Q}$ is the source-specific feature for source $n$. As we do not have a binary mask in this model, we take a simple average instead of the weighted average used in Eq. (7) to extract the summary vector as follows:

$$\xi_n(q) = \frac{1}{T} \sum_{t=1}^{T} w_n(t, q), \qquad (13)$$

The summary vector is used similarly to Section 2.2, and the source-specific DOAs are obtained by following Eq. (8) and Eq. (10).
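As a concrete illustration, here is a small numpy sketch of the IPD computation of Eq. (11), using the 6-microphone pair list from the experiments in Section 4 converted to 0-based indices. It is a sketch under assumed array shapes, not the authors' code.

```python
import numpy as np

def ipd_features(Y, pairs):
    """cos/sin IPD of Eq. (11).

    Y: complex STFT of shape (M, T, F); pairs: list of (i1, i2) index pairs.
    Returns a real array of shape (T, 2*I, F): cos and sin parts stacked.
    """
    feats = []
    for i1, i2 in pairs:
        phi = np.angle(Y[i1] * np.conj(Y[i2]))   # phase difference, (T, F)
        feats.extend([np.cos(phi), np.sin(phi)])
    return np.stack(feats, axis=1)

# 6-microphone pair list used later in the experiments, 0-based:
pairs = [(0, 3), (1, 4), (2, 5), (0, 1), (2, 3), (4, 5)]
Y = np.random.randn(6, 100, 257) + 1j * np.random.randn(6, 100, 257)
Z_bar = ipd_features(Y, pairs)                   # shape (100, 12, 257)
```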
2.4. Loss Functions

The multi-label classification model in Section 2.1 is trained with the binary cross entropy (BCE) loss. The models in Section 2.2 and Section 2.3 give source-specific probability distributions at the output, so the cross entropy (CE) loss is the natural choice. This choice is good for a typical classification problem where the inter-class relationship is not important. As we are estimating azimuth angles, however, our classes are ordered, so it is better to optimize by taking the class relationships into account, as they are very informative.

Typically a 1-hot vector is given as the target to the CE loss. In DOA estimation it is better to predict a nearby angle than a far away angle, but the typical penalty from the CE loss does not take this into account. A loss function that takes the class ordering into account is the earth mover distance (EMD). EMD is the minimal cost required to transform one distribution into another [27]. As the classes in our problem are naturally in a sorted order, we can use the simple closed-form solution from [28]. The solution is given by the squared error between the cumulative distribution functions (CDF) of the prediction and the target probability distributions.

Another way to induce an inter-class relationship is by using a soft target probability distribution which is not as sparse as the 1-hot vector. We use the following target distribution:

$$\chi(i) = \begin{cases} w_0 & i = \psi \\ w_1 & i = (\psi \pm 1) \bmod \lfloor 360/\gamma \rfloor \\ w_2 & i = (\psi \pm 2) \bmod \lfloor 360/\gamma \rfloor \\ w_3 & \text{elsewhere} \end{cases} \qquad (14)$$

where $\chi(i)$ is the probability weight of the target distribution for class $i$, $\psi$ is the index corresponding to the target angle class, and $w_0 > w_1 > w_2 > w_3$ are fixed probability weights chosen so that $\chi$ sums to one. When this soft target distribution is used with CE, we define it as the soft cross entropy (SCE) loss. Similarly, it can also be used with EMD, and we define it as the soft earth mover distance (SEMD) loss. By assigning some probability weight to the angle classes in the neighbourhood of the target, the network can potentially be made to learn some inter-class relationship.
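The following numpy sketch shows one way to implement the SEMD loss: a soft target built as in Eq. (14) and the closed-form squared EMD of [28] as the squared difference between cumulative distributions. The specific weights (0.6, 0.15, 0.05, with zero elsewhere) are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def soft_target(psi, n_classes, w=(0.6, 0.15, 0.05)):
    """Soft target of Eq. (14) around target class psi.

    Assumed illustrative weights: centre, +-1 and +-2 circular neighbours;
    they sum to one (0.6 + 2*0.15 + 2*0.05 = 1.0)."""
    chi = np.zeros(n_classes)
    chi[psi] = w[0]
    for d in (1, 2):
        chi[(psi + d) % n_classes] = w[d]
        chi[(psi - d) % n_classes] = w[d]
    return chi

def semd_loss(pred, target):
    """Squared EMD for ordered classes [28]: squared error between CDFs."""
    return np.sum((np.cumsum(pred) - np.cumsum(target)) ** 2)

# Example at gamma = 1, i.e. 360 angle classes:
pred = np.full(360, 1.0 / 360)                 # a flat posterior
print(semd_loss(pred, soft_target(90, 360)))   # penalty grows with distance
```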
3. Integration of Localization with ASR
MIMO-Speech [22, 23] is an end-to-end neural network that can perform simultaneous multi-talker speech recognition by taking the mixed-speech multi-channel signal as the input. The MIMO-Speech network consists of three main components: (1) the masking subnetwork, (2) a differentiable minimum variance distortionless response (MVDR) beamformer, and (3) the ASR subnetwork. The masking network in MIMO-Speech is monaural and gives channel-dependent time-frequency masks for beamforming; for this, only the magnitude of the corresponding channel is given as input to the masking network. We will modify this masking network to take a composite multi-channel input and output a channel-independent mask which will be shared across all channels.

Firstly, it is possible to obtain a multi-channel masking network baseline by using $\bar{Z}$ from Section 2.3 as the composite feature, which is the concatenation of the IPD features from Eq. (11) and the magnitude of the first channel. For this to be further extended to take localization knowledge, we need to concatenate it with an additional input feature that can encode the estimated azimuth angle. The groundtruth DOA is encoded as angle features in [19, 20, 29]. We follow the same procedure but use the estimated DOA from our models proposed in Section 2.

The steering vector $d_n(f) \in \mathbb{C}^{M}$ for source $n$ and frequency $f$ is calculated from the estimated DOA $\hat{\theta}_n$. In this work we have used uniform circular arrays (UCA), and the steering vector is calculated as follows:

$$\tau_{nm} = \frac{r}{c} \cos(\hat{\theta}_n - \psi_m), \quad m = 1 : M, \qquad (15)$$

$$d_n(f) = [e^{j 2 \pi f \tau_{n1}}, e^{j 2 \pi f \tau_{n2}}, \ldots, e^{j 2 \pi f \tau_{nM}}], \qquad (16)$$

where $\tau_{nm}$ is the signed time delay between the $m$-th microphone and the center for source $n$, $\psi_m$ is the angular location of microphone $m$, $r$ is the radius of the UCA, and $c$ is the speed of sound (343 m/s). The calculated steering vector is used to compute the angle features with the following steps:

$$\tilde{a}_n(t, f) = | d_n(f)^{\mathsf{H}} \, y(t, f) |, \qquad (17)$$

$$a_n(t, f) = \tilde{a}_n(t, f) \ast \mathbb{I}\big( \tilde{a}_n(t, f) - \tilde{a}_s(t, f) \big)_{s = 1:N}, \qquad (18)$$

where $a_n(t, f)$ is the angle feature for source $n$ at time $t$ and frequency $f$, $\mathsf{H}$ is the conjugate transpose, $y(t, f) \in \mathbb{C}^{M}$ is the multichannel input, $N$ is the number of speakers, and $\mathbb{I}(\cdot)$ is the indicator function that outputs 0 if the input difference is negative for any of the $s = 1 : N$ cases and 1 otherwise (a sketch of this computation is given below).

The computed angle features for all the sources are concatenated and added to the composite feature input list that is fed to the multi-channel masking network. This subnetwork, given by $\text{MaskNet}(\cdot)$, produces the source-specific masks $L_n \in (0, 1)^{T \times F}$ as the output. The rest of the procedure is similar to the original MIMO-Speech model. The masks are used to compute the source-specific spatial covariance matrix (SCM) $\Phi_n(f)$ as follows:

$$\Phi_n(f) = \frac{1}{\sum_{t=1}^{T} l_n(t, f)} \sum_{t=1}^{T} l_n(t, f) \, y(t, f) \, y(t, f)^{\mathsf{H}}, \qquad (19)$$

The interference SCM $\Phi^{n}_{\text{intf}}(f)$ for source $n$ is approximated as $\sum_{i \neq n} \Phi_i(f)$ like [22] (we experiment only with $N = 2$, so there is no summation in that case). From the computed SCMs, the $M$-dimensional complex MVDR beamforming filter [30] for source $n$ and frequency $f$, $b^{n}_{\text{MVDR}}(f) \in \mathbb{C}^{M}$, is estimated as

$$b^{n}_{\text{MVDR}}(f) = \frac{[\Phi^{n}_{\text{intf}}(f) + \Phi_{\text{noise}}(f)]^{-1} \, \Phi_n(f)}{\operatorname{Tr}\!\big([\Phi^{n}_{\text{intf}}(f) + \Phi_{\text{noise}}(f)]^{-1} \, \Phi_n(f)\big)} \, u, \qquad (20)$$

where $u \in \{0, 1\}^{M}$ is a one-hot vector used to choose a reference microphone, $\operatorname{Tr}(\cdot)$ denotes the trace operation, and $\Phi_{\text{noise}}(f)$ is the noise SCM.
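A compact numpy sketch of Eqs. (15)-(18) follows: a UCA steering vector for an estimated DOA, and the thresholded angle feature that keeps only time-frequency bins where the given source dominates. Array shapes and helper names are assumptions for illustration.

```python
import numpy as np

def steering_vector(theta, mic_angles, r, freqs, c=343.0):
    """UCA steering vector of Eqs. (15)-(16).

    theta: source azimuth (rad); mic_angles: (M,) microphone azimuths (rad);
    r: array radius (m); freqs: (F,) frequencies in Hz.
    Returns an (F, M) complex array."""
    tau = (r / c) * np.cos(theta - mic_angles)              # signed delays, (M,)
    return np.exp(2j * np.pi * freqs[:, None] * tau[None, :])

def angle_features(Y, d_all):
    """Angle features of Eqs. (17)-(18).

    Y: (T, F, M) multichannel STFT; d_all: (N, F, M) steering vectors.
    Returns (N, T, F); entries are zeroed where another source dominates."""
    a = np.abs(np.einsum('nfm,tfm->ntf', d_all.conj(), Y))  # Eq. (17)
    keep = a >= a.max(axis=0, keepdims=True)                # indicator of Eq. (18)
    return a * keep
```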
The noise SCM can be estimated by obtaining an additional mask from the masking network. However, we consider only stationary noise in this study, and we experimentally found that it was better to ignore the noise SCM by treating it as an all-zero matrix. With the estimated MVDR filter, we can perform speech separation to obtain the $n$-th separated STFT signal $x_n(t, f) \in \mathbb{C}$ as follows:

$$x_n(t, f) = b_n(f)^{\mathsf{H}} \, y(t, f), \qquad (21)$$

This separated signal for source $n$, represented in matrix form as $X_n \in \mathbb{C}^{T \times F}$, is transformed into a feature suitable for speech recognition by applying a log Mel filterbank transformation and utterance-based mean-variance normalization (MVN). The extracted feature $O_n$ for source $n$ is passed to the speech recognition subnetwork $\text{ASR}(\cdot)$ to get $C_n = (c_{n1}, c_{n2}, \cdots)$, the token sequence corresponding to source $n$. The MIMO-Speech network is optimized with the reference text transcriptions $[C^{i}_{\text{ref}}]_{i=1}^{N}$ as the target. The joint connectionist temporal classification (CTC)/attention loss [31] is used as the ASR optimization criterion.

Here again there is a permutation ambiguity, as there are multiple output sequences. We can solve it in two ways. First, we can follow the PIT scheme, similar to the original MIMO-Speech model, to resolve the prediction-target token sequence assignment problem. This takes additional computation time. Instead, we propose to use the DOA knowledge to resolve the ambiguity in the following way. During the training stage, we can use the groundtruth DOA as input instead of the estimated DOA for computing the angle features. We determine the permutation of the target sequences based on the order of the sources in which the angle features are concatenated, and can thereby eliminate PIT.
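To make Eqs. (19)-(21) concrete, here is a numpy sketch of mask-based MVDR separation with the noise SCM set to zero, as done in this work. The small diagonal loading term is an added assumption for numerical stability, and the shapes are illustrative.

```python
import numpy as np

def mvdr_separate(Y, masks, ref=0, eps=1e-6):
    """Mask-based MVDR of Eqs. (19)-(21) with an all-zero noise SCM.

    Y: (F, T, M) multichannel STFT; masks: (N, F, T) source masks in (0, 1).
    Returns (N, F, T) separated STFT signals."""
    N, F, T = masks.shape
    M = Y.shape[-1]
    # Eq. (19): mask-weighted spatial covariance matrices, shape (N, F, M, M)
    scm = np.einsum('nft,ftm,ftp->nfmp', masks, Y, Y.conj())
    scm /= masks.sum(axis=2)[..., None, None] + eps
    u = np.zeros(M)
    u[ref] = 1.0                               # one-hot reference microphone
    X = np.zeros((N, F, T), dtype=complex)
    for n in range(N):
        intf = scm.sum(axis=0) - scm[n]        # interference SCM: sum over i != n
        for f in range(F):
            num = np.linalg.solve(intf[f] + eps * np.eye(M), scm[n, f])
            b = (num / (np.trace(num) + eps)) @ u   # Eq. (20)
            X[n, f] = Y[f] @ b.conj()               # Eq. (21): b^H y per frame
    return X
```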
4. Experiments
Table 1: The configurations used for simulating the clean & noisy 2-speaker mixtures
                             Clean 2-mix            Noisy 2-mix
Simulation corpus            WSJ0                   WSJCAM0
ASR pretraining corpus       WSJ                    WSJ
Noise corpus                 N/A                    REVERB challenge
Sampling rate                16 kHz                 16 kHz
Num. utterances              Train: 12776           Train: 7860
                             Dev: 1206              Dev: 742
                             Eval: 651              Eval: 1088
Num. RIRs                    Train: 3194            Train: 2620
                             Dev: 1206              Dev: 742
                             Eval: 651              Eval: 1088
T60                          0.15 - 0.5 s           0.25 - 0.7 s
Num. channels                6                      8
UCA radius                   5 cm                   10 cm
Source distance from array   1.5 - 3 m              1 - 2 m
SNR                          N/A                    10 - 20 dB

We simulated two types of 2-speaker mixtures: (1) clean mixtures without any additive stationary noise, simulated from the WSJ0 subset of the wall street journal (WSJ) corpus [32], the same as used in [21]; and (2) noisy mixtures from WSJCAM0 [33], which also use noise from the REVERB corpus [34]. The noisy mixtures are generated with the same array geometry as REVERB and the real MC-WSJ-AV [35] corpus. The 5k-vocabulary subset of the real overlapped data with stationary sources from MC-WSJ-AV was used just for evaluation with the models trained on the noisy mixtures. For each utterance, we mixed in another utterance from a different speaker within the same set, so the resulting simulated data is the same size as the original clean data. The SMS-WSJ [36] toolkit was used for creating the simulated data with maximum overlap. The image method [37] was used to create the room impulse responses (RIR). Room configurations with sizes (length-width-height) ranging from 5m-5m-2.6m to 11m-11m-3.4m were used while creating both types of mixtures. Both the sources and the array are always chosen to be at the same height. The other configuration details used for simulation are shown in Table 1.

For the six-microphone array, the pair list (1, 4), (2, 5), (3, 6), (1, 2), (3, 4), and (5, 6) was used to compute the IPD features defined in Eq. (11). The pair list for the eight-microphone array was (1, 5), (2, 6), (3, 7), (4, 8), (1, 3), (3, 5), (5, 7), and (7, 1). Three CNN blocks, each followed by a rectified linear unit (ReLU) activation, followed by a feedforward layer, were used as LocNet-CNN(·) defined in Eq. (1). The CNN filters are applied across the channel-frequency dimensions. The kernel shapes were dependent on the number of input channels: kernels of shapes 3 × 1, 2 × 3, and 3 × 3 were used for the 6-microphone configuration, and kernels of shapes 3 × 1, 3 × 3, and 3 × 3 for the 8-microphone configuration.
Q was fixed as 2 × ⌊360/γ⌋. One output-gate projected bidirectional long short-term memory (BLSTMP) layer with Q cells was used as LocNet-Mask(·) defined in Eq. (6).

The masking network for MIMO-Speech was designed with two BLSTMP layers with 771 cells. The encoder-decoder ASR network was based on the Transformer architecture [38] and was initialized with a pretrained model that used single-speaker training utterances from both WSJ0 and WSJ1. The same architecture as [23] was used for the encoder-decoder ASR model. It has 12 layers in the encoder and 6 layers in the decoder. Before the Transformer encoder, the log Mel filterbank features of 80 dimensions are encoded by two CNN blocks. The CNN layers have a kernel size of 3 × 3.

We use two popular subspace-based signal processing methods, MUSIC and TOPS, as baselines. For both, the spatial response is computed and the top two peaks are detected to estimate the DOA. The results of D-ASR are also shown for comparison. The average absolute cyclic angle difference between the predicted angle and the ground-truth angle in degrees is used as the metric. The permutation of the predictions with respect to the references that gives the minimum error is chosen.

Table 2: DOA absolute prediction error on both our clean and noisy simulated 2-source mixtures, comparing our proposed source-splitting methods with subspace methods, multi-label classification, & D-ASR. (Columns: Row ID, Method, γ, Loss Function, PIT, and Dev/Test errors with and without WPE on each mixture type.)

The results with and without WPE preprocessing for both the clean and noisy mixtures are given in Table 2. Most of the supervised deep learning models with different configurations perform significantly better than the subspace methods. We can see that MUSIC and TOPS are not very sensitive to the difference in angle resolution (γ). The results show that the deep learning methods are robust to reverberation and give good results even without the WPE frontend. The D-ASR [21] results in row 5 are very good on the clean mixtures, but we can see that the method is not robust to noise. The results of D-ASR on the noisy mixtures are much worse than even the subspace methods.

The results of the multi-label classification model from Section 2.1 are given in rows 6 & 7. We can see that a higher resolution of γ = 5 degrades the performance relative to γ = 10. This is because of spurious peaks at nearby angles when the resolution is increased. This shows the need for having separate output vectors for different sources.
Rows 8-19 give the results of the models with the proposed source splitting mechanism. The results for all configurations are given both with PIT and with the targets fixed in ascending order. From the results we can see that fixing the target order works better.

Rows 8-17 use the CNN-BLSTM-SS model proposed in Section 2.2 with phase feature masking. The CE loss with γ = 10 (rows 8, 9) works reasonably well and gives an improvement over the multi-label classification model. Increasing the resolution to γ = 1 (rows 10, 11) makes it poor because of its inability to learn the inter-class relationship, as explained in Section 2.4. Making the target distribution smoother with the SCE loss alleviates the problem quite well, as we can observe from the better results in row 13. Using the EMD loss makes it very robust, and it also works well with PIT training (row 14). Combining both ideas with the SEMD loss gives the best performance, with a prediction error of around 1° when PIT is not used (row 17). The results of the BLSTM-SS model (row 19) are a bit worse on the noisy mixtures compared to the CNN-BLSTM-SS model.

The speech recognition performance of our proposed DOA integration compared with the vanilla MIMO-Speech baseline is given in Table 3. Word error rate (WER) is used as the metric for ASR, and the results are shown for both clean and noisy mixtures. The results of the clean single-speaker data with the pretrained ASR model are given in row 1. Note that the results of the clean single-speaker data used for the noisy mixtures are not very good because the data is from the British English corpus WSJCAM0 while the pretrained model was trained with American English data from WSJ. This accent difference is fixed in the mixed-speech models, as the ASR network is fine-tuned in that case with British English data. The ASR results of the simulated mixtures using the single-speaker model are also shown (WER more than 100% because of too many insertion errors).

An oracle experiment, in which the reference ideal binary masks (IBM) are given directly to the beamformer, is shown in row 3. We can see that D-ASR works quite well on the clean mixtures, as shown in [21], but it is not robust to noise.

Table 3: ASR performance on the simulated clean & noisy 2-speaker mixtures comparing our proposed DOA integration method with vanilla MIMO-Speech. Word error rate (WER) is used as the metric for comparison. For WER, lower the better. (Columns: Row ID, DOA Method, DOA PIT, IPD for ASR, ASR PIT, and Dev/Test WER on the clean and noisy mixtures.)
Row 5 shows the results of the baseline MIMO-Speech with the architecture originally proposed in [23]. Row 6 shows the modified baseline obtained by adding IPD features and a channel-independent masking network, as described in Section 3. This makes the baseline stronger and gives slightly better results. Oracle ASR-DOA integration experiments were performed by using the groundtruth DOA during inference as a proof of concept, and the results are shown in rows 7 & 8. We can see a very significant improvement in this case, and for the noisy mixtures the word error rates are reduced by a factor of two. The results are also slightly better than using the oracle binary masks. This proves the importance of feeding localization knowledge to a multi-speaker ASR system. We can also see that turning off PIT and fixing the target sequence order based on the DOA order gives similar performance (row 8) with added benefits. This not only saves computation time during training but also makes the inference more informative by associating each DOA with its corresponding transcription.

In rows 9-15, some of the DOA estimation methods are used to compute the angle features. All the methods here are high resolution and use γ = 1. We can see that using MUSIC and TOPS degrades the ASR performance from the baseline. This shows that localization knowledge will be useful only if we can estimate it with good precision and reliability. The results with the proposed deep learning based DOA estimation and the SEMD loss are shown in rows 11-15. With any of these methods we get results close to using the oracle DOA. We can see that the CNN-BLSTM-SS model, which gives the best localization performance in Table 2, also gives the best results here (row 12). This result almost matches the performance obtained with the oracle masks in row 3.

In this section, the overlapped data from the MC-WSJ-AV corpus [35] is used for evaluation. The positions of the speakers are stated to be stationary throughout each utterance. There are six possible positions in the room at which the speakers were placed. As there are two speakers in the mixture, there are fifteen possible position pairs. The real data has recordings from two arrays with the same configuration placed at different positions. From the schematic diagram given in [35], four of the six positions are very close to array-1, and all positions are generally far from array-2. The model trained with the simulated noisy mixtures is used here, as they follow similar configurations. Although trained with simulated data, it is crucial to see how our methods work on real data. As there are no groundtruth DOA labels, we cannot evaluate the angle prediction error.

The DOA estimation here is performed with γ = 1 and the CNN-BLSTM-SS model without PIT, trained with the SEMD loss.
Word error rate (WER) is used as themetric for comparison. For WER, lower the better.
DOA Method Array-1 Array-2 Combination
MIMO-Speech N/A 34.3 47.6 N/AMIMO-Speech w/ IPD N/A 29.6 44.9 N/AMIMO-Speech + DOA Integration MUSIC 11.4 31.0 25.2MIMO-Speech + DOA Integration TOPS 10.9 32.9 27.1MIMO-Speech + DOA Integration CNN-BLSTM-SS 14.0 18.2 11.7with k = 15 on the estimated DOA pairs of all utterances. The pairs were ordered in an ascending ordersuch that the lower angle comes first. The visualization of the clusters (cross marks are cluster centers)for both arrays obtained with the proposed method and TOPS are shown in Figure 3. From the schematicdiagram of the recording setup given in [35] we calculated approximate source positions and they are alsomarked as black circles. There seems to be a reasonable level of agreement between the estimated positionsfrom localization and the approximate groundtruths.The ASR performance is shown in Table 4. Using estimated DOAs from either the subspace method orthe CNN-BLSTM-SS model, ASR-DOA integration significantly outperforms the MIMO-Speech baselinesfor this data. This further proves the importance of localization as a very pivotal frontend for ASR. Fromthe results we can see the positions of the sources are chosen to be favorable to array-1 so it gives betterresults. Using the subspace methods outperforms the CNN-BLSTM-SS model for array-1. Whereas forthe challenging array-2 data, the proposed CNN-BLSTM-SS model significantly outperforms the subspacemethods.We perform a combination with an array selection scheme. The scheme is based on the intuition that it isnot good to use an array when the sources have a very small angle difference from its perspective. Array-1 isgiven preference during the selection but if the angle difference between the sources is less than 10 ◦ , array-2data is selected. We can see the selection mechanism to be effective with CNN-BLSTM-SS model but stillit is slighlty worse than the Array-1 results of the subspace methods. The selection mechanism doesn’t helpwith subspace methods as its array-2 estimation is quite bad.
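For reference, a minimal sketch of the clustering step might look as follows; the random data stands in for the estimated per-utterance DOA pairs, and scikit-learn's KMeans is an assumed tooling choice rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the estimated (ascending) DOA pair of each utterance, in degrees.
rng = np.random.default_rng(0)
doa_pairs = np.sort(rng.uniform(0.0, 360.0, size=(200, 2)), axis=1)

# k = 15: one cluster per unordered pair of the six possible room positions.
kmeans = KMeans(n_clusters=15, n_init=10, random_state=0).fit(doa_pairs)
print(kmeans.cluster_centers_)   # compare against approximate groundtruth pairs
```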
5. Conclusion & Future Work
We proposed a novel deep learning based model for multi-source localization that can classify DOAs at a high resolution. An extensive evaluation was performed with different choices of architectures, loss functions, classification resolutions, and training schemes that handle the permutation ambiguity. Our source splitting model was shown to have a significantly lower prediction error compared to both the multi-label classification model and the subspace methods. We also proposed a soft earth mover distance (SEMD) loss function for the localization task that models the inter-class relationship well for DOA estimation and hence predicts near-perfect DOA.

We have also devised a method to use the proposed DOA estimation as a frontend for multi-talker ASR. This integration greatly helps speech recognition and shows the importance of using localization priors with far-field ASR models. Based on this finding, ASR performance was used as the metric of evaluation for DOA estimation on real data, where DOA labels are not available to compute prediction errors.

In the future, we would like to extend our methods to also handle source counting within the model, to make it work for an arbitrary number of sources. One possible approach would be to use a conditional chain model, which is popular for source separation [44, 45]. The other important extension is adapting our method to work on more challenging and realistic data like CHiME-6 [46], which involves a complicated distributed array setup with moving sources.
References

[1] K. Nakadai, T. Takahashi, H. Okuno, H. Nakajima, Y. Hasegawa, H. Tsujino, Design and implementation of robot audition system 'HARK' - open source software for listening to three simultaneous speakers, Advanced Robotics 24 (5-6) (2010) 739-761.
[2] K. Nakadai, K. Hidai, H. Mizoguchi, H. Okuno, H. Kitano, Real-time auditory and visual multiple-object tracking for humanoids, in: International Joint Conferences on Artificial Intelligence (IJCAI), 2001, pp. 1425-1432.
[3] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang, A. Hurvitz, L. Jiang, S. Koubi, E. Krupka, I. Leichter, C. Liu, P. Parthasarathy, A. Vinnikov, L. Wu, X. Xiao, W. Xiong, H. Wang, Z. Wang, J. Zhang, Y. Zhao, T. Zhou, Advances in online audio-visual meeting transcription, in: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 276-283.
[4] J. Barker, S. Watanabe, E. Vincent, J. Trmal, The fifth CHiME speech separation and recognition challenge: Dataset, task and baselines, in: Interspeech, 2018, pp. 1561-1565.
[5] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister, M. L. Seltzer, H. Zen, M. Souden, Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, IEEE Signal Processing Magazine 36 (6) (2019) 111-124.
[6] R. Schmidt, Multiple emitter location and signal parameter estimation, IEEE Transactions on Antennas and Propagation 34 (3) (1986) 276-280.
[7] R. Roy, T. Kailath, ESPRIT - estimation of signal parameters via rotational invariance techniques, IEEE Transactions on Acoustics, Speech, and Signal Processing 37 (7) (1989) 984-995.
[8] E. D. Di Claudio, R. Parisi, WAVES: weighted average of signal subspaces for robust wideband direction finding, IEEE Transactions on Signal Processing 49 (10) (2001) 2179-2191.
[9] Y.-S. Yoon, L. M. Kaplan, J. H. McClellan, TOPS: new DOA estimator for wideband signals, IEEE Transactions on Signal Processing 54 (6) (2006) 1977-1989.
[10] J. DiBiase, A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays, PhD Thesis, Brown University.
[11] S. Chakrabarty, E. A. Habets, Multi-speaker DOA estimation using deep convolutional networks trained with noise signals, IEEE Journal of Selected Topics in Signal Processing 13 (1) (2019) 8-21.
[12] R. Takeda, K. Komatani, Sound source localization based on deep neural networks with directional activate function exploiting phase information, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016, pp. 405-409.
[13] F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, F. Piazza, A neural network based algorithm for speaker localization in a multi-room environment, in: 26th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2016, pp. 1-6.
[14] T. Hirvonen, Classification of spatial audio location and content using convolutional neural networks, in: 138th Audio Engineering Society Convention, Vol. 2, 2015.
[15] N. Yalta, K. Nakadai, T. Ogata, Sound source localization using deep learning models, Journal of Robotics and Mechatronics 29 (2017) 37-48.
[16] S. Chakrabarty, E. A. Habets, Broadband DOA estimation using convolutional neural networks trained with noise signals, in: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 136-140.
[17] S. Adavanne, A. Politis, T. Virtanen, Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network, in: European Signal Processing Conference (EUSIPCO), 2018, pp. 1462-1466.
[18] W. He, P. Motlicek, J. Odobez, Deep neural networks for multiple speaker detection and localization, in: IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 74-79.
[19] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, Y. Gong, Multi-channel overlapped speech recognition with location guided speech extraction network, in: IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 558-565.
[20] A. S. Subramanian, C. Weng, M. Yu, S. Zhang, Y. Xu, S. Watanabe, D. Yu, Far-field location guided target speech extraction using end-to-end speech recognition objectives, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020, pp. 7299-7303.
[21] A. S. Subramanian, C. Weng, S. Watanabe, M. Yu, Y. Xu, S.-X. Zhang, D. Yu, Directional ASR: A new paradigm for e2e multi-speaker speech recognition with source localization (2020). arXiv:2011.00091.
[22] X. Chang, W. Zhang, Y. Qian, J. Le Roux, S. Watanabe, MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition, in: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 237-244.
[23] X. Chang, W. Zhang, Y. Qian, J. Le Roux, S. Watanabe, End-to-end multi-speaker speech recognition with transformer, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020, pp. 6134-6138.
[24] D. Yu, M. Kolbæk, Z. Tan, J. Jensen, Permutation invariant training of deep models for speaker-independent multi-talker speech separation, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 241-245.
[25] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-Vectors: Robust DNN embeddings for speaker recognition, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5329-5333.
[26] K. Veselý, S. Watanabe, K. Žmolíková, M. Karafiát, L. Burget, J. H. Černocký, Sequence summarizing neural network for speaker adaptation, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016, pp. 5315-5319.
[27] E. Levina, P. Bickel, The earth mover's distance is the Mallows distance: some insights from statistics, in: IEEE International Conference on Computer Vision (ICCV), Vol. 2, 2001, pp. 251-256.
[28] L. Hou, C.-P. Yu, D. Samaras, Squared earth mover's distance-based loss for training deep neural networks, NeurIPS Workshop - Learning on Distributions, Functions, Graphs and Groups.
[29] F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, A comprehensive study of speech separation: Spectrogram vs waveform separation, in: Interspeech, 2019, pp. 4574-4578.
[30] M. Souden, J. Benesty, S. Affes, On optimal frequency-domain multichannel linear filtering for noise reduction, IEEE Transactions on Audio, Speech, and Language Processing 18 (2) (2010) 260-276.
[31] S. Kim, T. Hori, S. Watanabe, Joint CTC-attention based end-to-end speech recognition using multi-task learning, in: ICASSP, 2017, pp. 4835-4839.
[32] D. B. Paul, J. M. Baker, The design for the wall street journal-based CSR corpus, in: Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, 1992, pp. 357-362.
[33] T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, 1995, pp. 81-84.
[34] K. Kinoshita, M. Delcroix, S. Gannot, E. A. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, et al., A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research, EURASIP Journal on Advances in Signal Processing.
[35] M. Lincoln, I. McCowan, J. Vepa, H. K. Maganti, The multi-channel wall street journal audio visual corpus (MC-WSJ-AV): specification and initial experiments, in: IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, pp. 357-362.
[36] L. Drude, J. Heitkaemper, C. Boeddeker, R. Haeb-Umbach, SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition, arXiv preprint arXiv:1910.13934.
[37] J. B. Allen, D. A. Berkley, Image method for efficiently simulating small-room acoustics, The Journal of the Acoustical Society of America 65 (4) (1979) 943-950.
[38] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, S. Watanabe, T. Yoshimura, W. Zhang, A comparative study on Transformer vs RNN in speech applications, in: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 449-456.
[39] T. Hori, J. Cho, S. Watanabe, End-to-end speech recognition with word-based RNN language models, in: IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 389-396.
[40] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, T. Ochiai, ESPnet: End-to-end speech processing toolkit, in: Interspeech, 2018, pp. 2207-2211.
[41] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B. Juang, Speech dereverberation based on variance-normalized delayed linear prediction, IEEE Transactions on Audio, Speech, and Language Processing 18 (7) (2010) 1717-1731.
[42] L. Drude, J. Heymann, C. Boeddeker, R. Haeb-Umbach, NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing, in: ITG Fachtagung Sprachkommunikation, 2018.
[43] R. Scheibler, E. Bezzam, I. Dokmanić, Pyroomacoustics: A Python package for audio room simulation and array processing algorithms, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 351-355.
[44] K. Kinoshita, L. Drude, M. Delcroix, T. Nakatani, Listening to each speaker one by one with recurrent selective hearing networks, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5064-5068.
[45] J. Shi, X. Chang, P. Guo, S. Watanabe, Y. Fujita, J. Xu, B. Xu, L. Xie, Sequence to multi-sequence learning via conditional chain mapping for mixture signals, Advances in Neural Information Processing Systems.
[46] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, D. Snyder, A. S. Subramanian, J. Trmal, B. B. Yair, C. Boeddeker, Z. Ni, Y. Fujita, S. Horiguchi, N. Kanda, T. Yoshioka, N. Ryant, CHiME-6 challenge: tackling multispeaker speech recognition for unsegmented recordings (2020). arXiv:2004.09249.