Nonnegative Tensor Factorization for Directional Blind Audio Source Separation
Noah D. Stein
Analog Devices | Lyric Labs: Cambridge, MA [email protected]
Abstract
We augment the nonnegative matrix factorization method for audio source separation with cues about directionality of sound propagation. This improves separation quality greatly and removes the need for training data, with only a twofold increase in run time. This is the first method which can exploit directional information from microphone arrays much smaller than the wavelength of sound, working both in simulation and in practice on millimeter-scale microphone arrays.
1 Introduction

Nonnegative matrix factorization (NMF) has proven to be an effective method for audio source separation [12]. We guide NMF to identify discrete sources by providing cues: direction of arrival (DOA) estimates computed from a microphone array for each time-frequency bin. We form a (potentially sparse) frequency × time × direction tensor X indicating the distribution of energy in the soundscape and jointly solve for all the sources by finding tensors B (direction distribution per source), W (spectral dictionary per source), and H (time activations per dictionary element) to fit

    X(f, t, d) ≈ Σ_{s,z} B(d, s) W(f, z, s) H(t, z, s),

where s and z index sources and dictionary subcomponents thereof. The crucial fact that B does not depend on z forces the portions of the spectrogram explained by the dictionary for a given source to be collocated in space: if B depended on z in addition to s like W and H do, then we could swap dictionary components between sources without affecting the sum and the model would have no power to isolate and separate cohesive sources of sound.

Advantages of our approach include:

• Perceived separation quality: better than NMF;
• No supervision: no clean audio examples needed;
• Low overhead: computation on the same order as NMF;
• Suitability for small arrays: usable DOA estimates can be obtained from arrays which are much smaller than the wavelengths of audible sounds and for which beamforming fails. We have tested this on millimeter-scale arrays; see Section 4.

We focus on small arrays to enable applications where industrial design constraints make it expensive to bring distant signals together for processing. Closer microphones make more integration possible, lowering cost and allowing for devices like smart watches whose sizes limit spacing. In real world use cases training data is often not available, especially for interferers, hence the desire for an unsupervised approach.
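To make the shapes concrete, the model's reconstruction of X from B, W, and H can be written as a single tensor contraction. The sizes below are illustrative assumptions for this sketch only, not settings from our experiments:

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch only).
F, T, D, S, Z = 6, 8, 4, 2, 3

rng = np.random.default_rng(0)
B = rng.random((D, S))     # direction distribution per source
W = rng.random((F, Z, S))  # spectral dictionary per source
H = rng.random((T, Z, S))  # time activations per dictionary element

# X_hat(f, t, d) = sum over s, z of B(d, s) W(f, z, s) H(t, z, s)
X_hat = np.einsum('ds,fzs,tzs->ftd', B, W, H)
print(X_hat.shape)  # (6, 8, 4)
```

Fitting B, W, and H to a measured X is the subject of Section 3.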
In the end the perceived quality is better than the supervised version with longer but reasonable running time.

Previous work does not address the close microphone case; the conventional wisdom is that microphones separated by much less than the half wavelength required for beamforming provide no usable directional information. Linear unmixing methods like ICA [13] fail because close spacing makes the mixing matrices ill-conditioned. DUET [17] does not apply because there are essentially no amplitude differences between the microphone signals at this scale; a phase-only version relies too heavily on mixture-corrupted phase values and gives poor quality. Our method achieves good quality with necessarily poor direction information by enforcing the extra structure of an NMF decomposition for each source. NMF and variants are common for audio source separation, but here we extend them to solve a problem previously assumed hopeless.

Several approaches to extending NMF-like techniques to multichannel audio appear in the literature. FitzGerald, Cranitch, and Coyle apply a form of nonnegative tensor factorization (NTF) to the frequency × time × channel tensor produced by stacking the spectrograms for all channels [4]. This approach has two drawbacks. First, each dictionary element in the decomposition has its own gain to each channel, so a post-NTF clustering phase is needed to group dictionary elements into sources. Second, it relies on level differences between channels to distinguish sources, so it only applies to widely spaced microphones or artificial mixtures.

Ozerov and Févotte address the first drawback by constraining the factorization so the gains from source to channel are shared between the dictionary elements which comprise each source [10].
Lee, Park, and Sung use a similar factorization and confront the second drawback by stacking spectrograms from beamformers in place of raw channels [9]. While independent from our work, [9] can be viewed as an instance of the general method we present in Section 3.1 below, but the former still requires microphone spacing wide enough for beamforming. Furthermore the computational resources required are proportional to the number of beamformers used, so for good spatial resolution the cost may be high. In [9] all experiments use beamformers pointed north, south, east, and west, which may provide poor separation when the sources of interest are not axis-aligned.

Also related is the paper [15] on Directional NMF, which considers factoring the steered response power (SRP), a measure of the energy detected by a beamformer as a function of direction, time and frequency. This three-dimensional tensor is flattened into a {discretized directions} × {spectrogram bins} matrix, to which NMF is applied with inner dimension equal to the number of sources sought. Again this approach suffers in the face of closely-spaced microphones; furthermore it does not model any of the structure expected within reasonable audio spectrograms. We compare our method experimentally to a variant of this in Section 4.

This paper is organized as follows. Section 2 covers basic NMF and a simple DOA estimator. Our main contribution is introduced in Section 3.1 and efficient implementation is explored in Sections 3.2 and 3.3. We show experimental results in Section 4 and close with conclusions in Section 5.

2 Background

The material in this section is standard in audio source separation and could likely be skipped by an expert. We present NMF here in equivalent probabilistic rather than matrix language [2] to make it easy to extend to the new algorithm we introduce in Section 3 and ease comparison of algorithmic steps, convergence proofs, and run time.
This also gives an opportunity to make the seemingly new observation that Gibbs' inequality provides the optimizer in the M step of the EM algorithm for NMF. Doing so we avoid Lagrange multiplier computations, which in any case cannot tell a maximizer from a minimizer.
2.1 Nonnegative matrix factorization

NMF is a technique to factor an elementwise nonnegative matrix X ∈ R≥0^(F×T) as X ≈ WH with W ∈ R≥0^(F×Z) and H ∈ R≥0^(Z×T). The fixed inner dimension Z ≪ F, T controls model complexity. This technique is often applied to a time-frequency representation X (e.g. a magnitude spectrogram) of an audio signal [12].

We use probabilistic language, so we normalize and in place of X take a given probability distribution p_obs(f, t) to decompose as p_obs(f, t) ≈ Σ_z q(f, t, z), where q is factored into either of the equivalent forms

    q(f, t, z) := q(f, z) q(t | z) = q(f | z) q(t, z),    (1)

as in Figure 1(a). The values of z index a dictionary of prototype spectra q(f | z) which combine according to the time activations q(t, z).

[Figure 1: Graphical models for the factorizations in this paper: (a) basic NMF (Section 2.1); (b) NMF for source separation (Section 2.3); (c) NTF with directionality (Section 3.1). The input data is a joint distribution over the shaded variables.]

We seek to maximize the cross entropy

    α(q) := Σ_{f,t} p_obs(f, t) log q(f, t) = Σ_{f,t} p_obs(f, t) log Σ_z q(f, t, z)

for simplicity of exposition, though other objectives may provide better performance [7].

We use Minorization-Maximization, a generalization of EM [6], to iteratively optimize over q. Fix a factored distribution q̄. The essential step is to find a factored distribution q with higher cross entropy. Jensen's inequality on the logarithm gives

    Σ_z q̄(z | f, t) log [q(f, t, z) / q̄(z | f, t)] ≤ log Σ_z q̄(z | f, t) [q(f, t, z) / q̄(z | f, t)] = log Σ_z q(f, t, z)

for all f, t, and q. When q = q̄ we have equality, since all the terms being averaged are equal. Substituting into α(q) gives

    β(q) := Σ_{f,t,z} p_obs(f, t) q̄(z | f, t) log [q(f, t, z) / q̄(z | f, t)] ≤ α(q),

again with equality at q = q̄ (β is said to minorize α at q̄). If q* maximizes β then α(q*) ≥ β(q*) ≥ β(q̄) = α(q̄).

The denominator in β only contributes an additive constant, so we can equivalently maximize the cross entropy γ between r(f, t, z) := p_obs(f, t) q̄(z | f, t) and q(f, t, z):

    γ(q) := Σ_{f,t,z} r(f, t, z) log q(f, t, z) = Σ_z r(z) Σ_f r(f | z) log q(f | z) + Σ_{t,z} r(t, z) log q(t, z).

Though our original goal was to maximize a cross entropy, we have made progress in the senses that (a) there is no longer a sum inside the logarithm and (b) we have decoupled the terms involving q(f | z) and q(t, z).
These can be chosen independently and arbitrarily while maintaining the factored form q(f, t, z) = q(f | z) q(t, z).

To find the optimal q we apply Gibbs' inequality: for a fixed probability distribution σ, the probability distribution τ which maximizes the cross entropy Σ_u σ(u) log τ(u) is τ = σ. Therefore we maximize γ by choosing q(f | z) := r(f | z) and q(t, z) := r(t, z). Typically q(f, t, z) ≠ r(f, t, z); rather q(f, t, z) is a product of a marginal and a conditional of r, which itself might not factor.

These updates can be viewed as alternating projections. We seek a distribution q(f, t, z) which (a) factors and (b) has marginal q(f, t) close to p_obs(f, t). We begin with q̄(f, t, z) which satisfies (a) but not (b) and modify it to get r(f, t, z) = p_obs(f, t) q̄(z | f, t), which gives (b) exactly but destroys (a). Then we multiply a marginal and conditional of r(f, t, z) to get q(f, t, z), sacrificing (b) to satisfy (a), and repeat.

2.2 Multiplicative updates for NMF

These iterations can be computed efficiently via the celebrated multiplicative updates [8]. To compute

    q(t, z) = Σ_f r(f, t, z) = Σ_f p_obs(f, t) q̄(z | f, t) = Σ_f p_obs(f, t) q̄(f | z) q̄(t, z) / q̄(f, t)
            = q̄(t, z) Σ_f [p_obs(f, t) / q̄(f, t)] q̄(f | z) = q̄(t, z) Σ_f ρ(f, t) q̄(f | z),

where ρ(f, t) := p_obs(f, t) / q̄(f, t): matrix multiply to find q̄(f, t) = Σ_{z'} q̄(f | z') q̄(t, z'), elementwise divide to get ρ(f, t), matrix multiply for Σ_f ρ(f, t) q̄(f | z), and elementwise multiply by q̄(t, z). Reuse ρ(f, t) to compute q(f, z) analogously and condition to get q(f | z).

This method avoids storing any F × T × Z arrays, such as r(f, t, z) or q̄(z | f, t).
Indeed, this implementation uses Θ(FT + FZ + TZ) memory (proportional to output plus input) total, but Θ(FTZ) arithmetic operations per iteration.

2.3 NMF for supervised source separation

Though our focus is on unsupervised methods, we recall for comparison how to use NMF to separate S audio sources. We decompose the mixture as a weighted sum over sources s per Figure 1(b):

    p_obs(f, t) ≈ Σ_s q(s) Σ_z q(f | z, s) q(t, z | s).    (2)

Mathematically, this is NMF with inner dimension ZS. Fitting such a model as in Section 2.1 gives no separation: the cross entropy objective is invariant to swapping dictionary elements between sources, so it does not encourage all dictionary elements of a modeled source to correspond to a single physical source.

A typical workaround uses training data p̂_obs,s(f, t) corresponding to recordings of sounds typical of each of the sources alone. We apply NMF to these for each s separately, learning a representative dictionary q̂_s(f | z). The factor q̂_s(t, z) represents when and how active these dictionary elements are in the training data, so it is discarded as irrelevant for separation (this does lose information about time evolution of activations).

We apply NMF again to learn (2) with the twist that we fix q(f | z, s) := q̂_s(f | z) for all iterations, so the model cannot freely swap dictionary elements between sources. Instead of improving both terms in γ(q) we improve one and leave the other fixed; the cross entropy still increases. After NMF, q(s | f, t) gives a measure of the contribution from source s in each time-frequency bin.

A common use case is when p_obs(f, t) is a normalized spectrogram, computed as the magnitude of a Short-Time Fourier Transform (STFT). It is typical to approximately reconstruct the time-domain audio of a separated source s by multiplying the magnitude component p_obs(f, t) q(s | f, t) with the phase of the mixture STFT, then taking the inverse STFT.
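A sketch of that masking step, assuming the soft mask q(s | f, t) has already been computed. The complex mixture STFT X below is random stand-in data, and the STFT/iSTFT themselves are elided:

```python
import numpy as np

rng = np.random.default_rng(2)
F, T, S = 5, 7, 2

# Stand-in complex mixture STFT (in practice, from an STFT of the recording).
X = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))

# Stand-in joint q(s, f, t); in the algorithm this comes from the learned factors.
q_sft = rng.random((S, F, T))
q_s_given_ft = q_sft / q_sft.sum(axis=0, keepdims=True)  # soft mask q(s | f, t)

# Per-source STFT estimate: masked mixture magnitude with the mixture's phase.
X_sources = q_s_given_ft * np.abs(X) * np.exp(1j * np.angle(X))

# The masks sum to one over s, so the per-source estimates sum back to X.
print(np.allclose(X_sources.sum(axis=0), X))  # True
```

Taking the inverse STFT of each slice of X_sources yields the separated time-domain signals.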
Considering the outputs to be the mask q(s | f, t) or reconstructed time-domain audio, separation takes Θ(FTS + FZS + TZS) memory total and Θ(FTZS) arithmetic operations per iteration.

2.4 DOA estimation

Estimating DOA at an array is a well-studied problem [1] and various methods can be used in the framework of Section 3. We use the least squares method (perhaps the simplest) for estimating a DOA at each time-frequency bin [14]. We take as given the STFTs of audio signals recorded at each of M microphones. The same procedure is applied to all bins, so we focus on a single bin and its STFT values Y_1, ..., Y_M.

Assume this bin is dominated by a point source far enough away to appear as a plane wave and the array is small enough to avoid wrapping the phases ∠Y_i. Letting x_i denote the position of microphone i and k the wave vector, we have

    ∠Y_i − ∠Y_1 = (x_i − x_1) · k.

We solve these linear equations for k in a least squares sense. The direction of k serves as a DOA estimate for the chosen bin. The coefficients of k are fixed by the geometry, so the least squares problems for all time-frequency bins can be solved with a single pseudoinverse at design time and a small matrix multiplication for each bin thereafter.

3 Directional NTF

3.1 The model

This section is parallel to Section 2.1, except instead of a matrix p_obs(f, t) we take as given an array or tensor p_obs(f, t, d), interpreted as a distribution over time, frequency, and DOA quantized to a finite domain of size D. In practice we take p_obs(f, t, d) = p_obs(f, t) p_obs(d | f, t), where p_obs(f, t) is again a normalized spectrogram and p_obs(d | f, t) is an estimate of direction per time-frequency bin. There is little amplitude variation between the closely-spaced microphones, so we can derive p_obs(f, t) from any of them; spatial diversity is captured by p_obs(d | f, t).
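The least-squares estimator of Section 2.4 can be sketched as follows, under the stated plane-wave and no-phase-wrap assumptions. The array geometry and the synthetic wave vector below are made up for illustration (only the direction of k matters here, not its magnitude):

```python
import numpy as np

# Microphone positions in meters, millimeters apart; x[0] is the reference.
x = np.array([[0.0, 0.0], [2e-3, 0.0], [0.0, 2e-3], [2e-3, 2e-3]])
A = x[1:] - x[0]                  # geometry matrix: rows are x_i - x_1
A_pinv = np.linalg.pinv(A)        # computed once, at design time

def doa_estimate(Y):
    """Least-squares azimuth from one bin's STFT values Y_1..Y_M."""
    phase_diffs = np.angle(Y[1:]) - np.angle(Y[0])
    k = A_pinv @ phase_diffs      # solve (x_i - x_1) · k = ∠Y_i - ∠Y_1
    return np.arctan2(k[1], k[0])  # direction of k as the DOA estimate

# Synthetic check: a plane wave from azimuth 30° induces phases x_i · k_true.
k_true = 5.0 * np.array([np.cos(np.pi / 6), np.sin(np.pi / 6)])
Y = np.exp(1j * (x @ k_true))
print(np.degrees(doa_estimate(Y)))  # ≈ 30.0
```

The pseudoinverse is shared by all bins, so the per-bin cost is one small matrix-vector product.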
For efficiency we choose p_obs(d | f, t) to place all weight on the DOA estimated as in Section 2.4, but we could also use e.g. the normalized output of a family of D beamformers as in [15].

We fit p_obs(f, t, d) ≈ Σ_{s,z} q(f, t, d, z, s) for the factorization

    q(f, t, d, z, s) := q(s) q(f | s, z) q(t, z | s) q(d | s) = q(d, s) q(f | s, z) q(t, z | s),    (3)

represented in Figure 1(c). A distribution q(d | s) rather than a fixed DOA per source allows for noise, slight movements of sources, and modeling error.

Crucially, (3) forces q(d | s, z) = q(d | s) not to depend on z: dictionary elements corresponding to the same source explain energy coming from the same direction. In particular, cross entropy on this model is not invariant to swapping dictionary elements between sources. If the model tried to account for multiple physical sources within a single source s by choosing a multimodal q(d | s), the cross entropy would be low because some dictionary elements for s would not have energy at some modes. We can thus hope to learn the model (3) from p_obs(f, t, d) alone, without training data.

We use Minorization-Maximization to fit (3) to p_obs(f, t, d), just as we did to fit (1) to p_obs(f, t) in Section 2.1. The same argument leads us to begin with a factored model

    q̄(f, t, d, z, s) := q̄(d, s) q̄(f | s, z) q̄(t, z | s),

force the desired marginal to obtain

    r(f, t, d, z, s) := p_obs(f, t, d) q̄(z, s | f, t, d),

and return to factored form by computing conditionals of r:

    q(d, s) := r(d, s),   q(f | s, z) := r(f | s, z),   q(t, z | s) := r(t, z | s).

We iterate, then compute the soft mask q(s | f, t) as in Section 2.3. Without training data the correspondence between learned sources and sources in the environment is unknown a priori, but the learned factors q(d | s) can help disambiguate.
3.2 Multiplicative updates for Directional NTF

As in Section 2.2 we can turn these equations into multiplicative updates and in the process reduce the resource requirements. The savings come from ordering and factoring the operations appropriately as well as expressing tensor operations in terms of matrix multiplications (of course to multiply matrices A and B we would not waste memory computing the full three-dimensional tensor A_ij B_jk before summing; when possible we avoid this for tensors as well). For example, we can calculate

    q(d, s) = Σ_{f,t,z} r(f, t, d, z, s)
            = q̄(d, s) Σ_{f,t} [p_obs(f, t, d) / Σ_{s'} q̄(d, s') q̄(f, t | s')] q̄(f, t | s)
            = q̄(d, s) Σ_{f,t} [p_obs(f, t, d) / q̄(f, t, d)] q̄(f, t | s)
            = q̄(d, s) Σ_{f,t} ρ(f, t, d) q̄(f, t | s),    (4)

where ρ(f, t, d) := p_obs(f, t, d) / q̄(f, t, d), as follows. Compute q̄(f, t | s); for each s this is an F × Z times Z × T matrix multiplication. Then compute the denominator q̄(f, t, d) as a D × S times S × FT matrix multiplication. Divide p_obs(f, t, d) elementwise by the result and call this ρ(f, t, d). Compute the remaining sum as a D × FT times FT × S matrix multiplication. Multiply by q̄(d, s) elementwise to get q(d, s).

Reusing ρ, similar computations yield q(f | z, s) and q(t, z | s). The total memory required is Θ(FTS + FTD + FZS + TZS), again proportional to the memory required to store the input and output (both the factorization and the mask q(s | f, t) are here considered part of the output). The number of arithmetic operations used at each iteration is Θ(FTS(D + Z)). So in addition to never having to allocate memory for any of the size FTDZS arrays referred to in Section 3.1, we do not even have to explicitly compute all their elements individually.
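The update (4) organized this way can be sketched with einsum standing in for the batched matrix multiplications (sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
F, T, D, S, Z = 6, 7, 4, 2, 3

p_obs = rng.random((F, T, D)); p_obs /= p_obs.sum()

# Current factors: q(d, s), q(f | z, s), q(t, z | s).
q_ds = rng.random((D, S)); q_ds /= q_ds.sum()
q_f_zs = rng.random((F, Z, S)); q_f_zs /= q_f_zs.sum(axis=0)
q_tz_s = rng.random((T, Z, S)); q_tz_s /= q_tz_s.sum(axis=(0, 1))

# q(f, t | s): for each s, an F x Z times Z x T matrix multiplication.
q_ft_s = np.einsum('fzs,tzs->fts', q_f_zs, q_tz_s)

# Denominator q(f, t, d) = sum_s q(d, s) q(f, t | s): D x S times S x FT.
q_ftd = np.einsum('ds,fts->ftd', q_ds, q_ft_s)

rho = p_obs / q_ftd                                 # rho(f, t, d)

# Remaining sum: D x FT times FT x S, then an elementwise multiply by q(d, s).
new_q_ds = q_ds * np.einsum('ftd,fts->ds', rho, q_ft_s)
print(new_q_ds.shape)  # (4, 2)
```

No five-way F × T × D × Z × S array is ever materialized; the largest intermediates are F × T × D and F × T × S.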
3.3 Sparse Directional NTF

Suppose all the mass in each time-frequency bin is assigned to a single direction d(f, t) (e.g. the output of Section 2.4, discretized), so p_obs(d | f, t) = δ(d = d(f, t)) in terms of the Kronecker δ. The input (p_obs(f, t) and d(f, t)) then has size Θ(FT) and the implementation simplifies further.

Since r(f, t, d, z, s) is only nonzero when d = d(f, t), we only need to compute the denominator q̄(f, t, d) of (4) for d = d(f, t). To do this, we compute q̄(f, t | s) as before and then evaluate q̄(d, s) at d = d(f, t), yielding another F × T × S tensor q̄(d(f, t), s). Summing the elementwise product q̄(f, t | s') q̄(d(f, t), s') over s' yields the F × T array q̄(f, t, d(f, t)).

Instead of defining ρ(f, t, d) as in Section 3.2 we define ρ(f, t) := p_obs(f, t) / q̄(f, t, d(f, t)), so

    r(f, t, d, z, s) = ρ(f, t) δ(d = d(f, t)) q̄(d, s) q̄(f | z, s) q̄(t, z | s).

Marginalizing, we get:

    q(d, s) = Σ_{f,t,z} r(f, t, d, z, s) = q̄(d, s) Σ_{f,t : d(f,t) = d} ρ(f, t) q̄(f, t | s),

which can now be computed naively in Θ(FTS) memory and arithmetic operations. For

    q(f, z, s) = Σ_{t,d} r(f, t, d, z, s) = q̄(f | z, s) Σ_t ρ(f, t) q̄(d(f, t), s) q̄(t, z | s)

we multiply ρ(f, t) by q̄(d(f, t), s), which takes Θ(FTS) memory and operations, then compute the sum over t as S matrix multiplications of size F × T times T × Z. This takes Θ(FZS) memory and Θ(FTZS) operations. Then we multiply elementwise by q̄(f | z, s) and condition to get q(f | z, s). The computation of q(t, z | s) is similar.

The resource requirements are Θ(FTS + FZS + TZS) memory total and Θ(FTZS) arithmetic operations per iteration. This is the same order as supervised NMF (Section 2.3), though these are apples and oranges: one uses direction information and the other uses clean audio training data.
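A sketch of the sparse q(d, s) update under these assumptions; fancy indexing evaluates q̄(d(f, t), s) and np.add.at performs the grouped sum over bins with d(f, t) = d (shapes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
F, T, D, S, Z = 6, 7, 4, 2, 3

p_obs_ft = rng.random((F, T)); p_obs_ft /= p_obs_ft.sum()  # normalized spectrogram
d_ft = rng.integers(0, D, size=(F, T))                     # hard DOA index per bin

q_ds = rng.random((D, S)); q_ds /= q_ds.sum()
q_f_zs = rng.random((F, Z, S)); q_f_zs /= q_f_zs.sum(axis=0)
q_tz_s = rng.random((T, Z, S)); q_tz_s /= q_tz_s.sum(axis=(0, 1))

q_ft_s = np.einsum('fzs,tzs->fts', q_f_zs, q_tz_s)  # q(f, t | s)
q_dft_s = q_ds[d_ft]                                # q(d(f,t), s), shape (F, T, S)

# Denominator only where the delta is nonzero: q(f, t, d(f, t)).
denom = np.einsum('fts,fts->ft', q_ft_s, q_dft_s)
rho_ft = p_obs_ft / denom

# q(d, s) <- q(d, s) * sum over bins with d(f, t) = d of rho(f, t) q(f, t | s).
contrib = rho_ft[:, :, None] * q_ft_s               # (F, T, S)
acc = np.zeros((D, S))
np.add.at(acc, d_ft.ravel(), contrib.reshape(-1, S))
new_q_ds = q_ds * acc
print(new_q_ds.sum())  # ≈ 1.0
```

Nothing of size F × T × D is ever built; every intermediate is at most F × T × S.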
3.4 Constraining q(d | s)

So far the algorithm is invariant to permuting the direction labels d: we have not told it if d = 1 is close to d = 2, etc. In favorable circumstances, when there is low noise and EM does not get stuck in bad local optima, we experimentally observe that the algorithm infers the geometry from the data in the sense that the learned q(d | s) varies smoothly with d for each s. In less favorable circumstances an underlying source of sound can be split between multiple "separated" sources reported by the algorithm, resulting in audible artifacts and qualitatively poor separation.

We improve output quality by enforcing structure on q(d | s). For concreteness we consider the case when the directions d = 1, ..., D indicate azimuthally equally spaced angles all the way around the microphone array. We constrain q(d | s) to the two-dimensional exponential family with sufficient statistics φ(d) = (sin(2πd/D), cos(2πd/D)), a discretized version of the von Mises family. The distributions q_{θ_s}(d | s) ∝ exp(φ(d) · θ_s) look like unimodal bumps on a (discretized) circle, with the polar coordinates of the natural parameters θ_s controlling the position and height of the bump.

(Strictly speaking D must enter the resource requirements somewhere. We assume the expected use case in which D < FT.)

    Algorithm                                     SDR    SIR    SAR    run time (% real time)
    Ideal Binary Mask                             13.9   22.7   14.7   1.4%
    Ideal Ratio Mask                              13.1   18.6   15.0   1.5%
    Directional NTF, constrained q_{θ_s}(d | s)
    Directional NTF, unconstrained q(d | s)

Table 1: BSS_EVAL metrics in dB and run time for each algorithm, over random instances. Ideal masks use ground truth and should be viewed as bounds only. A range of ± . is an (at least) confidence interval for all true average BSS_EVAL metrics. Run times are from python code (http://arxiv.org/src/1411.5010/anc) on a MacBook Pro.

At a particular iteration call the exact maximizer of the minorizer (with no exponential family constraint) given by Gibbs' inequality q̂(d | s); we saw how to compute this in Sections 3.2 and 3.3. Replacing q(d | s) in the cross entropy objective with q_{θ_s}(d | s), the same argument as before goes through and all distributions except q_{θ_s}(d | s) are computed in the same way. The term involving q_{θ_s}(d | s) works out to be Σ_d q̂(d | s) log q_{θ_s}(d | s) and its gradient with respect to θ_s is Σ_d [q̂(d | s) − q_{θ_s}(d | s)] φ(d), the difference between the moments of φ(d) with respect to the target q̂ and those with respect to q_{θ_s}. Since the updates are being iterated anyway, we observe empirically that it suffices to fix a step size λ and take a single gradient step in θ_s at each iteration:

    θ_s^new = θ_s^old + λ Σ_d [q̂(d | s) − q_{θ_s^old}(d | s)] φ(d).

As shown in Table 1 this change increases output quality significantly with only a marginal effect on run time.
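A sketch of this gradient step on the discretized von Mises family (the target q̂ below is made up; in the algorithm a single step is taken per EM iteration, while here we iterate the step alone to show the moments matching):

```python
import numpy as np

D = 24                     # number of azimuthal direction bins
lam = 2.0                  # step size, as in the experiments

d = np.arange(D)
phi = np.stack([np.sin(2 * np.pi * d / D), np.cos(2 * np.pi * d / D)], axis=1)  # (D, 2)

def q_theta(theta):
    """Discretized von Mises family: q_theta(d) proportional to exp(phi(d) · theta)."""
    w = np.exp(phi @ theta)
    return w / w.sum()

def gradient_step(theta, q_hat):
    """theta <- theta + lam * sum_d [q_hat(d) - q_theta(d)] phi(d)."""
    return theta + lam * ((q_hat - q_theta(theta)) @ phi)

# Toy stand-in for the unconstrained Gibbs maximizer q_hat(d | s).
rng = np.random.default_rng(5)
q_hat = rng.random(D); q_hat /= q_hat.sum()

theta = np.zeros(2)
for _ in range(200):
    theta = gradient_step(theta, q_hat)
```

At a fixed point the expected sufficient statistics under q_theta match those under q̂, the usual moment-matching condition for exponential families.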
4 Experiments

Qualitatively, Directional NTF separates well on a mm × mm rectangular MEMS microphone array we built. Audio of three people talking simultaneously recorded with that array and the corresponding separated output is available in the supplemental materials. A portion of this recording and its separation into sources is shown in Figure 2.

For a quantitative experiment, we simulated this array in a m × m room. For each of the instances of the experiment, we randomly selected two TIMIT sentences [5] from different speakers in the TIMIT test data set and used the code from [3] to simulate these sentences being spoken simultaneously in the room. We simulated data due to the lack of publicly available data with such closely-spaced microphones and to enable us to use the ground truth to quantify performance. Source and array locations were uniformly random, conditioned to be at least . m from the walls.

We compare four factorization-based source-separation algorithms: Directional NTF as in Section 3.3 with the constrained q_{θ_s}(d | s) from Section 3.4; Directional NTF with unconstrained q(d | s); a less structured version called Directional NMF [15], which consists of factoring p_obs(f, t, d) ≈ Σ_s q(f, t | s) q(d, s), to highlight the importance of imposing structure on the separated sources; and Supervised NMF [12] as reviewed in Section 2.3. The first three methods receive only the four-channel mixed audio, while the fourth receives one channel of mixed audio and a different clean TIMIT sentence of training data for each speaker. For an upper bound we compare two oracular masks. See Table 1 for results using the mir_eval implementation [11] of the BSS_EVAL metrics [16].

All algorithms are set to extract S = 2 sources. Directional NTF and Supervised NMF each model sources with Z = 20 dictionary elements; Directional NMF has no such parameter.
Directional methods receive a least-squares estimated azimuthal DOA angle for each time-frequency bin (Section 2.4) quantized to D = 24 levels and the constrained q_{θ_s}(d | s) method uses a learning rate λ = 2 for the natural parameters. Note that in [15] the Directional NMF model was used with a dense estimate of energy as a function of direction, rather than a sparse estimate of a single dominant direction for each time-frequency bin, so these results are not directly comparable to that paper.

[Figure 2: The top shows a portion of a spectrogram from a real recording of three people talking simultaneously, each about a meter from a mm × mm rectangular microphone array, with time-frequency bins colored according to azimuth angle estimated as in Section 2.4. Interpreting color as height out of the page, this is the sparse time × frequency × direction tensor used as the input to (Sparse) Directional NTF. The bottom shows the computed masks for separating this signal into three sources with q(s | f, t) for s = 1, 2, 3 being interpreted as the red, blue, and green channels, respectively. Raw input and separated output audio are available in the supplemental materials.]

5 Conclusions

Directional NTF is better than the other (non-oracular) algorithms according to all BSS_EVAL metrics (Table 1). Real mixtures admit qualitative perceptual improvements in line with simulation. We close with directions for future work.

First, this method fits naturally into the basic NMF / NTF framework. As such it should be extensible using the many improvements to these methods available in the literature.

Second, when blindly separating speech from background noise, it is an open question how to automatically determine which source is speech. In some applications one can infer this from the centers of the learned distributions q(d | s) and prior information about the location of the speaker. In others one may expect diffuse noise and call the source with q(d | s) more tightly-peaked the speaker. A more broadly-applicable source selection method would be desirable.

Third, we leave analysis of performance as a function of geometry and level of reverberation for future work.

Acknowledgements
The author thanks Paris Smaragdis, Johannes Traa, Théophane Weber, and David Wingate for theirhelpful suggestions regarding this work.
References

[1] J. Benesty, J. Chen, and Y. Huang. Microphone Array Signal Processing, volume 1 of Topics in Signal Processing. Springer, 2008.
[2] Chris Ding, Tao Li, and Wei Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis, 52:3913–3927, 2008.
[3] Ivan Dokmanic, Robin Scheibler, and Martin Vetterli. Raking the cocktail party. IEEE Journal of Selected Topics in Signal Processing, 2014.
[4] Derry FitzGerald, Matt Cranitch, and Eugene Coyle. Non-negative tensor factorisation for sound source separation. In Proceedings of the Irish Signals and Systems Conference, September 2005.
[5] John Garofolo et al. TIMIT acoustic-phonetic continuous speech corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium, 1993.
[6] David R. Hunter and Kenneth Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9(1):60–77, March 2000.
[7] Brian King, Cédric Févotte, and Paris Smaragdis. Optimal cost function and magnitude power for NMF-based speech separation and music interpolation. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, September 2012.
[8] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, volume 13, pages 556–562. MIT Press, 2000.
[9] Seokjin Lee, Sang Ha Park, and Koeng-Mo Sung. Beamspace-domain multichannel nonnegative matrix factorization for audio source separation. IEEE Signal Processing Letters, 19(1):43–46, January 2012.
[10] Alexey Ozerov and Cédric Févotte. Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):550–563, March 2010.
[11] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Conference on Music Information Retrieval, 2014.
[12] Bhiksha Raj and Paris Smaragdis. Latent variable decomposition of spectrograms for single channel speaker separation. In Proceedings of WASPAA, 2005.
[13] Paris Smaragdis. Blind separation of convolved mixtures in the frequency domain. Neurocomputing, 22:21–34, 1998.
[14] Johannes Traa. Multichannel source separation and tracking with phase differences by random sample consensus. Master's thesis, University of Illinois, 2013.
[15] Johannes Traa, Paris Smaragdis, Noah D. Stein, and David Wingate. Directional NMF for joint source localization and separation. In Proceedings of WASPAA, 2015.
[16] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
[17] Özgür Yılmaz and Scott Rickard. Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 52(7):1830–1847, July 2004.