Generalized Minimal Distortion Principle for Blind Source Separation
Robin Scheibler
LINE Corporation, Japan
[email protected]
Abstract
We revisit the source image estimation problem from blind source separation (BSS). We generalize the traditional minimum distortion principle to maximum likelihood estimation with a model for the residual spectrograms. Because residual spectrograms typically contain other sources, we propose to use a mixed-norm model that lets us finely tune sparsity in time and frequency. We propose to carry out the minimization of the mixed norm via majorization-minimization optimization, leading to an iteratively reweighted least-squares algorithm. The algorithm balances efficiency and ease of implementation well. We assess the performance of the proposed method as applied to two well-known determined BSS algorithms and one joint BSS-dereverberation algorithm. We find that it is possible to tune the parameters to improve separation with no increase in distortion, and at little computational cost. The method thus provides a cheap and easy way to boost the performance of blind source separation.
Index Terms: blind source separation, scale ambiguity, minimal distortion, sparsity inducing norms, MM algorithm
1. Introduction
Blind source separation (BSS) conveniently separates a mixture signal into its constituent components without the need for prior information [1]. A pioneering method of BSS is independent component analysis (ICA), which only requires that the signal components be statistically independent and non-Gaussian [2]. In its canonical form, ICA tackles determined linear mixtures, where the number of components is the same as that of sensors. In this case, the separation task boils down to finding a square demixing matrix that makes the output components independent. The determined BSS problem suffers from two inherent ambiguities. First, any permutation of the sources in the output is equally acceptable. Second, the sources may be scaled arbitrarily. The first problem is particularly problematic in frequency-domain BSS (FD-BSS) [3], where sources extracted at each frequency must be aligned. This can be done via clustering [4], or by considering the joint distribution over frequencies as in independent vector analysis (IVA) [5, 6].

Relatively less attention has been given to the scale ambiguity problem. For FD-BSS on acoustic mixtures, the ambiguity is equivalent to an arbitrary filtering of the sources. Without a correction step, separated sources typically do not sound natural at all. This can be addressed by estimating source images, that is, the source signals as perceived at the microphone locations. There have traditionally been two ways of doing this. First, the so-called projection back (PB) method makes use of the linearity of determined BSS [7]. It relies on the observation that the columns of the inverse of the demixing matrix are steering vectors for the sources. This method may be unstable if the demixing matrix is poorly conditioned, which is not frequent, but may happen for some algorithms. The second method applies the minimal distortion principle (MDP) to adjust the scale with respect to the input microphone signal.
Exploiting the independence of the other signals, it finds the filter minimizing the squared distance between the separated source and the input signal. From a maximum likelihood point of view, this method assumes a Gaussian distribution of the residual when computing the distortion. However, the residual is in fact the sum of the other sources and background noise. Due to the non-Gaussianity of sources, it is unlikely to be Gaussian, leading to a sub-optimal choice for the scaling filter. For example, residual sources may have some very large components. A squared norm will try to reduce them, possibly at the cost of the target source. For a detailed comparison and analysis of both methods, see [8].

In this paper, we propose the generalized minimal distortion principle (GMDP) that uses the maximum likelihood estimator (MLE) for the source images. We further propose to instantiate GMDP based on sparsity promoting mixed norms. The intuition for using such a measure of distortion is that we want to allow the residual from minimal distortion to have some large entries. The use of ℓ_p-norms, and the ℓ_1-norm in particular, for this purpose has been popularized by the LASSO algorithm [9] and the compressed sensing literature [10]. Another way to understand the use of such norms is via a generative model of the residual and maximum likelihood estimation. For example, the ℓ_1 norm corresponds to a Laplace model. What we propose is to penalize the residual between the separated source and the reference input signal using a mixed norm ℓ_{p,q}, for 0 < p ≤ q ≤ 2. The mixed norm allows us to promote sparsity at different rates across the two dimensions of the data (i.e., time and frequency for audio signals). Unlike for the ℓ_2-norm, there is no closed-form solution for the mixed norm minimization. Instead, we rely on majorization-minimization (MM), whereby a surrogate function dominating the objective is repeatedly minimized [11].
We construct the surrogate function from an inequality previously used in the context of sound field decomposition [12]. The final algorithm falls in the family of iteratively reweighted least-squares (IRLS), which has been heavily investigated in the context of sparse regression [13, 14]. We validate the proposed GMDP via large numerical simulations of determined speech separation. We investigate the performance for several BSS algorithms: AuxIVA [15], ILRMA [16], and the joint BSS and dereverberation algorithm ILRMA-T [17]. We sweep values of 0 < p ≤ q ≤ 2 for different numbers of sources and find that our approach outperforms both MDP and PB in terms of standard BSS metrics. The code for the experiments is shared at https://github.com/fakufaku/2020_interspeech_gdmp.

The rest of this paper is as follows. Section 2 covers the conventional BSS scaling strategies. Section 3 describes the proposed method. The results of numerical experiments are shown in Section 5.

2. Background

The notation in this paper is as follows. We use lower and upper case bold letters for vectors and matrices, respectively. Furthermore, A^⊤ and A^H denote the transpose and conjugate transpose of matrix A, respectively. The Euclidean norm of vector v is ‖v‖ = (v^H v)^{1/2}. The diagonal matrix with v on its diagonal is denoted diag(v). Unless specified otherwise, indices k, m, f, and n are for source, sensor, frequency, and time, respectively. They always run up to the corresponding capital letter, i.e., K, M, F, and N, respectively.

We consider FD-BSS with K sources and M sensors in the short time Fourier transform (STFT) domain [18]. The sensor inputs are described by the linear mixing model

x_{mfn} = \sum_{k=1}^{K} h_{mkf} s_{kfn} + b_{mfn},   (1)

where x_{mfn}, s_{kfn} ∈ C are the m-th sensor and k-th source signals, respectively, at frequency f and frame n, and h_{mkf} ∈ C is the transfer function between the two.
The term b_{mfn} optionally encompasses extra background noise and model mismatch. We can conveniently group the sensor signals in the vector x_{fn} = [x_{1fn}, …, x_{Mfn}]^⊤, and similarly the sources and noise in s_{fn} and b_{fn}, respectively. Defining the channel matrix H_f ∈ C^{M×K} such that (H_f)_{mk} = h_{mkf}, (1) can be written in the compact form

x_{fn} = H_f s_{fn} + b_{fn}, ∀f, n.   (2)

Under this model, BSS algorithms may at best attempt to recover a source vector estimate y_{fn} such that there is a mixing matrix A_f ∈ C^{M×K} with

x_{fn} ≈ A_f y_{fn}, ∀f, n.   (3)

The scale ambiguity is clear since, for any non-singular diagonal matrix D, A_f D^{-1} and D y_{fn} form an equally valid solution. To avoid this scaling ambiguity, the source images are sought instead. These are the sources as measured at a sensor location, e.g., the k-th source at the m-th sensor is ŝ_{mkfn} = h_{mkf} s_{kfn}. When A_f is available, the source image of the k-th source can be obtained from its k-th column a_{kf} as

ŷ_{kfn} = a_{kf} y_{kfn},   (4)

so that x_{fn} ≈ Σ_k ŷ_{kfn}. Unfortunately, A_f is typically not known and must also be estimated.

The so-called projection back technique is applicable when the number of sources is the same as that of sensors, i.e., M = K [7], and b_{mfn} = 0 in (1). In this case, the demixing is typically done by estimating a square demixing matrix W_f ∈ C^{M×M} so that

y_{fn} = W_f x_{fn}, ∀f, n.   (5)

In that case, it is clear that (3) holds with equality for A_f = W_f^{-1}. Thus, the estimated source image is

ŷ_{fn} = A_f e_k y_{kfn} = W_f^{-1} e_k y_{kfn},   (6)

where e_k is the k-th column of the identity matrix. This method is widely used in practice and works reasonably well, as long as W_f is well-conditioned. This is usually the case, as most BSS algorithms either impose its orthogonality, e.g., [19], or penalize it with a log-determinant term, e.g., [15].
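Projection back as described above amounts to one matrix inverse per frequency followed by a per-source rescaling. The following numpy sketch illustrates it; it is our own minimal illustration, not the paper's reference implementation, and the function name and array layout (frequency, frame, channel) are our assumptions.

```python
import numpy as np

def projection_back(Y, W):
    """Rescale separated sources by projection back (illustrative sketch).

    Y : (F, N, K) array, separated spectrograms, y_fn = W_f x_fn
    W : (F, K, K) array, demixing matrices per frequency

    Returns the (F, N, K) source images referenced to microphone 0.
    """
    # Columns of A_f = W_f^{-1} play the role of steering vectors.
    A = np.linalg.inv(W)  # (F, K, K)
    # Image of source k at mic 0 is (A_f)_{0k} * y_kfn, cf. eq. (6).
    scale = A[:, 0, :]  # (F, K): reference-mic row of each A_f
    return Y * scale[:, None, :]
```

By construction, summing the images over sources reproduces the reference microphone signal exactly whenever Y = W X, since Σ_k (A_f)_{0k} y_kfn = (A_f y_fn)_0 = x_0fn.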
In contrast, MDP finds the mixing weights that minimize the sum of squared differences between the separated source and the microphone inputs [20, 21], i.e., ŷ_{fn} = a_{kf} y_{kfn}, with

a_{kf} = \arg\min_{a \in \mathbb{C}^M} \mathbb{E} \| x_{fn} - a y_{kfn} \|^2.   (7)

While derived originally from a different perspective, this source image estimator is optimal in the maximum likelihood sense when the sources are uncorrelated and the background noise is Gaussian. Under the uncorrelatedness assumption, one can show that

\mathbb{E} \| x_{fn} - a y_{kfn} \|^2 = \mathbb{E} \| h_{kf} s_{kfn} - a y_{kfn} \|^2 + \mathrm{const.},   (8)

where h_{kf} is the k-th column of H_f. Thus, (7) is indeed the MLE of a_{kf} if b_{mfn} is Gaussian. The MDP can be shown to be equivalent to PB under the uncorrelatedness assumption [8]. However, it is more stable and can deal with K ≠ M. In practice, however, both assumptions for its optimality are routinely violated. In multivariate source models, such as IVA, uncorrelatedness is not required at all frequencies. In addition, the background is typically not Gaussian. We address these limitations in the next section.
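The minimizer of (7) is a per-frequency least-squares fit of the microphone signals onto the separated source, with the expectation replaced by an average over frames in practice. A small numpy sketch, under our own naming and array-layout assumptions:

```python
import numpy as np

def mdp_scale(X, y_k):
    """MDP mixing weights for one separated source (illustrative sketch).

    X   : (F, N, M) array, microphone spectrograms
    y_k : (F, N) array, one separated source

    Returns the (F, M) vectors a_kf minimizing sum_n ||x_fn - a y_kfn||^2,
    i.e. the closed-form least-squares solution of eq. (7) per frequency.
    """
    num = np.einsum("fnm,fn->fm", X, y_k.conj())  # sum_n x_fn y*_kfn
    den = np.sum(np.abs(y_k) ** 2, axis=1)        # sum_n |y_kfn|^2
    return num / den[:, None]
```

When the mixture is exactly a rank-one image of the source, X = a ⊗ y_k, the estimator recovers a exactly, which is a convenient sanity check.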
3. Generalized Minimal Distortion Principle
Consider the residual signal when the scaling factor a_{mkf} is assumed known, such that h_{mkf} s_{kfn} = a_{mkf} y_{kfn}. Then,

e_{mfn} = x_{mfn} - a_{mkf} y_{kfn} = \sum_{\ell \neq k} h_{m\ell f} s_{\ell fn} + b_{mfn}.   (9)

In other words, for the correct scaling factor, the residual is identical to the mixing model (1) with the k-th source removed. Note that sources are typically assumed to have non-Gaussian distributions. If their number is large, the central limit theorem may be invoked to justify Gaussianity of e_{mfn}. However, the typical number of sources in practice is small, e.g., two to four. In this case, the residual is also expected to be non-Gaussian.

We propose a generalized minimal distortion principle (GMDP) with a tunable error model. To allow modelling of inter-frequency and inter-frame dependencies, we consider the residual spectrograms

E_{mk}(z) = X_m - \mathrm{diag}(z) Y_k, \quad m = 1, \ldots, M,   (10)

where (X_m)_{fn} = x_{mfn}, (Y_k)_{fn} = y_{kfn}, and z = [z_1, \ldots, z_F]^⊤. Then, the MLE of the scaling vectors is

\{\hat{z}_{mk}\}_{m=1}^{M} = \arg\min_{z_1, \ldots, z_M \in \mathbb{C}^F} -\log p_e\left( \{ E_{mk}(z_m) \}_{m=1}^{M} \right),

where p_e is the probability density function corresponding to the chosen error model. Note that \hat{z}_{mk} = [a_{mk1}, \ldots, a_{mkF}]^⊤. While there are many possible choices for the error function, inspired by the success of spherical contrast functions in IVA [15], we propose to use p_e such that

-\log p_e(\{E_m\}_{m=1}^{M}) = \sum_{m=1}^{M} \| E_m \|_{p,q}^p + \mathrm{const.},   (11)

where the mixed ℓ_{p,q}-norm is defined as

\| E \|_{p,q} = \left( \sum_n \left( \sum_f |e_{fn}|^q \right)^{p/q} \right)^{1/p}.   (12)
Figure 1: Average SDR and SIR performance of GMDP. Bright and dark colors indicate high and low performance, respectively. (Panels: (a) AuxIVA SDR, (b) AuxIVA SIR, (c) ILRMA SDR, (d) ILRMA SIR, (e) ILRMA-T SDR, (f) ILRMA-T SIR; heatmaps over p and q, for 2, 3, and 4 microphones.)
In Laplace AuxIVA [15], an ℓ_{1,2}-norm is used, with the effect of promoting group sparsity over the frames. That is, active frames are sparse, but within an active frame, frequencies are not sparse. Using an ℓ_{p,q}-norm, it is possible to optimize the desired sparsity of both frames and frequency components. This makes sense for speech and music signals, which are typically somewhat sparse in both due to harmonics and non-stationarity.
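The mixed norm of (12) is an inner ℓ_q over frequencies followed by an outer ℓ_p over frames. A short sketch (our own helper, with the (F, N) spectrogram layout assumed throughout this note):

```python
import numpy as np

def mixed_norm(E, p, q):
    """l_{p,q} mixed norm of an (F, N) spectrogram E, cf. eq. (12):
    inner l_q over frequencies (axis 0), outer l_p over frames (axis 1)."""
    inner = np.sum(np.abs(E) ** q, axis=0)        # sum_f |e_fn|^q, per frame
    return np.sum(inner ** (p / q)) ** (1.0 / p)  # l_p across frames
```

Two special cases make the definition concrete: p = q = 2 recovers the Frobenius norm, and p = 1, q = 2 (the Laplace AuxIVA case) is the sum of per-frame ℓ_2 norms, which penalizes the number of active frames but not the spread of energy within a frame.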
4. Optimization with MM Algorithm
The scaling factors to obtain the source images under the proposed GMDP are

\hat{z}_{mk} = \arg\min_{z \in \mathbb{C}^F} \| E_{mk}(z) \|_{p,q}^p.   (13)

Unlike for the MDP, there is unfortunately no closed-form solution to this problem. Nevertheless, it can be efficiently tackled by an IRLS scheme derived using the MM technique [11]. For the optimization of an objective f(θ), the MM technique introduces a surrogate Q(θ, \hat{θ}) such that

Q(\hat{θ}, \hat{θ}) = f(\hat{θ}), \quad \text{and} \quad Q(θ, \hat{θ}) ≥ f(θ), \quad ∀ θ, \hat{θ}.   (14)

Then, the sequence of iterates

θ_t = \arg\min_{θ} Q(θ, θ_{t-1}), \quad t = 1, \ldots, T,   (15)

monotonically decreases the cost function since

f(θ_{t-1}) = Q(θ_{t-1}, θ_{t-1}) ≥ \min_{θ} Q(θ, θ_{t-1}) ≥ f(θ_t).

To construct the surrogate, we use the following inequality,

r^q ≤ \frac{q}{2} \hat{r}^{q-2} r^2 + \mathrm{const.}, \quad 0 < q ≤ 2,   (16)

which can be derived from an inequality for super-Gaussian sources [24] or from concave-convex arguments [12]. For values of 0 < p ≤ q ≤ 2, by applying (16) twice, we have

Q(E; \hat{E}) = \sum_{n,f} w_{fn}(\hat{E}) |e_{fn}|^2 + C ≥ \| E \|_{p,q}^p,   (17)

where C is a constant and the weights are

w_{fn}(\hat{E}) = \frac{p}{2} \left( \sum_{f'=1}^{F} |\hat{e}_{f'n}|^q \right)^{\frac{p}{q} - 1} |\hat{e}_{fn}|^{q-2}.   (18)

Equality holds for E = \hat{E}. Finally, given the current iterate z_t, the next iterate is obtained by minimizing Q(E_{mk}(z); E_{mk}(z_t)), i.e.,

(z_{t+1})_f \leftarrow \frac{\sum_n w_{fn}(z_t) x_{mfn} y_{kfn}^*}{\sum_n w_{fn}(z_t) |y_{kfn}|^2},   (19)

where w_{fn}(z) is short for w_{fn}(E_{mk}(z)).
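The whole IRLS loop is a few lines of numpy: compute the residual, form the weights of (18), and solve the weighted least-squares update (19) per frequency. The sketch below is our own reading of the algorithm, not the released code (linked in the introduction); the small `eps` guard against zero residuals is our addition.

```python
import numpy as np

def gmdp_scale(X_m, Y_k, p=1.0, q=2.0, n_iter=100, tol=1e-4, eps=1e-15):
    """IRLS for the GMDP scaling of one source/mic pair (illustrative sketch).

    X_m, Y_k : (F, N) arrays, reference-mic and separated-source spectrograms
    Returns z : (F,) scaling minimizing ||X_m - diag(z) Y_k||_{p,q}^p, eq. (13).
    """
    # Initialize with the quadratic (MDP, p = q = 2) closed-form solution.
    z = np.sum(X_m * Y_k.conj(), axis=1) / (np.sum(np.abs(Y_k) ** 2, axis=1) + eps)
    for _ in range(n_iter):
        E = X_m - z[:, None] * Y_k                   # residual spectrogram
        aE = np.abs(E) + eps                         # eps guards zero residuals
        g = np.sum(aE ** q, axis=0)                  # per-frame sum_f |e_fn|^q
        # Weights of eq. (18): (p/2) * g^(p/q - 1) * |e_fn|^(q - 2)
        w = 0.5 * p * g[None, :] ** (p / q - 1.0) * aE ** (q - 2.0)
        # Weighted least-squares update, eq. (19), per frequency
        z_new = np.sum(w * X_m * Y_k.conj(), axis=1) / (
            np.sum(w * np.abs(Y_k) ** 2, axis=1) + eps
        )
        if np.linalg.norm(z_new - z) <= tol * np.linalg.norm(z):
            z = z_new
            break
        z = z_new
    return z
```

For p = q = 2 all weights reduce to 1 and the loop terminates immediately at the MDP solution, so MDP is literally the quadratic special case of this routine; for p < 2 the MM guarantee means each iteration cannot increase the mixed-norm objective.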
5. Experiments
We use the pyroomacoustics package to simulate one hundred random rooms with reverberation times (T60) between 60 ms and 500 ms [25]. The microphone array is circular, with its diameter chosen to fix the spacing of neighboring elements, and is placed at random in the room. The sources are placed at random, but so that they fall within [d_crit, d_crit + 1 m] from the array. Here, d_crit = 0.057 √(V / T60) m is the critical distance, with V the volume of the room [26]. The source signals are speech utterances from the CMU Arctic database [27, 28].

We evaluate the effectiveness of GMDP on three determined BSS algorithms (i.e., same number of sources and microphones): AuxIVA [15], an IVA with a simple power based source model; ILRMA [16], an IVA with a non-negative low-rank source model; and ILRMA-T [17], a joint BSS and dereverberation algorithm with a non-negative low-rank source model. The source images are then estimated with PB [7], MDP [20], and the proposed GMDP. For GMDP, the values of p and q are swept in increments of 0.1 over their range so that we can find the best parameters. AuxIVA and ILRMA are evaluated in terms of scale-invariant signal-to-distortion ratio (SI-SDR) and signal-to-interference ratio (SI-SIR) [22], with the clean reverberant microphone signals used as reference. The evaluation of ILRMA-T is different since it also dereverberates the signals. We evaluate it with bss_eval [23] in terms of conventional SDR and SIR, on the clean, anechoic microphone signals. The metric from bss_eval forgives a 512-tap filter (32 ms at 16 kHz), which accounts for the residual reverberation. For GMDP, we run the MM algorithm for 100 iterations, or until the relative change ‖z_t − z_{t−1}‖ / ‖z_{t−1}‖ falls below a fixed tolerance, whichever comes first. The reference to compute the source images is the first microphone.

Table 1: Average performance of different algorithms. SDR/SIR refer to SI-SDR/SI-SIR [22] for AuxIVA and ILRMA, and to bss_eval's SDR/SIR [23] for ILRMA-T. This is due to the latter using the anechoic signals as reference.

Algo.          Mics   PB SDR   PB SIR   MDP SDR   MDP SIR
AuxIVA [15]     2      9.72    22.78     9.50     22.16
ILRMA [16]      2      7.06    21.55     7.36     20.19
ILRMA-T [17]    2      5.99    10.02     5.87      9.84

Table 2: Best parameters for the proposed algorithm.

                      SDR             SIR-10          SDR-F
Algo.     Mics    p    q    N     p    q    N     p    q    N
AuxIVA     2     0.8  1.9   4    0.4  0.8   9    1.1  1.7   4
AuxIVA     3     0.7  1.6   5    0.3  1.0   8    1.1  1.7   4
AuxIVA     4     1.4  1.8   3    1.1  1.5   4    1.1  1.7   4
ILRMA      2     1.0  2.0   3    0.5  1.4   6    1.3  1.9   3
ILRMA      3     0.9  1.6   4    0.5  0.9   8    1.3  1.9   3
ILRMA      4     1.8  2.0   3    1.7  2.0   3    1.3  1.9   3
ILRMA-T    2     0.6  1.5   5    0.5  0.8  10    0.4  1.5   6
ILRMA-T    3     0.4  1.6   6    0.3  0.9  10    0.4  1.5   6
ILRMA-T    4     0.1  1.7   8    0.1  1.1  10    0.4  1.5   6

Fig. 1 shows the result of the parameter sweep for p and q. The heatmaps show high and low SDR/SIR with brighter and darker colors, respectively. We observe that the general trends of SDR and SIR are quite different. SDR is poor for small values of p, q and generally improves going towards 2. On the contrary, SIR improves towards smaller values. This is expected since SDR measures faithfulness to the reference signal while SIR measures effectiveness of separation.

In Table 1, we compare the performance of PB and MDP to that of GMDP. Since a balance between SDR and SIR needs to be found, we compare a few strategies for picking the best p and q. We first note that the proposed method performs better for all strategies. GMDP/SDR chooses p and q yielding the highest SDR. Under this choice, the SIR gain tends to be modest. However, from Fig. 1 we also note that the SDR changes little over a large range of parameter values. Thus, the strategy GMDP/SIR chooses p and q yielding the largest SIR under the constraint that the SDR is no less than that of MDP (i.e., p = q = 2). In this case, the SIR increases for most algorithms with no decrease of SDR. Now, in some cases, especially for ILRMA-T, some parameters may lead to a large iteration count of the MM algorithm. The GMDP/SIR-10 strategy is the same as the previous one, but further limits the median iteration count to 10. This is achieved at very little cost in either metric. Finally, it may be of practical interest to fix p and q independently of the channel count. GMDP/SDR-F maximizes the average SDR over all channel counts.
While still improving over PB and MDP in most cases, the gain is more modest, and a case-by-case choice of p and q seems necessary to obtain the best performance. This makes sense since the spectrogram error distribution varies with the number of residual sources. Table 2 contains all the parameters used in this experiment and the median number of iterations of the MM algorithm.
6. Conclusions
We proposed a new method for the estimation of source images from the signals separated by BSS algorithms. The method generalizes the traditional minimal distortion principle to maximum likelihood estimation with a problem-specific residual spectrogram model. Concretely, we proposed to minimize a mixed norm that allows promoting sparsity at different rates in time and frequency. The optimization is carried out by a simple MM algorithm that is both fast and straightforward to implement. We demonstrated the effectiveness of the method on several BSS and joint BSS-dereverberation algorithms. We showed that the proposed method improves the separation without degradation in SDR and with minimal computational overhead. Finally, we point out that the proposed method can be combined with further post-processing using beamforming, as recently proposed [29].

7. References

[1] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, 1st ed. Oxford, UK: Academic Press/Elsevier, 2010.
[2] P. Comon, "Independent component analysis, a new concept?"
Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[3] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, no. 1–3, pp. 21–34, Nov. 1998.
[4] H. Sawada, S. Araki, and S. Makino, "Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS," in Proc. IEEE ISCAS, New Orleans, LA, USA, May 2007, pp. 3247–3250.
[5] T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, "Blind source separation exploiting higher-order frequency dependencies," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 1, pp. 70–79, Dec. 2006.
[6] A. Hiroe, "Solution of permutation problem in frequency domain ICA, using multivariate probability density functions," in Proc. ICA. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 601–608.
[7] N. Murata, S. Ikeda, and A. Ziehe, "An approach to blind source separation based on temporal structure of speech signals," Neurocomputing, vol. 41, no. 1–4, pp. 1–24, Oct. 2001.
[8] Z. Koldovský and F. Nesta, "Performance analysis of source image estimators in blind source separation," IEEE Trans. Signal Process., vol. 65, no. 16, pp. 4166–4176, Jun. 2017.
[9] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. B, vol. 58, no. 1, pp. 267–288, Jan. 1996.
[10] E. J. Candes and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008.
[11] K. Lange, MM Optimization Algorithms. SIAM, 2016.
[12] N. Murata, S. Koyama, N. Takamune, and H. Saruwatari, "Sparse representation using multidimensional mixed-norm penalty with application to sound field decomposition," IEEE Trans. Signal Process., vol. 66, no. 12, pp. 3327–3338, May 2018.
[13] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk, "Iteratively reweighted least squares minimization for sparse recovery," Communications on Pure and Applied Mathematics, vol. 63, no. 1, pp. 1–38, Jan. 2010.
[14] M. Kowalski, "Sparse regression using mixed norms," Appl. Comput. Harmon. Anal., vol. 27, no. 3, pp. 303–324, Nov. 2009.
[15] N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," in Proc. IEEE WASPAA, New Paltz, NY, USA, Oct. 2011, pp. 189–192.
[16] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM Trans. Audio Speech Lang. Process., 2016.
[17] R. Ikeshita, N. Ito, T. Nakatani, and H. Sawada, "A unifying framework for blind source separation based on a joint diagonalizability constraint," in Proc. IEEE EUSIPCO, Sep. 2019.
[18] J. Allen, "Short term spectral analysis, synthesis, and modification by discrete Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 3, pp. 235–238, Jun. 1977.
[19] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 626–634, May 1999.
[20] K. Matsuoka and S. Nakashima, "Minimal distortion principle for blind source separation," in Proc. ICA, San Diego, Dec. 2001, pp. 722–727.
[21] K. Matsuoka, "Minimal distortion principle for blind source separation," in Proc. SICE, Osaka, Japan, Aug. 2002, pp. 2138–2143.
[22] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR - half-baked or well done?" in Proc. IEEE ICASSP, Brighton, UK, May 2019, pp. 626–630.
[23] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462–1469, Jun. 2006.
[24] N. Ono and S. Miyabe, "Auxiliary-function-based independent component analysis for super-Gaussian sources," Proc. LVA/ICA, vol. 6365, pp. 165–172, Sep. 2010.
[25] R. Scheibler, E. Bezzam, and I. Dokmanić, "Pyroomacoustics: A Python package for audio room simulations and array processing algorithms," in Proc. IEEE ICASSP, Calgary, CA, Apr. 2018, pp. 351–355.
[26] H. Kuttruff, Room Acoustics. CRC Press, 2009.
[27] J. Kominek and A. W. Black, "CMU ARCTIC databases for speech synthesis," Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-LTI-03-177, 2003.
[28] R. Scheibler, "CMU ARCTIC concatenated 15s," Zenodo. [Online]. Available: http://doi.org/10.5281/zenodo.3066489
[29] S. Araki, N. Ono, K. Kinoshita, and M. Delcroix, "Projection back onto filtered observations for speech separation with distributed microphone array," in