Joint Dereverberation and Separation with Iterative Source Steering
Taishi Nakashima♦, Robin Scheibler♣, Masahito Togami♣, Nobutaka Ono♦

♦Tokyo Metropolitan University, Tokyo, Japan. ♣LINE Corporation, Tokyo, Japan.
ABSTRACT
We propose a new algorithm for joint dereverberation and blind source separation (DR-BSS). Our work builds upon the ILRMA-T framework that applies a unified filter combining dereverberation and separation. One drawback of this framework is that it requires several matrix inversions, an operation that is inherently costly and has potential stability issues. We leverage the recently introduced iterative source steering (ISS) updates to propose two algorithms mitigating this issue. Albeit derived from first principles, the first algorithm turns out to be a natural combination of weighted prediction error (WPE) dereverberation and ISS-based BSS, applied alternately. In this case, we manage to reduce the number of matrix inversions to only one per iteration and source. The second algorithm updates the ILRMA-T matrix using only sequential ISS updates, requiring no matrix inversion at all. Its implementation is straightforward and memory efficient. Numerical experiments demonstrate that both methods achieve the same final performance as ILRMA-T in terms of several relevant objective metrics. In the important case of two sources, the number of iterations required is also similar.
Index Terms — Blind source separation, dereverberation, joint optimization, independent low-rank matrix analysis, iterative source steering.
1. INTRODUCTION
Speech signals recorded by a microphone are routinely contaminated by reverberation and interference. Blind source separation (BSS) [1–3], e.g., independent component analysis (ICA) [4] and independent vector analysis (IVA) [5–8], and dereverberation (DR) [9] techniques, e.g., weighted prediction error (WPE) [10], are all countermeasures that have been proposed to recover the speech quality required for communication, speech diarization, and automatic speech recognition (ASR) systems. Historically, DR and BSS have evolved separately, and their joint optimization has not yet matured. Joint optimization is highly desirable to realize DR and BSS in the same framework (DR-BSS), as it typically leads to higher speech quality.

DR-BSS algorithms have been actively studied since WPE [10] was introduced [11–16]. A popular approach is to combine WPE [10] with a BSS algorithm such as independent low-rank matrix analysis (ILRMA) [17]. Early studies [12, 13] use separate DR and BSS filters. However, the computational cost of these approaches is very high due to the necessity of computing the inverse of a large matrix whose dimension is the product of the square of the number of microphones with the number of taps of the DR filter. The recently proposed ILRMA-T [14, 15] overcomes this difficulty by introducing a unified filter combining the DR and BSS filters. Nevertheless,
This work was done while Taishi Nakashima was an intern at LINE Corporation.
ILRMA-T still requires inverting two matrices per source and iteration. Because DR-BSS algorithms are typically needed in edge and embedded devices, where computational power is at a premium, inverse matrix computations are best avoided.

ILRMA-T derives the update equations for its DR-BSS matrix from the iterative projection (IP) rules of BSS [8]. In the BSS context, some of the authors have proposed iterative source steering (ISS), an alternative to IP that is more computationally efficient and does not require matrix inversion [18]. Thus, the ISS-based approach is more stable than the IP-based one. To the best of our knowledge, ISS-based DR-BSS has not been studied yet.

In this paper, we propose a joint optimization framework for DR-BSS with ISS [18]. The proposed method optimizes the same cost function as ILRMA-T, but using the ISS updates. Thus, we call it ILRMA-T-ISS. Two variants of ILRMA-T-ISS are proposed. The first one is obtained by updating all the weights in the DR-BSS matrix corresponding to dereverberation in a single step, and applying ISS for the rest. The resulting algorithm turns out to be a natural combination of WPE and ISS, with their respective updates applied alternately. We call this algorithm ILRMA-T-ISS-JOINT. It reduces the number of matrix inversions to only one per iteration and source. The second variant, ILRMA-T-ISS-SEQ, applies sequential ISS updates to the whole matrix. This has the happy consequence that not a single matrix inversion is required. One practical consequence is that its implementation is straightforward and no external linear algebra library is needed. These properties are all highly desirable in edge and embedded systems. We conduct numerical experiments to confirm the efficacy of the proposed methods in noisy reverberant environments with multiple speech sources. We confirm that separation and dereverberation performance are on par with ILRMA-T-IP, even without the matrix inversions.
2. BACKGROUND

2.1. Signal model and notation
Let N and M be the numbers of sources and microphones, respectively. Henceforth, we consider the determined case, N = M. We use the short-time Fourier transform (STFT) representation of the microphone input signals. The microphone input signal is modeled as the following convolutive mixture:

$$\mathbf{x}_{f,t} = \sum_{d=0}^{D-1} \mathbf{A}_{f,d}\,\mathbf{s}_{f,t-d} \in \mathbb{C}^{N}, \quad (1)$$

where $f \in \{1, \ldots, F\}$ and $t \in \{1, \ldots, T\}$ are the frequency bin and the time frame indices, respectively, $\mathbf{A}_{f,d}$ is the mixing matrix with $(\mathbf{A}_{f,d})_{n,m} = a_{n,m,f,d}$, $\mathbf{s}_{f,t}$ is the source signal, and $n \in \{1, \ldots, N\}$ is the source channel index.

In the rest of the manuscript, $\top$, $\mathsf{H}$, and $\det$ denote the transpose, Hermitian transpose, and determinant of a vector/matrix, respectively. We denote the $n$-th canonical basis vector by $\mathbf{e}_n$, an all-zero vector by $\mathbf{0}$, and the identity matrix by $\mathbf{I}$.

2.2. Weighted prediction error (WPE)

WPE [10] is a popular approach for DR. In WPE, (1) is converted to the following auto-regressive (AR) model:

$$\mathbf{x}_{f,t} = \sum_{\tau=0}^{L-1} \mathbf{Z}_{f,\tau}\,\mathbf{x}_{f,t-\Delta-\tau}, \quad (2)$$

where $\mathbf{Z}_{f,\tau}$ is a matrix containing the AR coefficients, $L$ is the tap length of the AR model, and $\Delta$ is the prediction delay. WPE assumes that there is only one speech source, and the AR coefficients are optimized with the time-varying variance $r_{f,t}$ of the speech source as follows:

$$\mathbf{Z}_f = \left(\sum_t \frac{\mathbf{x}_{f,t}\,\bar{\mathbf{x}}_{f,t}^{\mathsf{H}}}{r_{f,t}}\right)\left(\sum_t \frac{\bar{\mathbf{x}}_{f,t}\,\bar{\mathbf{x}}_{f,t}^{\mathsf{H}}}{r_{f,t}}\right)^{-1}, \quad (3)$$

where $\bar{\mathbf{x}}_{f,t} = \left[\mathbf{x}^{\top}_{f,t-\Delta} \cdots \mathbf{x}^{\top}_{f,t-\Delta-L+1}\right]^{\top} \in \mathbb{C}^{NL}$ and $\mathbf{Z}_f = \left[\mathbf{Z}_{f,0} \cdots \mathbf{Z}_{f,L-1}\right]$. The dereverberated signal $\mathbf{z}_{f,t}$ is obtained as $\mathbf{z}_{f,t} = \mathbf{x}_{f,t} - \mathbf{Z}_f \bar{\mathbf{x}}_{f,t}$. Then, we update $r_{f,t} = \|\mathbf{z}_{f,t}\|^2 / M$. Thus, $\mathbf{Z}_f$ and $r_{f,t}$ are updated in an alternating manner.

2.3. Joint dereverberation and separation

Cascade connection of WPE and BSS is not optimal because WPE assumes that there is only one source. In [12, 13], joint optimization of WPE and BSS is performed by using a WPE filter followed by a BSS filter.
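To illustrate, the alternating WPE iteration of (3) and the variance re-estimation can be sketched in NumPy for a single frequency bin. This is a minimal sketch with our own function and variable names, not the reference implementation; the small diagonal loading on the inverted matrix is our addition for numerical safety.

```python
import numpy as np

def wpe_iteration(X, r, taps, delay):
    """One WPE iteration for a single frequency bin, following eq. (3).

    X : (M, T) complex STFT frames of the M microphones.
    r : (T,) current time-varying variance estimate of the source.
    Returns the dereverberated frames z_{f,t} and the updated variance r_{f,t}.
    """
    M, T = X.shape
    # Stack the delayed frames: x_bar[t] = [x[t-delay]; ...; x[t-delay-taps+1]]
    Xbar = np.zeros((M * taps, T), dtype=complex)
    for tau in range(taps):
        shift = delay + tau
        Xbar[tau * M:(tau + 1) * M, shift:] = X[:, :T - shift]
    w = 1.0 / np.maximum(r, 1e-10)           # per-frame weights 1 / r_{f,t}
    P = (w * X) @ Xbar.conj().T              # sum_t x xbar^H / r
    R = (w * Xbar) @ Xbar.conj().T           # sum_t xbar xbar^H / r
    Z = P @ np.linalg.inv(R + 1e-6 * np.eye(M * taps))  # AR coefficients, eq. (3)
    Y = X - Z @ Xbar                         # dereverberated signal z_{f,t}
    return Y, np.mean(np.abs(Y) ** 2, axis=0)   # new r_{f,t} = ||z_{f,t}||^2 / M
```

In practice, these two steps are simply repeated for a fixed number of iterations in each frequency bin.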
The output signal is obtained as $\mathbf{y}_{f,t} = \mathbf{W}_f \left(\mathbf{x}_{f,t} - \mathbf{Z}_f \bar{\mathbf{x}}_{f,t}\right)$. A determined approach is proposed in [13] for the sequential optimization of $\mathbf{W}_f$ and $\mathbf{Z}_f$, such that the separated signal $\mathbf{y}_{f,t}$ is the maximum likelihood estimator of $\mathbf{s}_{f,t}$ under the following assumptions:

1. the sources are statistically independent;
2. the source signal at each time-frequency bin follows a complex Gaussian distribution,
$$p(y_{n,f,t}) = \frac{1}{\pi r_{n,f,t}} \exp\left(-\frac{|y_{n,f,t}|^2}{r_{n,f,t}}\right),$$
where $y_{n,f,t}$ is the $n$-th element of $\mathbf{y}_{f,t}$ and $r_{n,f,t}$ is the time-varying variance of the $n$-th source;
3. $r_{n,f,t}$ is modeled as $r_{n,f,t} = \sum_{k=1}^{K} c_{n,k,f}\, b_{n,t,k}$, where $K$ is the number of basis vectors, $c_{n,k,f} \geq 0$ is the basis coefficient of the $n$-th component, and $b_{n,t,k} \geq 0$ is the time-varying activity of the $n$-th component.

The parameters are updated to minimize the following negative log-likelihood function $\mathcal{J}$:

$$\mathcal{J} = \sum_{f,t}\left[-2\log\left|\det \mathbf{W}_f\right| + \sum_n \left(\frac{\left|\mathbf{w}_{n,f}^{\mathsf{H}}\left(\mathbf{x}_{f,t}-\mathbf{Z}_f\bar{\mathbf{x}}_{f,t}\right)\right|^2}{r_{n,f,t}} + \log r_{n,f,t}\right)\right], \quad (4)$$

where $\mathbf{w}_{n,f}^{\mathsf{H}}$ is the $n$-th row of $\mathbf{W}_f$. The IP-based parameter optimization [8] can be straightforwardly applied to the optimization of $\mathbf{W}_f$. Non-negative matrix factorization (NMF) is used to update $c_{n,k,f}$ and $b_{n,t,k}$ [17]. The optimal $\mathbf{Z}_f$ is also obtained straightforwardly by minimizing $\mathcal{J}$. However, when $\mathbf{Z}_f$ is updated, it is necessary to invert a large matrix whose dimension is proportional to $ML$. Thus, the computational cost is quite high.

As an alternative, ILRMA-T [14, 15] has been proposed. ILRMA-T combines WPE and ILRMA [17] for joint dereverberation and separation.
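The NMF update of the low-rank variance model $r_{n,f,t} = \sum_k c_{n,k,f}\, b_{n,t,k}$ follows the standard multiplicative rules of ILRMA [17] under the Itakura–Saito divergence. The sketch below shows these standard rules for a single source; the function name and the `eps` regularizer are our own assumptions.

```python
import numpy as np

def nmf_variance_update(C, B, Y2, eps=1e-10):
    """Multiplicative updates of the low-rank variance model (one source).

    C  : (F, K) non-negative basis coefficients.
    B  : (K, T) non-negative activations.
    Y2 : (F, T) squared magnitudes |y_{f,t}|^2 of the separated source.
    The model r[f, t] = (C @ B)[f, t] is fit under the IS divergence.
    """
    R = C @ B + eps
    C = C * np.sqrt(((Y2 / R**2) @ B.T) / ((1.0 / R) @ B.T + eps))
    R = C @ B + eps
    B = B * np.sqrt((C.T @ (Y2 / R**2)) / (C.T @ (1.0 / R) + eps))
    return C, B
```

A sanity check on these rules: when the model already matches the data, i.e. `Y2 == C @ B`, the multiplicative factors reduce to one and the parameters are left (approximately) unchanged.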
In ILRMA-T, the output signal is obtained by a unified filter $\mathbf{P}_f$ as $\mathbf{y}_{f,t} = \mathbf{P}_f \tilde{\mathbf{x}}_{f,t}$, where $\tilde{\mathbf{x}}_{f,t} = \left[\mathbf{x}^{\top}_{f,t}\ \bar{\mathbf{x}}^{\top}_{f,t}\right]^{\top} \in \mathbb{C}^{N(L+1)}$ and $\mathbf{P}_f = \mathbf{W}_f \left[\mathbf{I}\ \ {-\mathbf{Z}_f}\right]$. The cost function of ILRMA-T is equivalent to (4), that is,

$$\mathcal{J} = \sum_f \left[-2\log\left|\det \mathbf{W}_f\right| + \sum_n \mathbf{p}_{n,f}^{\mathsf{H}} \mathbf{V}_{n,f}\,\mathbf{p}_{n,f}\right], \quad (5)$$

where

$$\mathbf{V}_{n,f} = \frac{1}{T}\sum_t \frac{\tilde{\mathbf{x}}_{f,t}\,\tilde{\mathbf{x}}_{f,t}^{\mathsf{H}}}{r_{n,f,t}} \in \mathbb{C}^{N(L+1)\times N(L+1)}$$

is the weighted covariance matrix of $\tilde{\mathbf{x}}_{f,t}$.

Instead of optimizing $\mathbf{W}_f$ and $\mathbf{Z}_f$ sequentially, ILRMA-T optimizes each row vector of $\mathbf{P}_f$ sequentially based on IP [8]. The filter that separates and dereverberates the $n$-th source is defined as $\mathbf{p}^{\mathsf{H}}_{n,f}$, i.e., the $n$-th row vector of $\mathbf{P}_f$. It is updated as follows:

$$\mathbf{p}_{n,f} \leftarrow \frac{\mathbf{V}_{n,f}^{-1}\mathbf{a}_{n,f}}{\sqrt{\mathbf{a}_{n,f}^{\mathsf{H}}\mathbf{V}_{n,f}^{-1}\mathbf{a}_{n,f}}}, \quad \text{where} \quad \mathbf{a}_{n,f} = \begin{bmatrix}\mathbf{W}_f^{-1}\mathbf{e}_n \\ \mathbf{0}\end{bmatrix}. \quad (6)$$

Thus, the computation of two matrix inverses is needed in the $\mathbf{p}_{n,f}$ update. The updates of $c_{n,k,f}$ and $b_{n,t,k}$ are those of NMF. We call this algorithm ILRMA-T-IP.
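For reference, the ILRMA-T-IP row update (6), with its two inversions per source, could look as follows. This is a sketch under our own naming for one frequency bin; we use `np.linalg.solve` in place of explicit matrix inverses, which is the usual numerically safer choice.

```python
import numpy as np

def ilrma_t_ip_pass(P, V):
    """One IP sweep over the rows of the unified filter P, following eq. (6).

    P : (N, N(L+1)) unified DR-BSS filter for one frequency bin; its first
        N columns form the separation matrix W_f.
    V : (N, N(L+1), N(L+1)) Hermitian weighted covariance matrices V_{n,f}.
    """
    N = V.shape[0]
    D = P.shape[1]
    for n in range(N):
        W = P[:, :N]                                  # current separation part
        a = np.zeros(D, dtype=complex)
        a[:N] = np.linalg.solve(W, np.eye(N)[:, n])   # W^{-1} e_n (inversion 1)
        p = np.linalg.solve(V[n], a)                  # V_n^{-1} a  (inversion 2)
        p = p / np.sqrt(np.real(a.conj() @ p))        # a^H V_n^{-1} a normalization
        P[n, :] = p.conj()                            # the n-th row of P stores p_n^H
    return P
```

Each row update thus triggers two linear solves, which is precisely the cost the proposed ISS variants remove.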
3. PROPOSED METHOD: ILRMA-T-ISS
We propose a new DR-BSS method that reduces the number of matrix inverse computations. The cost function is the same as that of ILRMA-T, and is defined as

$$\mathcal{J} = \sum_f \left[-2\log\left|\det \mathbf{W}_f\right| + \sum_n \mathbf{g}_{n,f}^{\mathsf{H}} \mathbf{V}_{n,f}\,\mathbf{g}_{n,f}\right], \quad (7)$$

where

$$\mathbf{G}_f = \begin{bmatrix} \mathbf{P}_f \\ \begin{bmatrix}\mathbf{0}_{NL\times N} & \mathbf{E}_{NL}\end{bmatrix} \end{bmatrix} \in \mathbb{C}^{N(L+1)\times N(L+1)},$$

$\mathbf{E}_{NL}$ is the $NL \times NL$ identity matrix, and $\mathbf{g}^{\mathsf{H}}_{n,f}$ is the $n$-th row vector of $\mathbf{G}_f$. Optimization of the parameters is done via ISS [18]. When $n \leq N$, ISS updates $\mathbf{G}$ (the frequency bin index is omitted) as

$$\mathbf{G} \leftarrow \mathbf{G} - \begin{bmatrix} v_{1,n} \\ \vdots \\ v_{N,n} \\ \mathbf{0}_{NL\times 1} \end{bmatrix} \mathbf{g}_n^{\mathsf{H}}. \quad (8)$$

This update rule is the same as that for BSS. The minimization of (7) with respect to $v_{m,n}$ gives

$$v_{m,n} = \begin{cases} \dfrac{\mathbf{g}_m^{\mathsf{H}}\mathbf{V}_m\mathbf{g}_n}{\mathbf{g}_n^{\mathsf{H}}\mathbf{V}_m\mathbf{g}_n} & (m \neq n), \\[2mm] 1 - \left(\mathbf{g}_n^{\mathsf{H}}\mathbf{V}_n\mathbf{g}_n\right)^{-1/2} & (m = n), \end{cases} \quad \forall\, 1 \leq m \leq N. \quad (9)$$

For $n > N$, we propose two update rules, i.e., ILRMA-T-ISS-JOINT and ILRMA-T-ISS-SEQ. These update rules correspond to the dereverberation part of the algorithm.

3.1. ILRMA-T-ISS-JOINT

We call the first update rule ILRMA-T-ISS-JOINT, as it jointly updates $\mathbf{v}_{m,n>N}$ in the following way:

$$\mathbf{G} \leftarrow \mathbf{G} - \begin{bmatrix} \mathbf{v}_{1,n>N} \\ \vdots \\ \mathbf{v}_{N,n>N} \\ \mathbf{0}_{NL\times NL} \end{bmatrix} \mathbf{G}_{n>N}^{\mathsf{H}}, \quad (10)$$

where $\mathbf{v}_{m,n>N} = \left[v_{m,N+1} \cdots v_{m,N(L+1)}\right]$ and $\mathbf{G}_{n>N} = \left[\mathbf{g}_{N+1} \cdots \mathbf{g}_{N(L+1)}\right]$. Minimization of (7) with respect to $\mathbf{v}_{m,n>N}$ for $1 \leq m \leq N$ yields

$$\mathbf{v}_{m,n>N} = \left(\mathbf{g}_m^{\mathsf{H}}\mathbf{V}_m\mathbf{G}_{n>N}\right)\left(\mathbf{G}_{n>N}^{\mathsf{H}}\mathbf{V}_m\mathbf{G}_{n>N}\right)^{-1}, \quad (11)$$

which can be further expanded as

$$\mathbf{v}_{m,n>N} = \left(\sum_t \frac{y_{m,f,t}\,\bar{\mathbf{x}}_{f,t}^{\mathsf{H}}}{r_{m,f,t}}\right)\left(\sum_t \frac{\bar{\mathbf{x}}_{f,t}\,\bar{\mathbf{x}}_{f,t}^{\mathsf{H}}}{r_{m,f,t}}\right)^{-1}. \quad (12)$$

This equation is very similar to the update of the WPE filter in (3). The latter is computed from the cross-correlation between the current microphone input signal and the past microphone input signal. On the other hand, (12) computes the cross-correlation between the estimated output signal of the $m$-th speech source and the past microphone input signal. Thus, WPE-based DR and ISS-based BSS are naturally combined in this framework. Moreover, it only requires the inversion of one $NL \times NL$ matrix, in (12), per iteration and source.
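The inversion-free separation updates (8)–(9) can be sketched as follows for one frequency bin. Naming is our own; the point is that only inner products and one rank-1 subtraction per source are needed.

```python
import numpy as np

def iss_separation_pass(G, V):
    """ISS updates (8)-(9) applied to the first N rows of G (one bin).

    G : (N(L+1), N(L+1)) joint filter matrix; row n holds g_n^H.
    V : (N, N(L+1), N(L+1)) Hermitian weighted covariances V_m.
    No matrix inversion is used anywhere.
    """
    N = V.shape[0]
    D = G.shape[0]
    for n in range(N):
        g_n = G[n, :].conj()                    # g_n as a column vector
        v = np.zeros(D, dtype=complex)
        for m in range(N):
            Vg = V[m] @ g_n
            den = np.real(G[n, :] @ Vg)         # g_n^H V_m g_n (real, positive)
            if m == n:
                v[m] = 1.0 - 1.0 / np.sqrt(den)
            else:
                v[m] = (G[m, :] @ Vg) / den     # g_m^H V_m g_n / g_n^H V_m g_n
        G = G - np.outer(v, G[n, :])            # rank-1 update, eq. (8)
    return G
```

Note that when the covariances are already whitened (e.g., all equal to the identity for an identity filter), the update is a fixed point and leaves `G` unchanged, as expected from (9).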
3.2. ILRMA-T-ISS-SEQ

We call the second update rule ILRMA-T-ISS-SEQ. Instead of the joint update of $\mathbf{v}_{m,n>N}$, $v_{m,n}$ is updated for each $n > N$ sequentially as follows:

$$\mathbf{G} \leftarrow \mathbf{G} - \begin{bmatrix} v_{1,n} \\ \vdots \\ v_{N,n} \\ \mathbf{0}_{NL\times 1}\end{bmatrix}\mathbf{g}_n^{\mathsf{H}}.$$

Minimization of (7) with respect to $v_{m,n}$ gives

$$v_{m,n} = \frac{\mathbf{g}_m^{\mathsf{H}}\mathbf{V}_m\mathbf{g}_n}{\mathbf{g}_n^{\mathsf{H}}\mathbf{V}_m\mathbf{g}_n}, \quad \forall\, 1 \leq m \leq N.$$

This can be further expanded as

$$v_{m,n} = \frac{\sum_t y_{m,f,t}\,\tilde{x}_{n,f,t}^{*}\,/\,r_{m,f,t}}{\sum_t \tilde{x}_{n,f,t}\,\tilde{x}_{n,f,t}^{*}\,/\,r_{m,f,t}}, \quad (13)$$

where $\tilde{x}_{n,f,t}$ is the $n$-th element of $\tilde{\mathbf{x}}_{f,t}$. Note that no matrix inversion whatsoever is needed in ILRMA-T-ISS-SEQ.
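A full ILRMA-T-ISS-SEQ sweep simply extends the same rank-1 update to every row index $n$: the first $N$ indices perform separation (with the rescaling case $m = n$), and the remaining $NL$ indices perform dereverberation via (13). The sketch below uses our own naming and covers one frequency bin.

```python
import numpy as np

def ilrma_t_iss_seq_pass(G, V):
    """One full ILRMA-T-ISS-SEQ sweep for a single frequency bin (sketch).

    G : (N(L+1), N(L+1)) joint filter; the top N rows form P_f, the bottom
        NL rows stay [0 I] and are never modified by the updates.
    V : (N, N(L+1), N(L+1)) Hermitian weighted covariance matrices V_m.
    No matrix inversion is needed at any point.
    """
    N = V.shape[0]
    D = G.shape[0]
    for n in range(D):
        g_n = G[n, :].conj()
        v = np.zeros(D, dtype=complex)
        for m in range(N):
            Vg = V[m] @ g_n
            den = np.real(G[n, :] @ Vg)         # g_n^H V_m g_n
            if m == n:
                v[m] = 1.0 - 1.0 / np.sqrt(den) # rescaling case (n <= N only)
            else:
                v[m] = (G[m, :] @ Vg) / den     # separation / dereverb weight
        G = G - np.outer(v, G[n, :])            # rank-1 update
    return G
```

Because only scalar divisions, inner products, and rank-1 subtractions appear, the sweep needs no linear algebra library beyond basic array arithmetic, which is the practical appeal for embedded targets.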
4. EXPERIMENT

4.1. Setup
We use speech sources from the WSJ corpus [19] for evaluation. To make the reverberant mixtures, we perform room simulations with the pyroomacoustics Python package [20] in random rectangular rooms with walls between and 10 m in length and a ceiling between and high. Simulated reverberation times range from 200 ms to 600 ms. The microphone array is circular, with a radius between 0.075 m and 0.125 m, such that the microphone spacing is at least 0.05 m. The horizontal locations of the microphone array and the speech sources are randomly chosen at least away from the center of the room and at least away from the center of the microphone array, respectively. The vertical locations of the microphone array and the sources range from to and from to high, respectively. The distance between the sources is randomly set to be at least . We add background noise selected from the CHiME3 dataset [21] to each simulated signal. The source signals are normalized to have unit power at the first microphone. Then we define the signal-to-noise ratio SNR = N/σ², where σ² is the variance of uncorrelated white noise at the microphones. The SNR ranges from 10 dB to 30 dB.

We performed separation and dereverberation for 2, 3, and 4 sources on 333 simulated mixtures. The sampling frequency was 16 kHz, and the STFT frame size was 64 ms with three-quarter overlap. We used a Hann window for analysis and the optimally matching window for synthesis. The proposed methods were compared with ILRMA-T-IP [15], ILRMA-IP [17], and ILRMA-ISS. We also evaluated ILRMA-IP and ILRMA-ISS initialized by WPE [10], which we call WPE+ILRMA-IP and WPE+ILRMA-ISS, respectively. For all ILRMA-T-based methods (ILRMA-T-ISS-JOINT, ILRMA-T-ISS-SEQ, and ILRMA-T-IP), we set the tap length L to , the delay parameter Δ to , and the initial DR-BSS filters {P_f}_f to [I_N 0_{N×NL}]. For the ILRMA-based methods (ILRMA-ISS and ILRMA-IP), we set the initial BSS filters {P_f}_f to the identity matrix. For all methods, we set the number of iterations to 100, the number of NMF bases K to , the initial value of {c_{n,f,k}}_{n,f,k} to , and the initial value of {b_{n,k,t}}_{n,k,t} to a random number uniformly distributed over [0. , ). After separation and dereverberation, the scale of the output was restored by projection back onto the first microphone [22].

We measured the scale-invariant signal-to-distortion ratio (SI-SDR) and the scale-invariant signal-to-interference ratio (SI-SIR) [23], the cepstrum distance (CD), and the speech-to-reverberation modulation energy ratio (SRMR). We define ΔSI-SDR and ΔSI-SIR as the difference in SI-SDR and SI-SIR, respectively, before and after the processing.

4.2. Results

Fig. 1 shows the separation performance after 100 iterations of each algorithm. As a whole, the proposed ILRMA-T-based methods significantly outperform the conventional ILRMA-based methods. They also slightly improve performance compared with WPE+ILRMA-IP and WPE+ILRMA-ISS. The ΔSI-SDR and ΔSI-SIR of ILRMA-T-ISS are slightly below those of ILRMA-T-IP, but comparable performance is achieved in less time, as described below. We find that dereverberation improves the separation performance.
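The projection-back scale restoration [22] used above amounts to a per-source least-squares fit of the separated signal against the reference microphone. A minimal sketch, with our own naming, for one frequency bin:

```python
import numpy as np

def projection_back(Y, ref):
    """Rescale each separated source onto a reference microphone signal.

    Y   : (N, T) separated STFT frames in one frequency bin.
    ref : (T,) STFT frames of the reference (first) microphone.
    The scale c_n minimizes ||ref - c_n * y_n||^2 for each source n.
    """
    num = Y.conj() @ ref                     # sum_t y_n[t]^* ref[t]
    den = np.sum(np.abs(Y) ** 2, axis=1)     # sum_t |y_n[t]|^2
    c = num / np.maximum(den, 1e-10)
    return c[:, None] * Y
```

This removes the arbitrary per-source, per-frequency scaling left by the separation filter before the metrics are computed.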
The proposed ILRMA-T-ISS-JOINT and ILRMA-T-ISS-SEQ achieve performance comparable to ILRMA-T-IP. This result is consistent with the reported difference between IP- and ISS-based methods for BSS [18].

Fig. 1: Average SI-SDR improvements, SI-SIR improvements, CD, and SRMR after 100 iterations. Higher is better for all metrics, except CD, for which lower is better.

Fig. 2: Convergence curves of average SI-SDR improvements for varying numbers of sources over 100 iterations.

Fig. 2 shows the comparison of convergence speed. The total runtime of the proposed ILRMA-T-ISS-SEQ is about the same as that of ILRMA-T-IP for N = 2. On the other hand, it is much lower for N = 3, 4. The convergence speed of the proposed ILRMA-T-ISS is slightly slower than that of ILRMA-T-IP, but the final performance is the same. WPE+ILRMA-ISS and WPE+ILRMA-IP seem to converge the fastest, but the WPE initialization time is not included in the figure.
5. CONCLUSION
In this paper, we proposed a joint optimization technique for source separation and dereverberation based on ILRMA-T with ISS. We used this technique to derive two new algorithms. ILRMA-T-ISS-JOINT performs a sequence of ISS updates corresponding to the separation part of the algorithm, followed by a joint update of the dereverberation parameters. Interestingly, this can be seen as a combination of the ISS and WPE updates applied alternately. This form of the algorithm reduces the number of matrix inversions to just one per iteration and source. ILRMA-T-ISS-SEQ gets rid of inversion altogether by updating all parameters via ISS-style rules. This algorithm is very simple and does not need any sophisticated linear algebra library, making it a very good candidate for processing on practical edge or embedded systems. Experimental results showed that, while conceptually simpler, the proposed methods perform just as well on a challenging dataset of noisy reverberant speech mixtures. In future work, we intend to push the method towards real-time applicability and to explore the advantages provided by extra microphones, the so-called overdetermined case [24].
6. REFERENCES

[1] S. Makino, T. Lee, and H. Sawada, Blind Speech Separation. Springer International Publishing, 2007.
[2] S. Makino, Ed., Audio Source Separation. Springer International Publishing, 2018.
[3] H. Sawada, N. Ono, H. Kameoka, D. Kitamura, and H. Saruwatari, "A review of blind source separation methods: two converging routes to ILRMA originating from ICA and NMF," APSIPA Trans. SIP, vol. 8, 2019.
[4] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, Apr. 1994.
[5] T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, "Blind source separation exploiting higher-order frequency dependencies," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 15, no. 1, pp. 70–79, 2006.
[6] A. Hiroe, "Solution of permutation problem in frequency domain ICA, using multivariate probability density functions," in Proc. ICA, 2006, pp. 601–608.
[7] N. Ono and S. Miyabe, "Auxiliary-function-based independent component analysis for super-Gaussian sources," in Proc. LVA/ICA, 2010, pp. 165–172.
[8] N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," in Proc. WASPAA, 2011, pp. 189–192.
[9] P. Naylor and N. Gaubitch, Speech Dereverberation, 1st ed. Springer Publishing Company, Incorporated, 2010.
[10] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 18, no. 7, pp. 1717–1731, 2010.
[11] T. Yoshioka, T. Nakatani, M. Miyoshi, and H. G. Okuno, "Blind separation and dereverberation of speech mixtures by joint optimization," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 19, no. 1, pp. 69–84, Jan. 2011.
[12] M. Togami, Y. Kawaguchi, R. Takeda, Y. Obuchi, and N. Nukaga, "Optimized speech dereverberation from probabilistic perspective for time varying acoustic transfer function," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 21, no. 7, pp. 1369–1380, Jul. 2013.
[13] H. Kagami, H. Kameoka, and M. Yukawa, "Joint separation and dereverberation of reverberant mixtures with determined multichannel non-negative matrix factorization," in Proc. ICASSP, Apr. 2018, pp. 31–35.
[14] R. Ikeshita, N. Ito, T. Nakatani, and H. Sawada, "A unifying framework for blind source separation based on a joint diagonalizability constraint," in Proc. EUSIPCO, Sep. 2019, pp. 1–5.
[15] R. Ikeshita, N. Ito, T. Nakatani, and H. Sawada, "Independent low-rank matrix analysis with decorrelation learning," in Proc. WASPAA, Oct. 2019, pp. 288–292.
[16] M. Togami, "Multi-channel speech source separation and dereverberation with sequential integration of determined and underdetermined models," in Proc. ICASSP, 2020, pp. 231–235.
[17] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 9, pp. 1622–1637, 2016.
[18] R. Scheibler and N. Ono, "Fast and stable blind source separation with rank-1 updates," in Proc. ICASSP, 2020, pp. 236–240.
[19] Linguistic Data Consortium and NIST Multimodal Information Group, CSR-II (WSJ1) Complete LDC94S13A. Philadelphia: Linguistic Data Consortium, 1994.
[20] R. Scheibler, E. Bezzam, and I. Dokmanić, "Pyroomacoustics: A Python package for audio room simulation and array processing algorithms," in Proc. ICASSP, Apr. 2018, pp. 351–355.
[21] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge," Computer Speech and Language, vol. 46, no. C, pp. 605–626, Nov. 2017.
[22] N. Murata, S. Ikeda, and A. Ziehe, "An approach to blind source separation based on temporal structure of speech signals," Neurocomputing, vol. 41, no. 1-4, pp. 1–24, Oct. 2001.
[23] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR — half-baked or well done?" in Proc. ICASSP, May 2019, pp. 626–630.
[24] M. Togami and R. Scheibler, "Over-determined speech source separation and dereverberation," in