Instantaneous PSD Estimation for Speech Enhancement based on Generalized Principal Components
Thomas Dietzen, Marc Moonen, Toon van Waterschoot
Dept. of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven
Leuven, Belgium
{thomas.dietzen, marc.moonen, toon.vanwaterschoot}@esat.kuleuven.be

Abstract—Power spectral density (PSD) estimates of various microphone signal components are essential to many speech enhancement procedures. As speech is highly non-stationary, performance improvements may be gained by maintaining time-variations in PSD estimates. In this paper, we propose an instantaneous PSD estimation approach based on generalized principal components. Similarly to other eigenspace-based PSD estimation approaches, we rely on recursive averaging in order to obtain a microphone signal correlation matrix estimate to be decomposed. However, instead of estimating the PSDs directly from the temporally smooth generalized eigenvalues of this matrix, yielding temporally smooth PSD estimates, we propose to estimate the PSDs from newly defined instantaneous generalized eigenvalues, yielding instantaneous PSD estimates. The instantaneous generalized eigenvalues are defined from the generalized principal components, i.e. a generalized eigenvector-based transform of the microphone signals. We further show that the smooth generalized eigenvalues can be understood as a recursive average of the instantaneous generalized eigenvalues. Simulation results comparing the multi-channel Wiener filter (MWF) with smooth and instantaneous PSD estimates indicate better speech enhancement performance for the latter. A MATLAB implementation is available online.
Index Terms—speech enhancement, instantaneous PSD estimation, generalized eigenvalue decomposition, generalized principal components
I. INTRODUCTION
In speech enhancement [1]–[3], recorded microphone signals constitute a mixture of speech, reverberation and noise. In order to enhance the mixture, many approaches rely on power spectral density (PSD) estimates of the various mixture components.

(This work was carried out at the ESAT Laboratory of KU Leuven, in the frame of KU Leuven internal fund C2-16-00449; VLAIO O&O Project no. HBC.2017.0358; EU FP7-PEOPLE Marie Curie Initial Training Network funded by the European Commission under Grant Agreement no. 316969; the European Union's Horizon 2020 research and innovation program/ERC Consolidator Grant no. 773268. This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information.)

While the problem of PSD estimation has attracted much interest [1], [3]–[12] in speech enhancement, somewhat less attention [4]–[6], [12] is paid to the temporal behavior of PSD estimates. As the PSD is a statistical property defined by means of an expectation operator, its estimation typically involves temporal averaging, which approximates the expectation and requires tuning. Note that while temporal averaging lacks practical alternatives, it causes temporal smoothing and hence may be considered non-ideal in case of speech signals, which are highly non-stationary. Indeed, the non-stationarity of speech may even be explicitly exploited in a number of speech enhancement approaches [1], [13]–[16], such that quickly time-varying PSD estimates potentially yield a better performance than slowly time-varying PSD estimates. In the literature, quickly time-varying PSD estimates are commonly based on short-term statistics, e.g., the local minima of the smoothed microphone signal spectrum [4] or short-term temporal correlations [5], [6]. In [12], we have proposed to restore non-stationarities by desmoothing the generalized eigenvalues of the temporally smooth microphone signal correlation matrix estimate.

In this paper, we propose a multi-microphone eigenspace-based instantaneous
PSD estimation approach based on generalized principal components. Similarly to other eigenspace-based PSD estimation approaches [5], [7], [10], [12], we rely on recursive averaging in order to obtain a microphone signal correlation matrix estimate to be decomposed. However, instead of estimating the PSDs directly from the temporally smooth generalized eigenvalues of this matrix, yielding temporally smooth PSD estimates, we propose to estimate the PSDs from newly defined instantaneous generalized eigenvalues, yielding instantaneous PSD estimates. Here, the instantaneous generalized eigenvalues are defined from the generalized principal components, i.e. a generalized eigenvector-based transform of the microphone signals. As to be shown, the smooth generalized eigenvalues can be understood as a recursive average of the newly defined instantaneous generalized eigenvalues. Simulation results comparing the speech enhancement performance of the multi-channel Wiener filter (MWF) with smooth and instantaneous PSD estimates indicate better performance for the latter. A MATLAB implementation and audio examples are available online [17].

(Strictly speaking, the term 'PSD' may be said to be inadequate for the instantaneous quantities estimated in this paper, as our approach partly bypasses the use of an expectation or its approximation by means of temporal averaging. Nonetheless, due to the strong relation to expectation-based PSD estimation, we prefer to maintain the terminology.)

In Sec. II, we present the signal model. In Sec. III, we briefly review the MWF, which serves as an example for the application of PSD estimates and is used to evaluate PSD estimates in this paper. Eigenspace-based PSD estimation is discussed in Sec. IV, where we outline an implementation yielding smooth PSD estimates and propose the alternative approach yielding instantaneous PSD estimates. Both implementations are evaluated in Sec. V.

II. SIGNAL MODEL
We employ the following notation: vectors are denoted by lower-case boldface letters, matrices by upper-case boldface letters, I denotes the identity matrix, and A^T, A^H, E[A], and ‖A‖_F denote the transpose, the complex conjugate transpose, the expected value, and the Frobenius norm of the matrix A. The operation diag[A] creates a column vector from the diagonal elements of the matrix A, while Diag[a] creates a diagonal matrix from the elements of the vector a. The exponential function with argument a is denoted by exp[a].

In the short-time Fourier transform (STFT) domain, with m, l, and k indexing the microphone, the frame, and the frequency bin, respectively, and M the number of microphones, let the microphone signals be denoted by y_m(l, k) ∈ C with m = 1, ..., M. As we treat all frequency bins independently, the frequency bin index is omitted in the following. We define the stacked microphone signal vector y(l) ∈ C^M,

  y(l) = (y_1(l) · · · y_M(l))^T,   (1)

composed of the reverberant speech component x(l) originating from a single point source and the noise component v(l),

  y(l) = x(l) + v(l).   (2)

The reverberant speech component x(l) may be decomposed into the early component x_e(l) containing the direct component and early reflections, and the late reverberant component x_ℓ(l) containing late reflections, i.e.

  x(l) = x_e(l) + x_ℓ(l),   (3)

which are assumed to have distinct spatial properties as outlined below. Early reflections are assumed to arrive within the same frame, with the early components in x_e(l) related by the relative early transfer functions (RETFs) in h ∈ C^M, i.e.

  x_e(l) = h s(l).   (4)

Here, h is assumed to be relative to the first microphone, i.e. h_1 = 1, and s(l) = x_{e|1}(l) denotes the early component in the first microphone, in the following referred to as the early speech source image. We consider h to be known or previously estimated [3], [12], [18].
We assume that x_e(l), x_ℓ(l), and v(l) are mutually uncorrelated [7]–[12]. Let Ψ_y(l) = E[y(l) y^H(l)] ∈ C^{M×M} denote the microphone signal correlation matrix, and let Ψ_{x_e}(l), Ψ_{x_ℓ}(l), and Ψ_v(l) be similarly defined. With (2)–(4), we then find

  Ψ_y(l) = Ψ_{x_e}(l) + Ψ_{x_ℓ}(l) + Ψ_v(l),   (5)

wherein Ψ_{x_e}(l) has rank one and is expressed by

  Ψ_{x_e}(l) = φ_s(l) h h^H,   (6)

with φ_s(l) denoting the PSD of the early speech source image s(l). Assuming that x_ℓ(l) and v(l) may be modeled as diffuse [7]–[11], [19] with coherence matrix Γ ∈ C^{M×M}, which may be computed from the microphone array geometry [19] and is therefore considered to be known, we may write Ψ_{x_ℓ}(l) + Ψ_v(l) as

  Ψ_{x_ℓ}(l) + Ψ_v(l) = φ_d(l) Γ,   (7)

with

  φ_d(l) = φ_{x_ℓ}(l) + φ_v(l),   (8)

and φ_{x_ℓ}(l) and φ_v(l) denoting the PSD of the late reverberant component x_ℓ(l) and the noise component v(l), respectively. With s(l) representing speech, and in particular if v(l) represents babble noise, both PSDs φ_s(l) and φ_d(l) may be considered highly non-stationary, while the associated coherence matrices h h^H and Γ are often considered time-invariant [7]–[10].

In the remainder, as we mostly consider the single frame l only, we also drop the frame index for conciseness and refer back to it only where necessary, namely when we differentiate the frames l and l−1 in recursive equations.

III. MULTI-CHANNEL WIENER FILTER
PSD estimates are used in a variety of speech enhancement procedures. In this paper, we evaluate our PSD estimation approach in Sec. V by means of the MWF, which is therefore briefly summarized below.

The MWF w_MWF is obtained [2], [3] by minimizing the expected error between the filter output and the early speech source image, i.e.

  w_MWF = arg min_w E[|w^H y − s|²] = φ_s Ψ_y^{−1} h.   (9)

It is well known that the MWF can be decomposed [2], [3] into a minimum variance distortionless response (MVDR) beamformer and a spectral gain as

  w_MWF = (Γ^{−1} h)/(h^H Γ^{−1} h) · φ_s/(φ_s + φ_d/(h^H Γ^{−1} h)),   (10)

where the first factor is the MVDR beamformer and the second factor is the spectral gain. Hence, if both Γ and h are assumed to be known or previously estimated, the problem of implementing the MWF reduces to estimating the PSDs φ_s and φ_d. If, on the one hand, the PSD estimates to be obtained are slowly time-varying, the spectral gain will contribute to speech enhancement mostly through variations across frequency. If, on the other hand, instantaneous PSD estimates are obtained, the spectral gain will vary across both frequency and time and thereby act as a spectro-temporal mask [13]–[15].

IV. EIGENSPACE-BASED PSD ESTIMATION
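The equivalence of the direct expression (9) and the decomposition (10) is easy to check numerically. The sketch below is only an illustration (it is not the authors' MATLAB implementation): it uses a randomly generated Hermitian positive definite stand-in for Γ and hypothetical values for h, φ_s and φ_d.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4

# Hypothetical Hermitian positive definite stand-in for the coherence matrix Gamma
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Gamma = A @ A.conj().T / M + np.eye(M)

# Hypothetical RETF vector (first entry fixed to 1, cf. Sec. II) and PSDs
h = np.concatenate(([1.0], rng.standard_normal(M - 1))).astype(complex)
phi_s, phi_d = 2.0, 0.5

# Microphone signal correlation matrix, cf. (5)-(7)
Psi_y = phi_s * np.outer(h, h.conj()) + phi_d * Gamma

# Direct MWF, eq. (9)
w_mwf = phi_s * np.linalg.solve(Psi_y, h)

# MVDR beamformer and spectral gain, eq. (10)
Gih = np.linalg.solve(Gamma, h)            # Gamma^{-1} h
denom = (h.conj() @ Gih).real              # h^H Gamma^{-1} h, real for Hermitian PD Gamma
w_mvdr = Gih / denom
gain = phi_s / (phi_s + phi_d / denom)

assert np.allclose(w_mwf, w_mvdr * gain)   # (9) and (10) coincide
```

Since the spectral gain only rescales the MVDR output, slowly versus quickly time-varying PSD estimates translate directly into slowly versus quickly time-varying gains.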
Multi-microphone PSD estimation is commonly based on the spatial properties defined in (4)–(8), which may be exploited in an eigenspace decomposition [7], [10], [12]. In Sec. IV-A, we first introduce an eigenspace model of Ψ_y and Γ. In Sec. IV-B, we outline how PSD estimates may be obtained given an eigenvalue and an eigenspace basis estimate. In Sec. IV-C, we consider an implementation based on temporally smooth eigenvalues, and in Sec. IV-D, we propose an implementation based on instantaneous generalized principal components.

A. Eigenspace Model
We define the generalized eigenvalue decomposition (GEVD) [7], [10], [12], [18] of Ψ_y and the diffuse coherence matrix Γ, cf. (7), i.e.

  Ψ_y P = Γ P Diag[λ_y],   (11)

where λ_y ∈ R^M comprises the generalized eigenvalues λ_y|m, and the columns p_m of P ∈ C^{M×M} comprise the associated generalized eigenvectors. The generalized eigenvectors in P are uniquely defined up to a scaling factor and, for any factorization Γ = Γ^{1/2} Γ^{H/2}, may be chosen such that Γ^{H/2} P becomes unitary due to Ψ_y and Γ being Hermitian. The matrices Ψ_y and Γ are then diagonalized by

  P^H Ψ_y P = Diag[λ_y],   (12)
  P^H Γ P = I,   (13)

cf. also (11). While the eigenspace basis P varies with the spatial coherence matrices h h^H and Γ only and is therefore time-invariant in the assumed spatially stationary scenario, the generalized eigenvalues in λ_y vary with the PSDs φ_s and φ_d and hence over time. Using (5) and (7) in (12)–(13) yields

  Diag[λ_y] = Diag[λ_{x_e}] + Diag[λ_d],   (14)

with

  Diag[λ_{x_e}] = P^H Ψ_{x_e} P,   (15)
  Diag[λ_d] = φ_d I.   (16)

In (15), Ψ_{x_e} and therefore Diag[λ_{x_e}] have rank one. Provided that the generalized eigenvalues and eigenvectors are sorted such that λ_y|1 is the largest generalized eigenvalue, λ_{x_e} hence takes the form

  λ_{x_e} = (λ_{x_e|1} 0 · · · 0)^T.   (17)

From (14)–(17) it then follows that λ_y|1 = λ_{x_e|1} + φ_d and λ_y|m = φ_d for m > 1 [10].

B. Eigenspace-based PSD Estimation
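The eigenspace model above can be illustrated numerically. The following sketch (hypothetical values, not taken from the paper's implementation) constructs Ψ_y according to (5)–(7) and computes the GEVD with scipy.linalg.eigh, which returns the generalized eigenvalues in ascending order and eigenvectors normalized such that (13) holds:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
M = 4

# Hypothetical Hermitian positive definite stand-in for Gamma
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Gamma = A @ A.conj().T / M + np.eye(M)

h = np.concatenate(([1.0], rng.standard_normal(M - 1))).astype(complex)
phi_s, phi_d = 2.0, 0.5
Psi_y = phi_s * np.outer(h, h.conj()) + phi_d * Gamma    # cf. (5)-(7)

# GEVD of (Psi_y, Gamma), eq. (11); eigenvalues in ascending order
lam, P = eigh(Psi_y, Gamma)

# eqs. (12)-(13): P diagonalizes both matrices
assert np.allclose(P.conj().T @ Gamma @ P, np.eye(M))
assert np.allclose(P.conj().T @ Psi_y @ P, np.diag(lam))

# eqs. (14)-(17): M-1 eigenvalues equal phi_d, the largest carries the speech term
assert np.allclose(lam[:-1], phi_d)
lam_xe1 = phi_s * (h.conj() @ np.linalg.solve(Gamma, h)).real
assert np.isclose(lam[-1], lam_xe1 + phi_d)
```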
Assume that an estimate Ψ̂_y is available, from which the eigenvalue and eigenspace basis estimates λ̂_y and P̂ are obtained. Further, assume that the RETF h is known or previously estimated. Estimates φ̂_s and φ̂_d can then be obtained in the following manner. Given λ̂_y, we first obtain φ̂_d and λ̂_{x_e|1} according to (14)–(17) [10] as

  φ̂_d = 1/(M−1) Σ_{m=2}^{M} λ̂_y|m,   (18)
  λ̂_{x_e|1} = λ̂_y|1 − φ̂_d,   (19)

where the averaging in (18) accounts for modeling and estimation errors and (19) is guaranteed non-negative. Noting that P̂^{−1} = P̂^H Γ according to (13), we can define a rank-one estimate Ψ̂_{x_e} [12] as

  Ψ̂_{x_e} = Γ P̂ Diag[λ̂_{x_e}] P̂^H Γ = λ̂_{x_e|1} Γ p̂_1 p̂_1^H Γ,   (20)

with λ̂_{x_e} defined similarly to (17). An estimate φ̂_s may then be obtained by minimizing the difference between Ψ̂_{x_e} and φ_s h h^H according to (6) [12], i.e.

  φ̂_s = arg min_{φ_s} ‖φ_s h h^H − Ψ̂_{x_e}‖_F = λ̂_{x_e|1} |h^H Γ p̂_1 / (h^H h)|².   (21)

Note that the temporal characteristics of the estimates φ̂_s and φ̂_d directly depend upon the temporal characteristics of λ̂_y.

C. Smooth Eigenvalue-based Implementation
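A self-contained numerical sketch of the estimators (18), (19) and (21) (again with hypothetical values and scipy.linalg.eigh standing in for an actual implementation) recovers φ_s and φ_d exactly when the model (5)–(7) holds:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
M = 4

# Hypothetical stand-ins for Gamma, the RETF h, and the true PSDs
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Gamma = A @ A.conj().T / M + np.eye(M)
h = np.concatenate(([1.0], rng.standard_normal(M - 1))).astype(complex)
phi_s, phi_d = 2.0, 0.5
Psi_y = phi_s * np.outer(h, h.conj()) + phi_d * Gamma

lam, P = eigh(Psi_y, Gamma)        # GEVD, eigenvalues in ascending order

phi_d_hat = lam[:-1].mean()        # eq. (18): average of the M-1 smaller eigenvalues
lam_xe1_hat = lam[-1] - phi_d_hat  # eq. (19)
p1 = P[:, -1]                      # principal generalized eigenvector
phi_s_hat = lam_xe1_hat * np.abs(h.conj() @ Gamma @ p1 / (h.conj() @ h)) ** 2  # eq. (21)

assert np.isclose(phi_d_hat, phi_d)
assert np.isclose(phi_s_hat, phi_s)
```

In practice Ψ̂_y deviates from the model, so the recovery is only approximate; the averaging in (18) then mitigates estimation errors.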
A temporally smooth estimate of Ψ_y = E[y y^H], in the following denoted by Ψ̂_y|sm, is typically obtained by recursively averaging y y^H using some pre-defined forgetting factor ζ ∈ (0, 1), namely by

  Ψ̂_y|sm(l) = ζ Ψ̂_y|sm(l−1) + (1−ζ) y(l) y^H(l).   (22)

The forgetting factor ζ may be expressed in terms of a time constant τ as

  ζ = exp[−R/(f_s τ)],   (23)

where R is the STFT frame shift in samples, f_s is the sampling rate, and τ may be thought of as an equivalent window length. Given Ψ̂_y|sm, we can perform the GEVD Ψ̂_y|sm P̂ = Γ P̂ Diag[λ̂_y|sm] similarly to (11)–(13) in each frame l. Here, P̂ slightly fluctuates over time due to modeling and estimation errors (while P itself is time-invariant, cf. Sec. IV-A), and λ̂_y|sm is a smooth estimate of λ_y. Consequently, if we estimate the PSDs φ_s and φ_d directly from λ̂_y|sm according to Sec. IV-B, we obtain equally smooth estimates φ̂_s|sm and φ̂_d|sm. Note that in order to span all M eigenspace dimensions and hence to obtain a meaningful decomposition, Ψ̂_y|sm needs to be well-conditioned, and so τ should scale with M and must be sufficiently large.

(Since h_1 = 1, cf. Sec. II, one may alternatively obtain an estimate φ̂_s directly from the upper left element of Ψ̂_{x_e} [12]. During speech pauses, however, where Ψ̂_{x_e} deviates from zero due to modeling and estimation errors only, the estimator in (21) is more robust.)

D. Instantaneous Principal Component-based Implementation

In order to obtain instantaneous eigenspace-based PSD estimates while still relying on recursive averaging as in (22) with a sufficiently large time constant τ, we propose to compute instantaneous generalized eigenvalues λ̂_y|inst based on generalized principal components instead of using the smooth generalized eigenvalues λ̂_y|sm directly.

In order to introduce the generalized principal components and establish their relation to the generalized eigenvalues, let us reconsider the GEVD in (11)–(13). From the generalized eigenvectors in P, we can define the generalized principal components of y as

  ϑ = P^H y.   (24)
Note that with Ψ_y = E[y y^H], the generalized principal components in (24) are related to the generalized eigenvalues in (12) by

  λ_y = diag[E[ϑ ϑ^H]].   (25)

Now, assume that we have obtained Ψ̂_y|sm and its generalized eigenvectors in P̂ as described in Sec. IV-C. Then, with ϑ̂ = P̂^H y, we define the instantaneous generalized eigenvalues

  λ̂_y|inst = diag[ϑ̂ ϑ̂^H],   (26)

which maintain non-stationarities as they directly depend on the microphone signal y, cf. (24). Based on λ̂_y|inst, we can then obtain instantaneous PSD estimates φ̂_s|inst and φ̂_d|inst according to Sec. IV-B.

Note that we may also establish a relation between the instantaneous generalized eigenvalues in (26) and the smooth generalized eigenvalues obtained in Sec. IV-C. With λ̂_y|sm = diag[P̂^H Ψ̂_y|sm P̂] according to (12), inserting Ψ̂_y|sm from (22) and using (24), (26), we find

  λ̂_y|sm(l) = ζ diag[P̂^H(l) Ψ̂_y|sm(l−1) P̂(l)] + (1−ζ) λ̂_y|inst(l),   (27)

where any time variations in P̂(l) are due to modeling and estimation errors only, cf. Sec. IV-C, such that diag[P̂^H(l) Ψ̂_y|sm(l−1) P̂(l)] ≈ λ̂_y|sm(l−1). The smooth generalized eigenvalues λ̂_y|sm therefore nearly correspond to a recursive average of the instantaneous generalized eigenvalues λ̂_y|inst.

V. SIMULATIONS
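As a preview of the two implementations compared below, the recursion (22), the forgetting factor mapping (23), and the per-frame relation (27) can be demonstrated on synthetic diffuse-only frames. The sketch (NumPy/SciPy with hypothetical parameters; not the evaluation code of this section) verifies that (27) holds exactly in each frame:

```python
import numpy as np
from scipy.linalg import cholesky, eigh

rng = np.random.default_rng(3)
M, L = 4, 200
R, fs, tau = 256, 16000.0, 0.5
zeta = np.exp(-R / (fs * tau))                 # eq. (23)

# Hypothetical stand-in coherence matrix and a factor with G12 @ G12^H = Gamma
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Gamma = A @ A.conj().T / M + np.eye(M)
G12 = cholesky(Gamma, lower=True)

Psi_sm = np.eye(M, dtype=complex)              # initial smooth estimate
for l in range(L):
    # non-stationary diffuse PSD and a matching random frame y(l)
    phi_d = 1.0 + np.sin(2 * np.pi * l / 50) ** 2
    y = np.sqrt(phi_d / 2) * (G12 @ (rng.standard_normal(M) + 1j * rng.standard_normal(M)))

    Psi_prev = Psi_sm.copy()
    Psi_sm = zeta * Psi_prev + (1 - zeta) * np.outer(y, y.conj())   # eq. (22)

    lam_sm, P = eigh(Psi_sm, Gamma)            # smooth generalized eigenvalues
    theta = P.conj().T @ y                     # generalized principal components, eq. (24)
    lam_inst = np.abs(theta) ** 2              # instantaneous eigenvalues, eq. (26)

# relation (27), checked for the last frame
rhs = zeta * np.diag(P.conj().T @ Psi_prev @ P).real + (1 - zeta) * lam_inst
assert np.allclose(lam_sm, rhs)
```

Since λ̂_y|inst in (26) depends on the current frame only, feeding it into the estimators of Sec. IV-B yields PSD estimates that retain the non-stationarity of the scene, while λ̂_y|sm tracks its recursive average.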
In this section, we compare the speech enhancement performance of the MWF with smooth PSD estimates φ̂_s|sm, φ̂_d|sm according to Sec. IV-C and the MWF with instantaneous PSD estimates φ̂_s|inst, φ̂_d|inst according to Sec. IV-D as a function of the time constant τ.

[Fig. 1: The forgetting factor ζ and the performance measures PESQ, STOI, SIR_fws, and CD versus τ for the first microphone signal, the MVDR, the MWF with smooth PSD estimates, and the MWF with instantaneous PSD estimates. The graphs denote the median scores over all scenarios; the shaded areas indicate the range from the first to the third quartile.]

In our simulations, we use a linear array of M = 5 microphones spaced by … . In total, … scenarios are generated. The source is positioned … away at an angle of {…}° relative to the broadside direction of the microphone array, where sound propagation is modeled using measured room impulse responses (RIRs) [20] of 0.
61 s reverberation time. In each source position, both male and female speech are used as source signals, where we select sections of 10 s from each of the source signal files [21]. Diffuse babble noise [22], [23] is added at a signal-to-noise ratio (SNR) of …, where the SNR is defined as the power ratio of x and v in the time domain. The sampling rate is f_s = 16 kHz. The STFT processing uses square-root Hann windows of … samples with R = 256 samples overlap. The presumed available estimates of the RETFs in h are generated based on the directions of arrival, i.e. the estimate corresponds to the free-field steering vector. We measure performance in terms of the perceptual evaluation of speech quality PESQ [24] with mean opinion scores ∈ [1, 4.5], the short-time objective intelligibility STOI [25] with scores ∈ [0, 1], the frequency-weighted segmental signal-to-interference ratio SIR_fws [1] in dB, and the cepstral distance CD [1] in dB. The clean reference signal is generated by convolving the speech source signal with the early part of the RIR to the first microphone. The computed measures are averaged over all scenarios.

Fig. 1 reports the simulation results. As to be expected, both versions of the MWF outperform the MVDR, which in turn shows some improvement over the unprocessed microphone signal. The two versions of the MWF however show a different behavior. The MWF with smooth PSD estimates reaches a fairly sharp performance peak at relatively low values of τ, with decreasing performance for larger values, where the spectral gain in (10) becomes less time-variant. This behavior is explained by the fact that when computing smooth PSD estimates according to Sec. IV-C, the time constant τ trades off the accuracy of the eigenspace basis estimate P̂ on the one hand and the degree of non-stationarity maintained in the PSD estimates φ̂_s|sm and φ̂_d|sm on the other hand. The MWF with instantaneous PSD estimates in contrast shows a monotonous performance increase in τ, which facilitates tuning. This is explained by the fact that when computing instantaneous PSD estimates according to Sec.
IV-D, the accuracy of the eigenspace basis estimate P̂ still increases with τ, while the instantaneous PSD estimates φ̂_s|inst and φ̂_d|inst maintain non-stationarities independently of τ. At large values of τ, in all measures, the improvement with respect to the MVDR is more than twice as large for the MWF with instantaneous PSD estimates as compared to the MWF with smooth PSD estimates. Note that in a spatially dynamic scenario with time-varying RETF h, the eigenspace basis P becomes time-variant, in which case the performance of the MWF with instantaneous PSD estimates might possibly not increase monotonically in τ anymore, but will presumably show a peak depending on the pace of RETF variations.

VI. CONCLUSION
In this paper, as an alternative to smooth PSD estimation based on smooth generalized eigenvalues, we have proposed an instantaneous PSD estimation approach based on generalized principal components. The instantaneous PSD estimates maintain non-stationarities and hence potentially outperform smooth PSD estimates for speech enhancement, as exemplarily shown for the MWF.

REFERENCES

[1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2007.
[2] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, "Acoustic beamforming for hearing aid applications," in Handbook on Array Processing and Sensor Networks. Wiley, 2010, pp. 269–302.
[3] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, "A consolidated perspective on multimicrophone speech enhancement and source separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 4, pp. 692–730, Apr. 2017.
[4] I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging," IEEE Trans. Audio, Speech, Lang. Process., vol. 11, no. 5, pp. 466–475, Sep. 2003.
[5] R. C. Hendriks, J. J. Jensen, and R. Heusdens, "Noise tracking using DFT domain subspace decompositions," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 3, pp. 541–553, Mar. 2008.
[6] A. H. Kamkar-Parsi and M. Bouchard, "Instantaneous binaural target PSD estimation for hearing aid noise reduction in complex acoustic environments," IEEE Trans. Instrument. Meas., vol. 60, no. 4, pp. 1141–1154, Apr. 2011.
[7] A. Kuklasiński, S. Doclo, S. H. Jensen, and J. Jensen, "Maximum likelihood PSD estimation for speech enhancement in reverberation and noise," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 9, pp. 1599–1612, Sep. 2016.
[8] O. Schwartz, S. Gannot, and E. A. P. Habets, "Joint estimation of late reverberant and speech power spectral densities in noisy environments using Frobenius norm," in Proc. 24th European Signal Process. Conf. (EUSIPCO 2016), Budapest, Hungary, Aug. 2016, pp. 1123–1127.
[9] S. Braun, A. Kuklasiński, O. Schwartz, O. Thiergart, E. A. P. Habets, S. Gannot, S. Doclo, and J. Jensen, "Evaluation and comparison of late reverberation power spectral density estimators," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 6, pp. 1056–1071, June 2018.
[10] I. Kodrasi and S. Doclo, "Analysis of eigenvalue decomposition-based late reverberation power spectral density estimation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 6, pp. 1102–1114, June 2018.
[11] A. I. Koutrouvelis, R. C. Hendriks, R. Heusdens, and J. Jensen, "Robust joint estimation of multi-microphone signal model parameters," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 7, pp. 1136–1150, July 2019.
[12] T. Dietzen, S. Doclo, M. Moonen, and T. van Waterschoot, "Square root-based multi-source early PSD estimation and recursive RETF update in reverberant environments by means of the orthogonal Procrustes problem," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 755–769, Jan. 2020.
[13] N. Li and P. C. Loizou, "Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction," J. Acoust. Soc. Amer., vol. 123, no. 3, pp. 1673–1682, Mar. 2008.
[14] D. L. Wang, U. Kjems, M. S. Pedersen, J. B. Boldt, and T. Lunner, "Speech intelligibility in background noise with ideal binary time-frequency masking," J. Acoust. Soc. Amer., vol. 125, no. 4, pp. 2336–2347, Apr. 2009.
[15] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proc. 2013 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP 2013), Vancouver, BC, Canada, May 2013, pp. 7092–7096.
[16] T. Dietzen, S. Doclo, M. Moonen, and T. van Waterschoot, "Integrated sidelobe cancellation and linear prediction Kalman filter for joint multi-microphone dereverberation, interfering speech cancellation, and noise reduction," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 740–754, Jan. 2020.
[17] T. Dietzen, "GitHub repository: instantaneous PSD estimation for speech enhancement based on generalized principal components," https://github.com/tdietzen/INSTANT-PSD, Mar. 2020.
[18] S. Markovich-Golan and S. Gannot, "Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method," in Proc. 2015 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP 2015), Brisbane, QLD, Australia, Apr. 2015, pp. 544–548.
[19] F. Jacobsen and T. Roisin, "The coherence of reverberant sound fields," J. Acoust. Soc. Amer., vol. 108, no. 1, pp. 204–210, July 2000.
[20] E. Hadad, F. Heese, P. Vary, and S. Gannot, "Multichannel audio database in various acoustic environments," in Proc. 2014 Int. Workshop Acoustic Signal Enhancement (IWAENC 2014), Antibes – Juan les Pins, France, Sept. 2014, pp. 313–317.
[21] Bang and Olufsen, "Music for Archimedes," Compact Disc B&O, 1992.
[22] E. A. P. Habets, I. Cohen, and S. Gannot, "Generating nonstationary multisensor signals under a spatial coherence constraint," J. Acoust. Soc. Amer., vol. 124, no. 5, pp. 2911–2917, Nov. 2008.
[23] Auditec, "Auditory tests (revised)," Compact Disc Auditec, 1997.
[24] ITU-T, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," ITU-T Recommendation P.862, Int. Telecommun. Union, Geneva, Switzerland, Feb. 2001.
[25] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech,"