A Robust Maximum Likelihood Distortionless Response Beamformer based on a Complex Generalized Gaussian Distribution
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. XX, FEBRUARY 2021
Weixin Meng, Chengshi Zheng, and Xiaodong Li
Abstract—For multichannel speech enhancement, this letter derives a robust maximum likelihood distortionless response beamformer by modeling speech sparse priors with a complex generalized Gaussian distribution, which we refer to as the CGGD-MLDR beamformer. The proposed beamformer can be regarded as a generalization of the minimum power distortionless response beamformer and its improved variations. For narrowband applications, we also reveal that the proposed beamformer reduces to the minimum dispersion distortionless response beamformer, which has been derived with $\ell_p$-norm minimization. The mechanisms by which the proposed beamformer improves robustness are clearly pointed out, and experimental results show its better performance in terms of PESQ improvement.

Index Terms—Adaptive beamforming, maximum-likelihood estimation, complex generalized Gaussian distribution.
I. INTRODUCTION

Microphone array beamforming has been widely used to extract the desired speech and to suppress both interferences and noise for speech communication and automatic speech recognition (ASR). Typically, there are two types of microphone array beamforming algorithms: data-independent fixed beamformers and data-dependent adaptive beamformers. Generally, adaptive beamformers are more powerful than fixed beamformers in suppressing directional interferences adaptively. There are many well-known adaptive beamformers, including the minimum power distortionless response (MPDR) beamformer [1]–[3], the generalized sidelobe canceller (GSC) [4]–[6], and the multichannel Wiener filter (MWF) [7]–[13]. Both the MPDR beamformer and the GSC are quite sensitive to the steering vector error of the desired speech, while the performance of the MWF depends on the estimation accuracy of the second-order statistics of the desired speech and of the interference-plus-noise component. Among these beamformers, the MPDR beamformer and its variations are the most widely studied and applied.

There are at least two ways to improve the robustness of the MPDR beamformer: one is to improve the estimation accuracy of the steering vector of the desired speech [14]–[19], and the other is to estimate the interference-plus-noise power spectral density (PSD) matrix to replace the noisy PSD matrix [20]–[25]. For practical applications, these two ways can be combined to further improve the
The authors are with the Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China, and also with the University of Chinese Academy of Sciences, Beijing, 100049, China (email: [email protected]). Manuscript received February XX, 2021; revised XXXX XX, XX.

performance. However, one cannot expect that the steering vector can be estimated accurately and that the interference-plus-noise PSD matrix contains no desired speech component, especially in low signal-to-interference-plus-noise ratio (SINR) conditions. This paper focuses on improving the robustness of the MPDR beamformer without explicitly estimating the interference-plus-noise PSD matrix.

When assuming that the desired speech in the frequency domain follows a complex Gaussian distribution (CGD) with time-varying variances, a maximum likelihood distortionless response (MLDR) beamformer has been derived in [26], and it reduces the word error rates for ASR. When the signal, interferences, and noise are considered non-Gaussian distributed, a minimum dispersion distortionless response (MDDR) beamformer has been derived with $\ell_p$-norm minimization for narrowband applications [27], [28]. The relationship between the MLDR beamformer and the MPDR beamformer has not been clearly revealed, and its mechanism for improving performance needs further clarification. Moreover, the best choice of $p$ in the MDDR is not straightforward and needs to be studied in a more theoretical way.

In this letter, we derive a robust maximum likelihood distortionless response beamformer by introducing a complex generalized Gaussian distribution to model speech sparse priors [29], [30], which we refer to as the CGGD-MLDR beamformer. One can see that the proposed beamformer is a generalization of the MPDR beamformer and can reduce to many existing variations of the MPDR beamformer.
This letter also shows the mechanism of the CGGD-MLDR beamformer in improving the robustness of the MPDR beamformer. Finally, we propose an iterative optimization algorithm to obtain the optimal weight vector of the proposed beamformer. Experimental results show that the proposed beamformer can achieve better performance by properly choosing the shape parameter $p$.

The remainder of this letter is organized as follows. Section II presents the problem formulation and related work. In Section III, we derive the CGGD-MLDR beamformer, study its relationship with the MPDR beamformer and its variations, and present the mechanism for improving robustness. In Section IV, we study the performance of the CGGD-MLDR beamformer and compare it with the MLDR and MPDR beamformers. Section V gives some conclusions.

II. PROBLEM FORMULATION AND RELATED WORK
We assume that a desired speech source and some independent directional noise sources impinge on an arbitrarily shaped microphone array consisting of $M$ microphones. By applying the short-time Fourier transform (STFT), the microphone signals can be stacked into a vector of length $M$, $\mathbf{y}(k,l) = [Y_1(k,l), \ldots, Y_M(k,l)]^T$, where $k$ denotes the frequency index and $l$ denotes the frame index, so that

$$\mathbf{y}(k,l) = \mathbf{h}(k)S(k,l) + \mathbf{v}(k,l) = \mathbf{x}(k,l) + \mathbf{v}(k,l), \tag{1}$$

where $S(k,l)$ denotes the desired speech; $\mathbf{h}(k)$ denotes the acoustic transfer function (ATF) vector of the desired speech; and $\mathbf{v}(k,l)$ denotes the interference-plus-noise vector. The objective of beamforming is to design a spatial filter $\mathbf{w}(k)$ that can be applied to extract the desired speech:

$$\hat{S}(k,l) = \mathbf{w}^H(k)\mathbf{y}(k,l) = \mathbf{w}^H(k)\mathbf{h}(k)S(k,l) + \mathbf{w}^H(k)\mathbf{v}(k,l). \tag{2}$$

The well-known MPDR beamformer minimizes the output power subject to a distortionless constraint on the desired direction:

$$\min_{\mathbf{w}(k)}\; E\left\{\left|\mathbf{w}^H(k)\mathbf{y}(k,l)\right|^2\right\} \quad \text{s.t.}\; \mathbf{w}^H(k)\mathbf{h}(k) = 1. \tag{3}$$

The closed-form solution of (3) can be written as

$$\mathbf{w}_{\mathrm{MPDR}}(k) = \frac{\mathbf{R}_{yy}^{-1}(k)\mathbf{h}(k)}{\mathbf{h}^H(k)\mathbf{R}_{yy}^{-1}(k)\mathbf{h}(k)}, \tag{4}$$

where

$$\mathbf{R}_{yy}(k) = E\left\{\mathbf{y}(k,l)\mathbf{y}^H(k,l)\right\} = \lambda_s(k)\boldsymbol{\Upsilon}_{ss}(k) + \lambda_v(k)\boldsymbol{\Upsilon}_{vv}(k) \tag{5}$$

denotes the noisy PSD matrix; $\lambda_s(k)$ denotes the PSD of the desired speech; $\boldsymbol{\Upsilon}_{ss}(k)$ denotes the desired speech correlation matrix; $\lambda_v(k)$ denotes the PSD of the interference-plus-noise signal; and $\boldsymbol{\Upsilon}_{vv}(k)$ denotes the interference-plus-noise correlation matrix. In practice, the noisy PSD matrix needs to be replaced by its sample covariance matrix:

$${}^{\mathrm{MPDR}}\hat{\mathbf{R}}_{yy}(k) = \sum_{l=1}^{L}\mathbf{y}(k,l)\mathbf{y}^H(k,l). \tag{6}$$
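As a concrete illustration of (3)–(6), the sketch below (a minimal NumPy example, not the authors' code; the `diag_load` safeguard is our own addition, not part of Eq. (4)) forms the sample covariance of one frequency bin and computes the MPDR weights:

```python
import numpy as np

def mpdr_weights(Y, h, diag_load=1e-6):
    """MPDR weights from noisy snapshots of one frequency bin.

    Y : (M, L) complex STFT snapshots y(k, l).
    h : (M,) ATF (steering) vector of the desired speech.
    diag_load : small diagonal loading (our own safeguard) so the
                sample covariance stays invertible for short segments.
    """
    M, L = Y.shape
    # Sample covariance, Eq. (6); the 1/L scaling does not change the
    # weights because the solution is invariant to a scaling of R.
    R = (Y @ Y.conj().T) / L + diag_load * np.eye(M)
    Rinv_h = np.linalg.solve(R, h)
    # Distortionless solution, Eq. (4): w = R^{-1} h / (h^H R^{-1} h).
    return Rinv_h / (h.conj() @ Rinv_h)
```

Since $\mathbf{h}^H\mathbf{R}^{-1}\mathbf{h}$ is real and positive for a Hermitian positive-definite $\mathbf{R}$, the returned weights satisfy the constraint $\mathbf{w}^H\mathbf{h} = 1$ exactly.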
Accordingly, the estimated optimal weight vector of the MPDR beamformer can be expressed as

$$\hat{\mathbf{w}}_{\mathrm{MPDR}}(k) = \frac{\left({}^{\mathrm{MPDR}}\hat{\mathbf{R}}_{yy}(k)\right)^{-1}\mathbf{h}(k)}{\mathbf{h}^H(k)\left({}^{\mathrm{MPDR}}\hat{\mathbf{R}}_{yy}(k)\right)^{-1}\mathbf{h}(k)}. \tag{7}$$

It is well known that the MPDR beamformer is sensitive to the estimation error of the ATF, and the desired speech cancellation problem often occurs. To improve its robustness, the noisy PSD matrix should be replaced by the interference-plus-noise PSD matrix. For this purpose, one needs to distinguish the interference-plus-noise-only time-frequency bins from the noisy bins, or the speech presence probability in each time-frequency bin needs to be estimated before updating the interference-plus-noise PSD matrix.

In [26], the MLDR beamformer is derived, and its optimal weight vector has a form similar to that of the MPDR beamformer; the only difference is that the noisy PSD matrix of the MPDR beamformer is replaced by a weighted sample covariance matrix:

$${}^{\mathrm{CGD}}\hat{\mathbf{R}}_{yy}(k) = \sum_{l=1}^{L}\frac{\mathbf{y}(k,l)\mathbf{y}^H(k,l)}{\lambda_s(k,l)}, \tag{8}$$

where $\lambda_s(k,l) = E\{|S(k,l)|^2\}$ denotes the PSD of the desired speech in the $k$th bin of the $l$th frame. Note that $|S(k,l)|^2$ is unknown a priori and needs to be replaced by its estimate, i.e., $\hat{\lambda}_s(k,l) = |\hat{S}(k,l)|^2$ should be used in practical applications.

III. METHOD
A. MLDR with a Complex Generalized Gaussian Distribution
We assume that $S(k,l)$ follows a zero-mean complex generalized Gaussian distribution, whose probability density function can be expressed as

$$\rho(S(k,l)) = \frac{p}{\pi\gamma\,\Gamma(2/p)}\,e^{-|S(k,l)|^p/\gamma^{p/2}}, \tag{9}$$

where $\gamma > 0$ is the scale parameter, $p$ is the shape parameter of this complex generalized Gaussian distribution, and $\Gamma(\cdot)$ is the Gamma function. The complex generalized Gaussian family can be divided into three groups: super-Gaussian, Gaussian, and sub-Gaussian. In this letter, the super-Gaussian case is considered for speech applications, i.e., $0 < p < 2$. In this case, $\rho(S(k,l))$ can further be represented as a maximization over scaled Gaussian distributions with different variances:

$$\rho(S(k,l)) = \max_{\lambda_s(k,l) > 0}\; \mathcal{N}_{\mathbb{C}}\!\left(S(k,l); 0, \lambda_s(k,l)\right)\psi(\lambda_s(k,l)), \tag{10}$$

where $\mathcal{N}_{\mathbb{C}}(S(k,l); 0, \lambda_s(k,l))$ denotes a complex Gaussian distribution with zero mean and time-varying variance $\lambda_s(k,l)$, and $\psi(\cdot)$ denotes a scaling function related to $\lambda_s(k,l)$. With this model, the weight vector $\mathbf{w}(k)$ should be optimized by maximizing the following likelihood function:

$$\max_{\mathbf{w}(k)}\;\prod_{l=1}^{L}\;\max_{\lambda_s(k,l) > 0}\mathcal{N}_{\mathbb{C}}\!\left(S(k,l); 0, \lambda_s(k,l)\right)\psi(\lambda_s(k,l)) \quad \text{s.t.}\; \mathbf{w}^H(k)\mathbf{h}(k) = 1, \tag{11}$$

which is equivalent to minimizing the negative log-likelihood with respect to the weight vector $\mathbf{w}(k)$ and $\lambda_s(k,l)$:

$$\min_{\mathbf{w}(k),\,\lambda_s(k,l) > 0}\;\sum_{l=1}^{L}\left(\frac{|S(k,l)|^2}{\lambda_s(k,l)} + \log\left(\pi\lambda_s(k,l)\right) - \log\psi(\lambda_s(k,l))\right) \quad \text{s.t.}\; \mathbf{w}^H(k)\mathbf{h}(k) = 1. \tag{12}$$

It is worth noting that this optimization problem involves the desired speech $S(k,l)$, which is unknown a priori and is itself the estimation target of the beamformer. We therefore replace $S(k,l)$ with its estimate $\hat{S}(k,l)$.
A Lagrange multiplier method can be used to solve this optimization problem, and the cost function can be written as

$$J_k = \sum_{l=1}^{L} g\!\left(\mathbf{w}(k), \lambda_s(k,l)\right) + \alpha_k\left(\mathbf{w}^H(k)\mathbf{h}(k) - 1\right), \tag{13}$$

where

$$g\!\left(\mathbf{w}(k), \lambda_s(k,l)\right) = \frac{\big|\hat{S}(k,l)\big|^2}{\lambda_s(k,l)} + \log\left(\pi\lambda_s(k,l)\right) - \log\psi(\lambda_s(k,l)), \tag{14}$$

and $\alpha_k$ denotes the Lagrange multiplier. The optimization problem requires optimizing $\lambda_s(k,l)$ and $\mathbf{w}(k)$ simultaneously, which means that we cannot get a closed-form solution. In this letter, we propose an iterative optimization algorithm whose updating rules are obtained by setting the partial derivatives of the cost function with respect to the corresponding parameters to zero. Similar to [31], by setting the partial derivative of $J_k$ with respect to $\lambda_s(k,l)$ to zero, one gets

$$\lambda_s(k,l) = \frac{2\gamma^{p/2}}{p}\big|\hat{S}(k,l)\big|^{2-p}. \tag{15}$$

When $\lambda_s(k,l)$ is determined, $\mathbf{w}(k)$ is the solution of

$$\min_{\mathbf{w}(k)}\;\sum_{l=1}^{L}\frac{\big|\hat{S}(k,l)\big|^2}{\lambda_s(k,l)} \quad \text{s.t.}\; \mathbf{w}^H(k)\mathbf{h}(k) = 1. \tag{16}$$

Finally, the solution of (16) can be given by

$$\hat{\mathbf{w}}_{\mathrm{CGGD}}(k) = \frac{\left({}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}(k)\right)^{-1}\mathbf{h}(k)}{\mathbf{h}^H(k)\left({}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}(k)\right)^{-1}\mathbf{h}(k)}, \tag{17}$$

where

$${}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}(k) = \sum_{l=1}^{L}\frac{\mathbf{y}(k,l)\mathbf{y}^H(k,l)}{\lambda_s(k,l)} = \sum_{l=1}^{L}\frac{p\,\mathbf{y}(k,l)\mathbf{y}^H(k,l)}{2\gamma^{p/2}\big|\hat{S}(k,l)\big|^{2-p}}. \tag{18}$$

Because $\hat{\mathbf{w}}_{\mathrm{CGGD}}(k)$ is invariant to the constant scaling factor in (15), (18) can be further reduced to

$${}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}(k) = \sum_{l=1}^{L}\frac{\mathbf{y}(k,l)\mathbf{y}^H(k,l)}{\hat{\lambda}_s^{(1-p/2)}(k,l)}, \tag{19}$$

where $\hat{\lambda}_s(k,l) = |\hat{S}(k,l)|^2$ denotes the estimated PSD of the desired speech.
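The reduction from (18) to (19) uses only the fact that the distortionless solution is invariant to a positive scaling of the covariance matrix. Writing the constant in (15) as $c = 2\gamma^{p/2}/p$, a short check (our own restatement of this step) is:

```latex
\hat{\mathbf{w}}(k)
  = \frac{\big(c^{-1}\mathbf{R}\big)^{-1}\mathbf{h}(k)}
         {\mathbf{h}^{H}(k)\big(c^{-1}\mathbf{R}\big)^{-1}\mathbf{h}(k)}
  = \frac{\mathbf{R}^{-1}\mathbf{h}(k)}
         {\mathbf{h}^{H}(k)\mathbf{R}^{-1}\mathbf{h}(k)},
\qquad
\big|\hat{S}(k,l)\big|^{2-p} = \hat{\lambda}_s^{(1-p/2)}(k,l),
```

with $\hat{\lambda}_s(k,l) = |\hat{S}(k,l)|^2$, so the constant $c$ cancels and only the per-frame factor $\hat{\lambda}_s^{(1-p/2)}(k,l)$ remains in the denominator of (19).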
To avoid division by zero, a small positive floor value $\delta > 0$ should be introduced in the denominator to improve the stability of the proposed beamformer.

In the iterative optimization algorithm, starting from the initialization $\hat{\mathbf{w}}_{\mathrm{CGGD}}^{0}(k) = \hat{\mathbf{w}}_{\mathrm{MPDR}}(k)$, we keep updating $\hat{\lambda}_s(k,l)$ and $\hat{\mathbf{w}}_{\mathrm{CGGD}}(k)$ until reaching the maximum number of iterations. The whole procedure is summarized in Algorithm 1.

B. Relation to the MPDR Beamformer and its Variations
This part discusses the relationship between the proposed CGGD-MLDR beamformer and the MPDR beamformer together with its improved variations. The proposed CGGD-MLDR beamformer is a generalized MPDR beamformer: it reduces to several existing improved versions of the MPDR beamformer for different values of the shape parameter $p$ in (9). We can make the following observations:

1) When $p = 2$, the proposed CGGD-MLDR beamformer reduces to the well-known MPDR beamformer, since $\hat{\lambda}_s^{(1-p/2)}(k,l) \equiv 1$ and therefore ${}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}(k) \equiv \hat{\mathbf{R}}_{yy}(k)$.

Algorithm 1: CGGD-MLDR
Input: $\mathbf{y}(k,l)$, $p$, and maximum iteration number $I$
Output: $\hat{\mathbf{w}}_{\mathrm{CGGD}}^{I}(k)$ and $\hat{S}^{I}(k,l)$
Initialize: $\hat{\mathbf{w}}_{\mathrm{CGGD}}^{0}(k) = \hat{\mathbf{w}}_{\mathrm{MPDR}}(k)$
for $i = 0, 1, \ldots, I-1$ do
  for $l = 1, 2, \ldots, L$ do
    Compute $\hat{S}^{i}(k,l) = \left(\hat{\mathbf{w}}_{\mathrm{CGGD}}^{i}(k)\right)^{H}\mathbf{y}(k,l)$
    Update $\hat{\lambda}_s^{i+1}(k,l) = \big|\hat{S}^{i}(k,l)\big|^{2-p}$
    Accumulate ${}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}^{i+1}(k) \leftarrow {}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}^{i+1}(k) + \mathbf{y}(k,l)\mathbf{y}^H(k,l)\big/\hat{\lambda}_s^{i+1}(k,l)$
  Update $\hat{\mathbf{w}}_{\mathrm{CGGD}}^{i+1}(k) = \dfrac{\left({}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}^{i+1}(k)\right)^{-1}\mathbf{h}(k)}{\mathbf{h}^H(k)\left({}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}^{i+1}(k)\right)^{-1}\mathbf{h}(k)}$
return $\hat{\mathbf{w}}_{\mathrm{CGGD}}^{I}(k)$ and $\hat{S}^{I}(k,l)$
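Algorithm 1 can be sketched in NumPy as follows (a minimal illustration under our own assumptions, not the authors' code; the MPDR initialization uses a small hypothetical diagonal loading, and `floor` plays the role of the stabilizing constant $\delta$):

```python
import numpy as np

def cggd_mldr(Y, h, p=0.5, n_iter=5, floor=1e-8):
    """Sketch of Algorithm 1 (CGGD-MLDR) for one frequency bin.

    Y : (M, L) complex snapshots; h : (M,) ATF vector; 0 < p < 2.
    Each iteration re-estimates S_hat, the per-frame weights
    lambda_hat^(1 - p/2) = |S_hat|^(2 - p), and the weighted
    covariance of Eq. (19). Returns the weights and the beamformer output.
    """
    M, L = Y.shape

    def distortionless(R):
        # w = R^{-1} h / (h^H R^{-1} h), Eq. (17)
        Rinv_h = np.linalg.solve(R, h)
        return Rinv_h / (h.conj() @ Rinv_h)

    # MPDR initialization; 1e-6 diagonal loading is our own safeguard.
    w = distortionless(Y @ Y.conj().T / L + 1e-6 * np.eye(M))
    for _ in range(n_iter):
        S_hat = w.conj() @ Y                              # (L,) output frames
        lam = np.maximum(np.abs(S_hat) ** 2, floor)       # PSD estimate, floored
        R = (Y / lam ** (1.0 - p / 2.0)) @ Y.conj().T     # Eq. (19)
        w = distortionless(R + 1e-6 * np.eye(M))
    return w, w.conj() @ Y
```

With `p=2.0` the per-frame weights are identically one and each iteration simply reproduces the MPDR solution, matching observation 1) above.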
2) When $p = 0$, the proposed CGGD-MLDR beamformer becomes the newly proposed MLDR beamformer [26], since ${}^{\mathrm{CGGD}}\hat{\mathbf{R}}_{yy}(k) \equiv {}^{\mathrm{CGD}}\hat{\mathbf{R}}_{yy}(k)$ for $p = 0$.

3) When $p$ is a positive value, for narrowband applications, the proposed CGGD-MLDR beamformer has the same form as the MDDR beamformer derived via $\ell_p$-norm minimization [27]. Note that the proposed CGGD-MLDR beamformer gives clear guidelines for choosing the shape parameter $p$ in (9), because it is derived from maximum likelihood theory and $p$ relates to the distribution of the desired speech.

C. Mechanisms of the Proposed CGGD-MLDR Beamformer in Improving Robustness
We assume that there are $L_1$ interference-plus-noise-only frames among all $L$ frames, where $L = L_1 + L_2$ with $L_2$ the number of frames containing desired speech. We further assume that $L_1$ and $L_2$ are large enough that the cross-correlation between $\mathbf{x}(k,l)$ and $\mathbf{v}(k,l)$ can be ignored and that $\lambda_s(k,l)$ and $\lambda_v(k,l)$ do not change over time. Accordingly, (19) becomes

$${}^{\mathrm{CGGD}}\mathbf{R}_{yy}(k) = L_2\left(\lambda_s(k)\right)^{p/2}\boldsymbol{\Upsilon}_{ss}(k) + \left(L_1\rho(k) + \frac{L_2\left(\lambda_s(k)\right)^{p/2}}{\varepsilon(k)}\right)\boldsymbol{\Upsilon}_{vv}(k), \tag{20}$$

where $\rho(k) = \lambda_v(k)\big/\delta^{(1-p/2)}$ and $\varepsilon(k) = \lambda_s(k)/\lambda_v(k)$ is the input SINR. One can see that ${}^{\mathrm{CGGD}}\mathbf{R}_{yy}(k)$ is a linear combination of $\boldsymbol{\Upsilon}_{ss}(k)$ and $\boldsymbol{\Upsilon}_{vv}(k)$; we further define the ratio of the two combination coefficients:

$$r_p(k) = \frac{L_1\rho(k) + L_2\left(\lambda_s(k)\right)^{p/2}\big/\varepsilon(k)}{L_2\left(\lambda_s(k)\right)^{p/2}}. \tag{21}$$

When $p = 2$, we have $r_2(k) = L/(L_2\varepsilon(k))$, and thus $r_2(k)$ is determined by the input SINR and the number of desired speech frames among all $L$ noisy frames. Note that the smaller $r_2(k)$ is, the more sensitive the MPDR beamformer is. When $p = 0$, we have $r_0(k) = \left(L_1\rho(k) + L_2/\varepsilon(k)\right)/L_2$. Obviously, when $\rho(k)\varepsilon(k) \ge 1$, i.e., $\lambda_s(k) \ge \delta$, $r_0(k) \ge r_2(k)$ always holds. This should be the reason that the MLDR beamformer can improve the robustness of the MPDR beamformer.¹

¹In [32], the MLDR was also called the weighted MPDR (wMPDR).

For arbitrary values of $p \in [0, 2)$, one can easily
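The ordering of the coefficient ratios can be checked numerically; the sketch below evaluates (21) directly (all parameter values are our own test numbers, not taken from the letter's experiments):

```python
def ratio_r(p, L1, L2, lam_s, lam_v, delta):
    """Coefficient ratio r_p(k) of Eq. (21).

    L1, L2 : number of noise-only / speech-containing frames.
    lam_s  : desired-speech PSD lambda_s(k).
    lam_v  : interference-plus-noise PSD lambda_v(k).
    delta  : small positive floor, with rho(k) = lam_v / delta**(1 - p/2).
    """
    rho = lam_v / delta ** (1.0 - p / 2.0)
    eps = lam_s / lam_v                           # input SINR epsilon(k)
    num = L1 * rho + L2 * lam_s ** (p / 2.0) / eps
    return num / (L2 * lam_s ** (p / 2.0))
```

For example, with $\lambda_s = 1 \ge \delta = 10^{-3}$ the ratio for $p = 0.5$ exceeds the $p = 2$ (MPDR) value, while for $\lambda_s < \delta$ the ordering flips, in line with the condition $\lambda_s(k) \ge \delta$ derived in the text.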
[Fig. 1. Performance evaluation. (a) PESQ improvements versus the number of iterations. (b) PESQ improvements versus the input SINR. (c) PESQ improvements versus the reverberation time. "PRO" represents the proposed CGGD-MLDR beamformer for simplicity.]

derive that $r_p(k) \ge r_2(k)$ holds if and only if $\lambda_s(k) \ge \delta$, which means that the CGGD-MLDR beamformer is always more robust than the MPDR beamformer, since $\delta$ is only a small positive value as mentioned above.

IV. EXPERIMENTAL RESULTS
This section evaluates the performance of the proposed CGGD-MLDR beamformer and compares it with the MPDR beamformer and the newly developed MLDR beamformer. The performance of the oracle minimum variance distortionless response (MVDR) beamformer is also presented to show the theoretical limit, where the interference-plus-noise PSD matrix is assumed to be known exactly. Ten 20 s speech signals are taken from the TIMIT corpus [33], and the babble noise is chosen from the NOISEX-92 database [34] and split into multiple segments as interferences. The room impulse responses are generated using the image method [35], in a room of size m × m × m. The reverberation time ranges from 0 to 640 ms in intervals of 160 ms. We consider a uniform linear array with microphones at cm inter-sensor spacing, placed at the center of the room. The desired speech source is m away from the array center, propagating from θ = 0◦, and two interferences propagate from ◦ and −◦, respectively. In this evaluation, the PESQ improvement [36] is chosen as the objective measure.

A. Performance versus Iteration
In the first experiment, we examine the performance of the beamformers versus the number of iterations, with input SINR = 0 dB and a reverberation time of 160 ms. From Fig. 1(a), one can see that the choice of $p$ strongly affects the performance of the CGGD-MLDR beamformer. Among the tested values, $p = 0.5$ yields the highest PESQ improvement and gradually converges to the oracle MVDR, while the MPDR has the lowest PESQ improvement. Moreover, one can see that the CGGD-MLDR beamformer with $p = 0.5$ provides much higher PESQ improvement within only very few iterations, e.g., 2 to 3, which means that it converges much faster than the MLDR beamformer.

B. Performance versus Input SINR
In the following experiments, we only present the results of the proposed beamformer with $p = 0.5$, its best-performing setting, and compare it with the MLDR and MPDR beamformers. Fig. 1(b) plots the PESQ improvements versus the input SINR, ranging from −5 dB to 10 dB, with a reverberation time of 160 ms. We can see that the PESQ improvement decreases as the input SINR increases for all beamformers, and that the proposed CGGD-MLDR beamformer with $p = 0.5$ is much better than the other beamformers. For the MPDR beamformer, the PESQ improvement can even be negative, because the target speech cancellation problem occurs in high input SINR conditions.

C. Performance versus Reverberation Time
In the third experiment, we study the performance of the CGGD-MLDR beamformer with RT = [0, 160, 320, 480, 640] ms. Fig. 1(c) shows that, in low-RT scenarios, e.g., 0 to 160 ms, the proposed beamformer is significantly better than the MLDR beamformer, while as the RT increases, the performance of all beamformers decreases and becomes similar. It is well known that, as the reverberation time increases, the MPDR beamformer and its variations degrade because the directional interferences have multiple strong reflections, which cannot be well suppressed by imposing nulls.

V. CONCLUSION
By introducing a complex generalized Gaussian distribution to model the desired speech, we derive the CGGD-MLDR beamformer with the maximum likelihood criterion, which is a generalization of the MPDR and MLDR beamformers. By properly choosing the shape parameter $p$, the proposed beamformer can achieve better performance than the MLDR beamformer in terms of PESQ. The most attractive aspect is that the proposed beamformer with $p = 0.5$ converges much faster than the MLDR beamformer, so the computational complexity can be decreased dramatically, because the proposed beamformer needs far fewer iterations to reach the same performance as the MLDR beamformer. Future work can concentrate on jointly optimizing denoising and dereverberation with the complex generalized Gaussian distribution of the desired speech.

REFERENCES

[1] J. Capon, "High-resolution frequency-wavenumber spectrum analysis,"
Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.
[2] H. Cox, "Resolving power and sensitivity to mismatch of optimum array processors," The Journal of the Acoustical Society of America, vol. 54, no. 3, pp. 771–785, 1973.
[3] L. Ehrenberg, S. Gannot, A. Leshem, and E. Zehavi, "Sensitivity analysis of MVDR and MPDR beamformers," IEEE, 2010, pp. 000416–000420.
[4] L. Griffiths and C. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982.
[5] S. Gannot and I. Cohen, "Speech enhancement based on the general transfer function GSC and postfiltering," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 6, pp. 561–571, 2004.
[6] R. Talmon, I. Cohen, and S. Gannot, "Convolutive transfer function generalized sidelobe canceler," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1420–1434, 2009.
[7] A. Spriet, M. Moonen, and J. Wouters, "Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction," Signal Processing, vol. 84, no. 12, pp. 2367–2387, 2004.
[8] S. Doclo, A. Spriet, J. Wouters, and M. Moonen, "Speech distortion weighted multichannel Wiener filtering techniques for noise reduction," in Speech Enhancement, pp. 199–228, Springer, 2005.
[9] M. Souden, J. Benesty, and S. Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 260–276, 2009.
[10] E. A. Habets, J. Benesty, and P. A. Naylor, "A speech distortion and interference rejection constraint beamformer," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 854–867, 2011.
[11] L. Wang, T. Gerkmann, and S. Doclo, "Noise power spectral density estimation using MaxNSR blocking matrix," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1493–1508, 2015.
[12] C. Zheng, A. Deleforge, X. Li, and W. Kellermann, "Statistical analysis of the multichannel Wiener filter using a bivariate normal distribution for sample covariance matrices," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 5, pp. 951–966, 2018.
[13] J. Benesty, C. Paleologu, C.-C. Oprea, and S. Ciochina, "An iterative multichannel Wiener filter based on a Kronecker product decomposition," IEEE, pp. 211–215.
[14] I. Cohen, "Relative transfer function identification using speech signals," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 451–459, 2004.
[15] S. Markovich, S. Gannot, and I. Cohen, "Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1071–1086, 2009.
[16] K. Reindl, S. Markovich-Golan, H. Barfuss, S. Gannot, and W. Kellermann, "Geometrically constrained TRINICON-based relative transfer function estimation in underdetermined scenarios," IEEE, 2013, pp. 1–4.
[17] C. Pan, J. Chen, and J. Benesty, "Performance study of the MVDR beamformer as a function of the source incidence angle," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 67–79, 2013.
[18] S. Markovich-Golan, S. Gannot, and W. Kellermann, "Performance analysis of the covariance-whitening and the covariance-subtraction methods for estimating the relative transfer function," IEEE, 2018, pp. 2499–2503.
[19] Y. Hu, T. D. Abhayapala, P. N. Samarasinghe, and S. Gannot, "Decoupled direction-of-arrival estimations using relative harmonic coefficients," IEEE, 2020, pp. 246–250.
[20] R. C. Hendriks and T. Gerkmann, "Noise correlation matrix estimation for multi-microphone speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 223–233, 2011.
[21] J. H. Ko, J. Fromm, M. Philipose, I. Tashev, and S. Zarar, "Limiting numerical precision of neural networks to achieve real-time voice activity detection," IEEE, 2018, pp. 2236–2240.
[22] Y. Gu and A. Leshem, "Robust adaptive beamforming based on interference covariance matrix reconstruction and steering vector estimation," IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3881–3885, 2012.
[23] M. Taseska and E. A. Habets, "Nonstationary noise PSD matrix estimation for multichannel blind speech extraction," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2223–2236, 2017.
[24] A. I. Koutrouvelis, R. C. Hendriks, R. Heusdens, and J. Jensen, "Robust joint estimation of multimicrophone signal model parameters," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1136–1150, 2019.
[25] C. Pan, J. Chen, and G. Shi, "On estimation of time-varying variances of source and noise for sensor array processing," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, p. 2865, 2020.
[26] B. J. Cho, J.-M. Lee, and H.-M. Park, "A beamforming algorithm based on maximum likelihood of a complex Gaussian distribution with time-varying variances for robust speech recognition," IEEE Signal Processing Letters, vol. 26, no. 9, pp. 1398–1402, 2019.
[27] X. Jiang, W.-J. Zeng, A. Yasotharan, H. C. So, and T. Kirubarajan, "Minimum dispersion beamforming for non-Gaussian signals," IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1879–1893, 2014.
[28] L. Zhang, B. Li, L. Huang, T. Kirubarajan, and H. C. So, "Robust minimum dispersion distortionless response beamforming against fast-moving interferences," Signal Processing, vol. 140, pp. 190–197, 2017.
[29] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, "Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 6, pp. 1741–1752, 2007.
[30] R. C. Hendriks, R. Heusdens, U. Kjems, and J. Jensen, "On optimal multichannel mean-squared error estimators for speech enhancement," IEEE Signal Processing Letters, vol. 16, no. 10, pp. 885–888, 2009.
[31] A. Jukić, T. van Waterschoot, T. Gerkmann, and S. Doclo, "Multi-channel linear prediction-based speech dereverberation with sparse priors," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1509–1520, 2015.
[32] T. Nakatani, C. Boeddeker, K. Kinoshita, R. Ikeshita, M. Delcroix, and R. Haeb-Umbach, "Jointly optimal denoising, dereverberation, and source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2267–2282, 2020.
[33] J. S. Garofolo, "TIMIT acoustic phonetic continuous speech corpus," Linguistic Data Consortium, 1993.
[34] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247–251, 1993.
[35] E. A. Habets, "Room impulse response generator," Technische Universiteit Eindhoven, Tech. Rep., vol. 2, no. 2.4, p. 1, 2006.
[36] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No. 01CH37221).