RTF-steered binaural MVDR beamforming incorporating multiple external microphones
2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 20-23, 2019, New Paltz, NY
Nico Gößling, Wiebke Middelberg, Simon Doclo
University of Oldenburg, Department of Medical Physics and Acoustics and Cluster of Excellence Hearing4all, Oldenburg, Germany
[email protected]
ABSTRACT
The binaural minimum-variance distortionless-response (BMVDR) beamformer is a well-known noise reduction algorithm that can be steered using the relative transfer function (RTF) vector of the desired speech source. Exploiting the availability of an external microphone that is spatially separated from the head-mounted microphones, an efficient method has recently been proposed to estimate the RTF vector in a diffuse noise field. When multiple external microphones are available, different RTF vector estimates can be obtained by using this method for each external microphone. In this paper, we propose several procedures to combine these RTF vector estimates, either by selecting the estimate corresponding to the highest input SNR, by averaging the estimates, or by combining the estimates in order to maximize the output SNR of the BMVDR beamformer. Experimental results for a moving speaker and diffuse noise in a reverberant environment show that the output SNR-maximizing combination yields the largest binaural SNR improvement and also outperforms the state-of-the-art covariance whitening method.
Index Terms — binaural noise reduction, relative transfer function, external microphones, hearing devices
1. INTRODUCTION
Noise reduction algorithms for head-mounted assistive listening devices (e.g., hearing aids, earbuds, headsets) are crucial to improve speech intelligibility and speech quality in noisy environments. Binaural noise reduction algorithms, which exploit the information captured by all microphones on both sides of the head [1, 2], not only reduce unwanted sound sources but also preserve the listener's spatial impression of the acoustic scene. As a well-known example, the binaural minimum-variance distortionless-response (BMVDR) beamformer is able to preserve the binaural cues (i.e., the interaural time and level differences) of a desired speech source [1-3]. For a moving speech source in a reverberant environment, the BMVDR can be steered using the relative transfer functions (RTFs) [4], which relate the acoustic transfer functions between the desired speech source and all microphones to the so-called reference microphones.

To improve the performance of (binaural) algorithms in terms of noise reduction and source localization accuracy, it has been proposed to use an external microphone in conjunction with the head-mounted microphones [5-12]. For a diffuse noise field, an efficient RTF vector estimation method has been proposed in [11],
which exploits the spatial coherence (SC) properties of the noise field. More specifically, the SC method assumes that the noise component in the external microphone signal is uncorrelated with the noise components in the head-mounted microphone signals.

In this paper, we consider the more general scenario with multiple external microphones. Using the SC method, each external microphone yields a (different) RTF vector estimate, such that the question arises how to combine these RTF vector estimates. In the first procedure, we propose to select the RTF vector estimate corresponding to the external microphone with the highest narrowband signal-to-noise ratio (SNR). In the second procedure, we propose to simply average the different RTF vector estimates. In the third procedure, we propose to linearly combine the different RTF vector estimates such that the narrowband output SNR of the BMVDR is maximized. Experimental results of an on-line implementation of the BMVDR using recorded signals of a moving speaker and diffuse noise in a reverberant environment are provided. The results show that the output SNR-maximizing combination of the SC-based RTF vector estimates leads to the largest binaural SNR improvement compared to the other procedures and the state-of-the-art covariance whitening method [13, 14].

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project ID 352015383 (SFB 1330 B2) and Project ID 390895286 (EXC 2177/1).
2. CONFIGURATION AND NOTATION
Consider the binaural hearing device configuration depicted in Figure 1, consisting of a left and a right hearing device (each equipped with M_D microphones), and M_E external microphones that are spatially separated from the head-mounted microphones, i.e., M = 2 M_D + M_E microphones in total. In the frequency domain, the m-th microphone signal of the left device can be written as

y_{L,m}(ω) = x_{L,m}(ω) + n_{L,m}(ω),   m ∈ {1, ..., M_D},   (1)

with x_{L,m}(ω) the desired speech component and n_{L,m}(ω) the noise component. For the sake of conciseness, the frequency variable ω will be omitted in the remainder of the paper. The m-th microphone signal of the right device, y_{R,m}, and the i-th external microphone signal, y_{E,i}, are defined similarly to (1). The M-dimensional microphone signal vector, containing all microphone signals, is defined as

y = [y_{L,1}, ..., y_{L,M_D}, y_{R,1}, ..., y_{R,M_D}, y_{E,1}, ..., y_{E,M_E}]^T,   (2)

with (·)^T denoting the transpose operator. Using (1), the vector y can be written as

y = x + n,   (3)

where the speech vector x and the noise vector n are defined similarly to (2). Without loss of generality, the first microphone
on each device is chosen as the reference microphone, i.e.,

y_L = y_{L,1} = e_L^T y,   y_R = y_{R,1} = e_R^T y,   (4)

where e_L and e_R denote selection vectors consisting of zeros and a single element equal to 1. Assuming a single desired speech source, the vector x can be written as

x = a_L x_L = a_R x_R,   (5)

where a_L and a_R denote the M-dimensional RTF vectors of the desired speech source with respect to the reference microphones on the left and the right device, respectively. It should be noted that the element of each RTF vector corresponding to its reference microphone is equal to 1 and that the RTF vectors are related as a_R = a_L / (e_R^T a_L). The noisy input covariance matrix R_y, the speech covariance matrix R_x and the noise covariance matrix R_n are defined as

R_y = E{y y^H},   R_x = E{x x^H},   R_n = E{n n^H},   (6)

where E{·} denotes the expectation operator and (·)^H denotes the conjugate transpose. Assuming statistical independence between the desired speech component and the noise component, the noisy input covariance matrix is equal to R_y = R_x + R_n.

The output signals of the left and right devices are calculated by filtering and summing all microphone signals, i.e., the head-mounted microphone signals as well as the external microphone signals, using the complex-valued filter vectors w_L and w_R (see Figure 1), i.e.,

z_L = w_L^H y,   z_R = w_R^H y.   (7)

The input SNR of the m-th microphone signal is given by the ratio of the input power spectral density (PSD) of the desired speech component and the input PSD of the noise component, i.e.,

SNR_m^in = (e_m^T R_x e_m) / (e_m^T R_n e_m) = (e_m^T R_y e_m) / (e_m^T R_n e_m) − 1,   (8)

with e_m an M-dimensional vector selecting the element corresponding to the m-th microphone.
Similarly, the output SNRs of the left and the right output signals are given by the ratio of the output PSD of the desired speech component and the output PSD of the noise component, i.e.,

SNR_L^out = (w_L^H R_x w_L) / (w_L^H R_n w_L),   SNR_R^out = (w_R^H R_x w_R) / (w_R^H R_n w_R).   (9)
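As a numerical illustration of the SNR definitions in (8) and (9), the following NumPy sketch builds toy covariance matrices (a rank-1 speech covariance for a single source plus a full-rank noise covariance; all sizes and names are illustrative, not taken from the recordings used later in the paper) and evaluates both quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 7  # M = 2*M_D + M_E, e.g. M_D = 2 head-mounted mics per device, M_E = 3 external mics

# Toy covariance matrices: R_x is rank-1 (single speech source), R_n is full rank
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # toy propagation vector
R_x = 2.0 * np.outer(a, a.conj())                          # speech covariance, eq. (6)
N = rng.standard_normal((M, 4 * M)) + 1j * rng.standard_normal((M, 4 * M))
R_n = (N @ N.conj().T) / (4 * M)                           # noise covariance, eq. (6)
R_y = R_x + R_n                                            # noisy input covariance

def input_snr(R_y, R_n, m):
    """Narrowband input SNR of mic m, eq. (8): e_m^T R_y e_m / e_m^T R_n e_m - 1."""
    return (R_y[m, m] / R_n[m, m]).real - 1.0

def output_snr(w, R_x, R_n):
    """Narrowband output SNR of a filter w, eq. (9)."""
    return ((w.conj() @ R_x @ w) / (w.conj() @ R_n @ w)).real
```

Since R_y = R_x + R_n, the input SNR can be evaluated from R_y and R_n alone, which is exactly what the second expression in (8) exploits.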
3. BINAURAL MVDR BEAMFORMER
The BMVDR [2, 15] aims at minimizing the output noise PSD while preserving the desired speech component in the reference microphone signals (x_L and x_R), hence preserving the binaural cues of the desired speech source. The optimization problem for the left filter vector w_L is given by

min_{w_L} w_L^H R_n w_L   subject to   w_L^H a_L = 1.   (10)

The optimization problem for the right filter vector w_R is defined similarly. The filter vectors solving the optimization problems are equal to [1, 2, 15]

w_L = R_n^{-1} a_L / (a_L^H R_n^{-1} a_L),   w_R = R_n^{-1} a_R / (a_R^H R_n^{-1} a_R).   (11)

Figure 1: Binaural hearing device configuration incorporating multiple external microphones.

Hence, estimates of the noise covariance matrix R_n and the RTF vectors a_L and a_R are required to compute the BMVDR filter vectors in practice. Typically, the noise covariance matrix R_n is recursively estimated from the microphone signals during speech pauses, e.g., based on a voice activity detector or speech presence probability [16].

The following sections describe different methods to estimate the RTF vectors a_L and a_R. Section 4 describes the covariance whitening method, which is a state-of-the-art RTF vector estimation method for a general noise field. In Section 5 we propose RTF vector estimation methods that assume that the noise component in each external microphone signal is uncorrelated with the noise components in all other microphone signals.
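A minimal sketch of how the closed-form solution (11) can be evaluated, assuming estimates of R_n and the RTF vector are available; using a linear solve instead of an explicit matrix inverse is a standard numerical choice, not something prescribed by the paper:

```python
import numpy as np

def bmvdr_filter(R_n, a):
    """BMVDR filter for one reference mic, eq. (11): w = R_n^{-1} a / (a^H R_n^{-1} a)."""
    Rn_inv_a = np.linalg.solve(R_n, a)      # R_n^{-1} a without forming the inverse
    return Rn_inv_a / (a.conj() @ Rn_inv_a)

# Toy example with a random Hermitian positive-definite noise covariance
rng = np.random.default_rng(1)
M = 7
N = rng.standard_normal((M, 4 * M)) + 1j * rng.standard_normal((M, 4 * M))
R_n = (N @ N.conj().T) / (4 * M)
a_L = rng.standard_normal(M) + 1j * rng.standard_normal(M)
a_L = a_L / a_L[0]          # RTF convention: reference element equals 1
w_L = bmvdr_filter(R_n, a_L)
```

The filter satisfies the distortionless constraint w_L^H a_L = 1 and, among all filters satisfying it, has the lowest output noise PSD.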
4. COVARIANCE WHITENING METHOD
The covariance whitening (CW) method [13, 14] is based on the generalized eigenvalue decomposition of the noisy input covariance matrix R_y and the noise covariance matrix R_n. Using the Cholesky decomposition of the noise covariance matrix, i.e.,

R_n = R_n^{H/2} R_n^{1/2},   (12)

the pre-whitened noisy input covariance matrix is defined as

R_y^w = R_n^{-H/2} R_y R_n^{-1/2}.   (13)

Using (12) and (13), the left RTF vector can be estimated as [14]

â_L^{CW} = R_n^{H/2} p / (e_L^T R_n^{H/2} p),   (14)

with p = P{R_y^w} the principal eigenvector (corresponding to the largest eigenvalue) of the pre-whitened noisy input covariance matrix R_y^w. Due to the Cholesky decomposition and the M × M-dimensional eigenvalue decomposition (EVD), the CW method typically has a rather large computational complexity, especially for large M.
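The CW estimator (12)-(14) can be sketched as follows, assuming `numpy.linalg.cholesky` returns the lower-triangular factor L with R_n = L L^H, so that L plays the role of R_n^{H/2} (the function name and the reference-index convention are ours):

```python
import numpy as np

def rtf_cw(R_y, R_n, ref=0):
    """Covariance-whitening RTF estimate, eqs. (12)-(14)."""
    L = np.linalg.cholesky(R_n)             # R_n = L L^H, i.e. L = R_n^{H/2}
    Linv = np.linalg.inv(L)
    R_y_w = Linv @ R_y @ Linv.conj().T      # pre-whitened noisy covariance, eq. (13)
    _, V = np.linalg.eigh(R_y_w)            # eigh: eigenvalues in ascending order
    p = V[:, -1]                            # principal eigenvector
    a = L @ p                               # de-whitening, eq. (14)
    return a / a[ref]                       # normalize to the reference microphone
```

In the idealized case R_y = φ a a^H + R_n, the whitened matrix is φ (L^{-1}a)(L^{-1}a)^H + I, whose principal eigenvector is proportional to L^{-1}a; de-whitening and normalizing therefore recovers the true RTF vector exactly.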
5. SPATIAL COHERENCE METHOD
In this section, we propose RTF vector estimation methods that assume that the noise component in each external microphone signal is uncorrelated with the noise components in all other microphone signals. This can, e.g., be assumed for a diffuse noise field when the external microphones are spatially separated from each other and from the head-mounted microphones. In Section 5.1, we review the SC method as presented in [11] for one external microphone. In Section 5.2, we propose three different procedures to linearly combine the RTF vector estimates obtained by using the SC method for each external microphone.
5.1. SC method using one external microphone

If the noise component in the i-th external microphone signal is uncorrelated with the noise components in all other microphone signals, it has been shown in [11, 12] that the left RTF vector can be efficiently estimated from the noisy input covariance matrix R_y using the SC method as

â_L^{SC-i} = R_y e_{E,i} / (e_L^T R_y e_{E,i}),   i ∈ {1, ..., M_E},   (15)

with e_{E,i} an M-dimensional vector selecting the element corresponding to the i-th external microphone. The estimator in (15) yields an unbiased RTF vector estimate, except for a biased estimate of the RTF corresponding to the i-th external microphone itself. However, it has been shown in [12] that this bias is real-valued (hence not affecting the phase of the RTF vector estimate), depends on the input SNR of the i-th external microphone and can typically be neglected in practice.

5.2. Combination of multiple RTF vector estimates

Since in practice an estimate of the noisy input covariance matrix, R̂_y, is used in (15), typically M_E different SC-based RTF vector estimates are obtained, such that the question arises how to use these estimates. In this paper, we propose to linearly combine the different RTF vector estimates (per frequency) and to use the resulting RTF vector in the BMVDR. The (normalized) combined RTF vector estimate is given by

â_L^{SC-C} = A_L^{SC} c / (e_L^T A_L^{SC} c),   (16)

with A_L^{SC} an M × M_E-dimensional matrix containing the M_E SC-based RTF vector estimates, i.e.,

A_L^{SC} = [â_L^{SC-1}, ..., â_L^{SC-M_E}],   (17)

and c an M_E-dimensional (complex-valued) combination vector. Please note that the combination closest to the true RTF vector a_L could be obtained by orthogonally projecting a_L onto the column space of A_L^{SC}, which is obviously not possible in practice.
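Equation (15) amounts to reading out one column of the noisy input covariance matrix and normalizing it by its reference entry, which is what makes the SC method so cheap compared to the CW method. A sketch (the function name and index conventions are ours, not from the paper):

```python
import numpy as np

def rtf_sc(R_y, i_ext, ref=0):
    """SC-based RTF estimate from one external microphone, eq. (15):
    a = R_y e_{E,i} / (e_L^T R_y e_{E,i})."""
    col = R_y[:, i_ext]      # R_y e_{E,i}: column of the noisy covariance matrix
    return col / col[ref]    # normalize to the reference microphone
```

If the noise in the selected external microphone is uncorrelated with all other noise components, only the entry of the estimate corresponding to that external microphone is biased, by the real-valued term discussed above; all other entries match the true RTF vector.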
In the following, we hence propose three different procedures to determine the combination vector c in practice.

The first procedure, denoted as iSNR, is to select the RTF vector estimate (per frequency) corresponding to the external microphone with the highest narrowband input SNR, similarly to [17]. Due to (8), this only requires an estimate of R_y (and not R_x), i.e.,

c^{iSNR} = e_{E,î},   î = argmax_i (e_{E,i}^T R_y e_{E,i}) / (e_{E,i}^T R_n e_{E,i}).   (18)

Especially for a dynamic acoustic scenario with a moving speaker, the iSNR-based selection procedure is expected to outperform the SC method using only one external microphone.

Assuming a uniform distribution of the estimation errors of the SC-based RTF vector estimates, in the second procedure, denoted as AV, we propose to simply average the estimates, i.e.,

c^{AV} = [1/M_E, ..., 1/M_E]^T.   (19)

Intuitively, this procedure is sub-optimal, especially when the estimation errors are very different.

As a more sophisticated procedure, denoted as mSNR, we propose to combine the SC-based RTF vector estimates (per frequency) such that the narrowband output SNR of the BMVDR is maximized. Using (16) in (11), the left output SNR in (9) can be written as the generalized Rayleigh quotient

SNR_{BMVDR,L}^{out} = (c^H Λ_1 c) / (c^H Λ_2 c) − 1,   (20)

with

Λ_1 = (A_L^{SC})^H R_n^{-1} R_y R_n^{-1} A_L^{SC},   (21)
Λ_2 = (A_L^{SC})^H R_n^{-1} A_L^{SC}.   (22)

Aiming at maximizing the output SNR of the BMVDR, the SNR-maximizing combination vector c^{mSNR} is equal to the principal eigenvector of the M_E × M_E-dimensional matrix Λ_2^{-1} Λ_1, i.e.,

c^{mSNR} = argmax_c SNR_{BMVDR,L}^{out} = P{Λ_2^{-1} Λ_1},   (23)

which hence also only requires an estimate of R_y (and not R_x). Although constructing the matrices Λ_1 and Λ_2 comes with some computational complexity, the M_E × M_E-dimensional EVD is always cheaper than the M × M-dimensional EVD required for the CW method (cf. Section 4).
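The three combination procedures (18), (19) and (23) can be sketched as follows. This is a toy sketch under our own conventions: `A` is the M × M_E matrix of SC-based RTF estimates, `ext_idx` holds the stacked-vector indices of the external microphones, and the reference microphone is index 0.

```python
import numpy as np

def combine_av(A, ref=0):
    """AV: plain average of the RTF vector estimates, eq. (19)."""
    a = A.mean(axis=1)
    return a / a[ref]

def combine_isnr(A, R_y, R_n, ext_idx, ref=0):
    """iSNR: pick the estimate of the external mic with the highest input SNR, eq. (18)."""
    snrs = [(R_y[m, m] / R_n[m, m]).real for m in ext_idx]
    a = A[:, int(np.argmax(snrs))]
    return a / a[ref]

def combine_msnr(A, R_y, R_n, ref=0):
    """mSNR: output-SNR-maximizing combination, eqs. (20)-(23)."""
    B = np.linalg.solve(R_n, A)          # R_n^{-1} A
    Lam1 = B.conj().T @ R_y @ B          # Λ_1, eq. (21)
    Lam2 = A.conj().T @ B                # Λ_2, eq. (22)
    vals, vecs = np.linalg.eig(np.linalg.solve(Lam2, Lam1))
    c = vecs[:, np.argmax(vals.real)]    # principal eigenvector, eq. (23)
    a = A @ c
    return a / a[ref]
```

By construction, the mSNR combination maximizes the Rayleigh quotient (20) over all vectors in the column space of A, so its resulting output SNR can never fall below that of the AV or iSNR combinations.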
6. EXPERIMENTAL RESULTS
For a dynamic acoustic scenario with a moving speaker in a reverberant room, in this section we compare the performance of the BMVDR using the different RTF vector estimation methods described in Sections 4 and 5 for a binaural hearing device incorporating three external microphones.
6.1. Setup

All signals were recorded in a laboratory where the reverberation time can be varied using absorber panels mounted on the walls and the ceiling. The room dimensions are about (7 × 6 × 2.7) m and the reverberation time was set to approximately 400 ms. A KEMAR dummy head was placed approximately in the center of the room with two behind-the-ear (BTE) hearing devices mounted to the ears. Two microphones per hearing device, i.e., M_D = 2, with an inter-microphone distance of about […] were used. In addition, M_E = 3 external microphones were placed in front of the dummy head as depicted in Figure 2. Hence, in total M = 7 microphones were used for the BMVDR. The desired speech source was a male speaker, walking from the first external microphone (E1) to the third external microphone (E3) while speaking ten German sentences with pauses of about half a second between the sentences. Pseudo-diffuse background noise was generated using four loudspeakers facing the corners of the laboratory, playing back different multi-talker recordings. The desired speech source and the background noise were recorded separately and mixed afterwards. Due to the moving speaker, the input SNR in the head-mounted reference microphone signals varied between approximately 0 and 6 dB, while the input SNR in the external microphone signals varied between approximately 0 and 11 dB. All signals were recorded synchronously, hence neglecting synchronization and latency aspects.

Figure 2: Experimental setup with BTE hearing devices mounted on a dummy head and three external microphones.

All signals were sampled at a sampling rate of 16 kHz and processed in the short-time Fourier transform domain using a 32 ms square-root Hann window with 50% overlap. To distinguish between speech-plus-noise and noise-only time-frequency bins, the estimated speech presence probabilities [16] in the three (noisy) external microphone signals were averaged and thresholded. The noisy input covariance matrix R_y and the noise covariance matrix R_n were then recursively estimated during detected speech-plus-noise and noise-only bins, respectively, using time constants of 250 ms (R_y) and 1.5 s (R_n).

As performance measure, we used the binaural SNR improvement (∆BSNR), which is defined similarly to (8) and (9) as

∆BSNR = 10 log10( (w_L^H R_x w_L + w_R^H R_x w_R) / (w_L^H R_n w_L + w_R^H R_n w_R) ) − 10 log10( (e_L^H R_x e_L + e_R^H R_x e_R) / (e_L^H R_n e_L + e_R^H R_n e_R) ).   (24)

The binaural SNR improvement was computed in the time domain using the shadow filter approach. Seven different RTF vector estimates were considered for the BMVDR in (11):

• The state-of-the-art CW estimate in (14)
• The SC estimate in (15) using each external microphone separately, i.e., SC-1, SC-2 and SC-3
• The proposed SC-C method using the combination vectors in (18), (19) and (23), i.e., iSNR, AV and mSNR

6.2. Results

Figure 3 depicts the ∆BSNR (averaged over time and frequency) for all considered RTF vector estimates. The CW method, as a state-of-the-art benchmark, yields an average ∆BSNR of 10.4 dB. The SC method using one external microphone, i.e., SC-1, SC-2 and SC-3, yields an average ∆BSNR of about 9 dB and hence could not reach the performance of the CW method. The input SNR-based combination (iSNR) yields an average ∆BSNR of 10.3 dB, which is similar to the CW method. The averaging combination (AV) yields an average ∆BSNR of only 8.9 dB, which is even worse than the SC method using each external microphone separately. This can probably be explained by the rather different RTF vector estimation errors for the three external microphones. The SNR-maximizing combination (mSNR) yields an average ∆BSNR of 10.7 dB, hence outperforming all other combination procedures and RTF vector estimation methods. Comparing the computational complexity of the three best methods, the CW method has the largest complexity due to the 7-dimensional EVD, whereas the mSNR combination only requires a 3-dimensional EVD and the iSNR combination does not require an EVD at all. Nevertheless, the mSNR combination improves the ∆BSNR by about 0.5 dB compared to the iSNR combination.

Figure 3: Binaural SNR improvement for all considered RTF vector estimation methods, averaged over time and frequency.

Figure 4 depicts the ∆BSNR over time (averaged over frequency) for the SC-C method using the proposed combination vectors in more detail. It can be observed that the mSNR combination outperforms the iSNR and AV combinations at almost all time instances. Sound files of the input and output signals are available at [18].

Figure 4: Binaural SNR improvement over time for the iSNR, AV and mSNR combination procedures, averaged over frequency.
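The ∆BSNR measure in (24) can be sketched as follows; the reference-microphone indices (0 for the left device and M_D = 2 for the right device in the stacked vector of (2)) are an assumption of this sketch, and the covariance matrices are toy data rather than the recorded signals:

```python
import numpy as np

def delta_bsnr(w_L, w_R, R_x, R_n, ref_L=0, ref_R=2):
    """Binaural SNR improvement in dB, eq. (24).

    ref_R = 2 assumes M_D = 2 head-mounted mics per device, so the right
    reference microphone is index M_D in the stacked signal vector.
    """
    M = R_x.shape[0]
    e_L = np.zeros(M); e_L[ref_L] = 1.0
    e_R = np.zeros(M); e_R[ref_R] = 1.0
    # Output binaural SNR: summed speech PSDs over summed noise PSDs
    num_out = (w_L.conj() @ R_x @ w_L + w_R.conj() @ R_x @ w_R).real
    den_out = (w_L.conj() @ R_n @ w_L + w_R.conj() @ R_n @ w_R).real
    # Input binaural SNR at the reference microphones
    num_in = (e_L @ R_x @ e_L + e_R @ R_x @ e_R).real
    den_in = (e_L @ R_n @ e_L + e_R @ R_n @ e_R).real
    return 10.0 * np.log10((num_out / den_out) / (num_in / den_in))
```

As a sanity check, passing the reference-microphone selection vectors themselves as filters yields a ∆BSNR of 0 dB, since the output then equals the unprocessed reference signals.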
7. CONCLUSIONS
In this paper, we proposed to use the SC-based RTF vector estimation method for a scenario where multiple external microphones are incorporated into the BMVDR processing of a binaural hearing device. Each external microphone was used to obtain an SC-based RTF vector estimate. We proposed to linearly combine the different RTF vector estimates using an input SNR-based selection, simple averaging, and a combination that maximizes the narrowband output SNR of the BMVDR. Experimental evaluation in a dynamic scenario with a moving speaker in a reverberant environment showed that the SNR-maximizing combination yields the largest binaural SNR improvement and also outperforms the state-of-the-art covariance whitening method.
8. REFERENCES

[1] S. Doclo, W. Kellermann, S. Makino, and S. E. Nordholm, "Multichannel signal enhancement algorithms for assisted listening devices: Exploiting spatial diversity using multiple microphones," IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 18-30, Mar. 2015.
[2] S. Doclo, S. Gannot, D. Marquardt, and E. Hadad, "Binaural speech processing with application to hearing devices," in Audio Source Separation and Speech Enhancement. Wiley, 2018, ch. 18, pp. 413-442.
[3] B. Cornelis, S. Doclo, T. Van den Bogaert, J. Wouters, and M. Moonen, "Theoretical analysis of binaural multi-microphone noise reduction techniques," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 2, pp. 342-355, Feb. 2010.
[4] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Transactions on Signal Processing, vol. 49, no. 8, pp. 1614-1626, 2001.
[5] A. Bertrand and M. Moonen, "Robust distributed noise reduction in hearing aids with external acoustic sensor nodes," EURASIP Journal on Advances in Signal Processing, vol. 2009, 14 pages, Jan. 2009.
[6] J. Szurley, A. Bertrand, B. van Dijk, and M. Moonen, "Binaural noise cue preservation in a binaural noise reduction system with a remote microphone signal," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 5, pp. 952-966, May 2016.
[7] M. Farmani, M. S. Pedersen, Z.-H. Tan, and J. Jensen, "Informed sound source localization using relative transfer functions for hearing aid applications," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 3, pp. 611-623, Mar. 2017.
[8] D. Yee, H. Kamkar-Parsi, R. Martin, and H. Puder, "A noise reduction post-filter for binaurally-linked single-microphone hearing aids utilizing a nearby external microphone," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 26, no. 1, pp. 5-18, Jan. 2018.
[9] R. Ali, T. van Waterschoot, and M. Moonen, "Generalised sidelobe canceller for noise reduction in hearing devices using an external microphone," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, Apr. 2018, pp. 521-525.
[10] R. Ali, T. van Waterschoot, and M. Moonen, "Completing the RTF vector for an MVDR beamformer as applied to a local microphone array and an external microphone," in Proc. International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, Sep. 2018, pp. 211-215.
[11] N. Gößling and S. Doclo, "Relative transfer function estimation exploiting spatially separated microphones in a diffuse noise field," in Proc. International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, Sep. 2018, pp. 146-150.
[12] N. Gößling and S. Doclo, "RTF-steered binaural MVDR beamforming incorporating an external microphone for dynamic acoustic scenarios," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 416-420.
[13] S. Markovich, S. Gannot, and I. Cohen, "Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1071-1086, Aug. 2009.
[14] S. Markovich-Golan and S. Gannot, "Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015, pp. 544-548.
[15] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, "Acoustic beamforming for hearing aid applications," in Handbook on Array Processing and Sensor Networks. Wiley, 2010, pp. 269-302.
[16] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383-1393, May 2012.
[17] T. C. Lawin-Ore and S. Doclo, "Reference microphone selection for MWF-based noise reduction using distributed microphone arrays," in