Speech Dereverberation in the STFT Domain

Abstract

Reverberation is damaging to both the quality and the intelligibility of a speech signal. We propose a novel single-channel method of dereverberation based on a linear filter in the Short Time Fourier Transform (STFT) domain. Each enhanced frame is constructed from a linear sum of nearby frames, with coefficients derived from the channel impulse response. The results show that, given knowledge of the impulse response, the method reduces the reverberation in a signal to a similar low level regardless of how reverberant the input is.

Richard Stanton and Mike Brookes

Imperial College London, {rs408, mike.brookes}@imperial.ac.uk

Index Terms: dereverberation, inverse channel filtering, speech enhancement

1. INTRODUCTION

Speech is inherently non-stationary, so speech processing algorithms are frequently applied to short frames in which the speech is quasi-stationary. Furthermore, speech is sparse in the time-frequency domain, which allows the speech content to be distinguished and enhanced well. The Short Time Fourier Transform (STFT) domain is therefore the domain of choice for many speech and audio algorithms.

Reverberation arises from multi-path propagation of an acoustic signal, $s[n]$, through a channel with impulse response $h[n]$ to a microphone. Reverberation causes speech to sound distant and spectrally distorted, which reduces intelligibility [1]. The further the source is from the microphone, the greater the effects of reverberation. Automatic speech recognition is severely hindered by reverberation [2, 3]. Beamformers utilise the time difference of arrival at each sensor in an array to spatially filter a sound field. Because of multi-path propagation, beamformers fail in reverberant environments. Channel inversion methods are therefore of high importance in spatial filtering.

Several dereverberation algorithms already exist in the STFT domain. For example, spectral subtraction has been used to estimate the power spectrum of the late reverberation and subtract it from the current spectrum to leave the direct path [4]; this approach was extended in [5] to introduce the frequency dependence of the reverberation time.

Other dereverberation methods utilise knowledge of the system impulse response, $h[n]$; however, none of them operates in the STFT domain. Least squares has previously been used to create an inverse filter from knowledge of the impulse response [6]. This was extended into the multichannel domain with the Multiple-input/output INverse Theorem (MINT) [7], which is capable of finding exact inverse filters through the use of multiple transmission channels.

We wish to create an algorithm in the STFT domain which utilises knowledge of the impulse response, $h[n]$, for dereverberation. However, simply creating an inverse filter in the STFT domain is not straightforward, as the STFT process is time-variant. We present a single-channel method of dereverberation based on a linear filter that combines nearby frames and uses a novel method to account for the time-varying nature of the STFT domain. The frames are linearly combined using coefficients computed from the impulse response through a least-squares method.

The remainder of the paper is organised as follows. In Section 2 the method is outlined. Section 3 details the process used to select the optimal coefficients for dereverberation. The results of the algorithm are detailed in Section 4 and conclusions are drawn in Section 5.

2. STFT-DOMAIN DEREVERBERATION

The observed reverberant signal, $y[n]$, at the microphone is the convolution of the source signal, $s[n]$, and the channel impulse response, $h[n]$:

$$y[n] = \sum_{m=0}^{M-1} h[m]\, s[n-m]. \tag{1}$$

Exploiting knowledge of the channel impulse response, we propose a new method to reduce the effects of reverberation on $y[n]$, to form an estimate, $\hat{s}[n]$, of the original signal.

The reverberant signal is transformed into the STFT domain using a window, $w[n]$, and an overlap factor $Q$:

$$Y_k[l] = \sum_{n=0}^{QR-1} y[n+lR]\, w[n]\, e^{-j 2\pi kn/(QR)}, \tag{2}$$

where $l$ represents the frame number, $k$ the frequency bin and $R$ the frame increment. The enhanced signal is formed through a linear sum of nearby frames of the reverberant signal:

$$\hat{S}_k[l] = \sum_{r=-A}^{B} G_k[r]\, Y_k[l-r], \tag{3}$$

where $A$ is the number of future frames and $B$ is the number of past frames to be used in the enhancement. The resulting frames are then transformed back into time frames with the inverse Discrete Fourier Transform (DFT):

$$\hat{s}[l,m] = \frac{1}{QR} \sum_{k=0}^{QR-1} \hat{S}_k[l]\, e^{j 2\pi km/(QR)}, \tag{4}$$

which are then overlap-added [8] to form the enhanced time signal:

$$\hat{s}[n] = \sum_{l=l_n-Q+1}^{l_n} \hat{s}[l, n-lR]\, w[n-lR], \tag{5}$$

where $l_n = \lfloor n/R \rfloor$. Perfect reconstruction, $\hat{s}[n] = y[n]$, is obtained with the coefficients $G_k[r] = \delta[r]$ provided that the window used for analysis and synthesis satisfies:

$$\sum_{q=0}^{Q-1} w^2[qR+n] = 1 \quad \forall\, n \in [0, R-1].$$
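The analysis, frame-filtering and overlap-add chain of (2) to (5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the square-root periodic Hann window (chosen because the window is applied at both analysis and synthesis, so its squared shifted copies must sum to one) and the random test signal are all assumptions made for the example.

```python
import numpy as np

def stft(y, w, R):
    """Eq. (2): windowed frames of length len(w) = Q*R, one every R samples."""
    N = len(w)
    n_frames = (len(y) - N) // R + 1
    frames = np.stack([y[l * R:l * R + N] * w for l in range(n_frames)])
    return np.fft.fft(frames, axis=1)            # shape (frames, bins)

def filter_frames(Y, G, A, B):
    """Eq. (3): S_hat[l] = sum over r = -A..B of G[r] * Y[l - r], per bin.

    G has shape (A + B + 1, bins); row i holds the coefficients for r = i - A.
    Frames outside the available range are treated as zero.
    """
    L = Y.shape[0]
    S = np.zeros_like(Y)
    for i, r in enumerate(range(-A, B + 1)):
        src = np.arange(L) - r
        ok = (src >= 0) & (src < L)
        S[ok] += G[i] * Y[src[ok]]
    return S

def istft(S, w, R, length):
    """Eqs. (4)-(5): inverse DFT of each frame, then windowed overlap-add."""
    N = len(w)
    frames = np.fft.ifft(S, axis=1).real * w
    out = np.zeros(length)
    for l, frame in enumerate(frames):
        out[l * R:l * R + N] += frame
    return out

# Demo (assumed parameters): with G_k[r] = delta[r] the chain returns its input.
Q, R, A, B = 4, 64, 9, 9
N = Q * R
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)   # periodic Hann
w = np.sqrt(hann / 2)          # shifted copies of w**2 sum to one when Q = 4
y = np.random.default_rng(0).standard_normal(4096)
Y = stft(y, w, R)
G = np.zeros((A + B + 1, N))
G[A] = 1.0                     # row A corresponds to r = 0
s_hat = istft(filter_frames(Y, G, A, B), w, R, len(y))
interior = slice((Q - 1) * R, Y.shape[0] * R)   # samples covered by Q frames
print(np.max(np.abs(s_hat[interior] - y[interior])))
```

Only the interior samples are compared because the first and last few samples are covered by fewer than $Q$ frames in this simple framing.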

3. OPTIMAL COEFFICIENTS

Assuming that $h[n]$ is known, our goal is to determine the filter coefficients $\mathbf{G}_k = \left[ G_k[-A] \;\dots\; G_k[B] \right]^T$ so that $\hat{s}[n] \approx s[n]$.

Consider the response of (3) when the input signal is an impulse at sample $\lambda$: $s^{(\lambda)}[n] = \delta[n-\lambda]$, $\lambda \in [0, R-1]$. When processing in the STFT domain, the earliest output frame that is affected by the impulse occurs at $l_{\min} = 1 - Q - A$, whereas the latest frame affected is $l_{\max} = 1 + B + \left\lfloor \frac{M+\lambda-1}{R} \right\rfloor$. Applying the process from (3), we can find a relationship between the STFT of the channel impulse response, $H^{(\lambda)}[l,k]$, and the desired impulse response, $\tilde{H}^{(\lambda)}[l,k]$, which is the STFT of the direct-path impulse response when no reflections are present.

We determine $\mathbf{G}_k$ to minimise the difference between the two. So for each frequency bin, $k$, we have an overdetermined set of equations:

$$\hat{H}^{(\lambda)}[l,k;\mathbf{G}_k] = \sum_{r=-A}^{B} G_k[r]\, H^{(\lambda)}[l-r,k] \approx \tilde{H}^{(\lambda)}[l,k], \tag{6}$$

for each $\lambda \in [0 : R-1]$ and $l \in [l_{\min} : l_{\max}]$. This gives us $(2+A+B+Q)R + M - 1$ equations with $A+B+1$ unknowns. This process is shown in Fig. 1. We combine $B$ past frames with $A$ future frames to best approximate the current frame of the desired impulse response.

[Figure 1; axes: time, frequency]

Fig. 1. The plots show the STFT of both $H[l,k]$ and $\tilde{H}[l,k]$. For each frequency bin the filter linearly combines future and past frames of $H[l,k]$ to best match $\tilde{H}[l,k]$.

We solve these equations using linear least squares [9] to find:

$$\mathbf{G}_k = \arg\min_{\mathbf{G}_k} \sum_{\lambda=0}^{R-1} \sum_{l=l_{\min}}^{l_{\max}} \left| \hat{H}^{(\lambda)}[l,k;\mathbf{G}_k] - \tilde{H}^{(\lambda)}[l,k] \right|^2. \tag{7}$$

The overall impulse response of the computed channel is time-variant, but we can determine an average channel response as the inverse STFT of:

$$\hat{H}[l,k] = \frac{1}{R} \sum_{\lambda=0}^{R-1} \hat{H}^{(\lambda)}[l,k;\mathbf{G}_k] \exp\!\left( \frac{j 2\pi k \lambda}{QR} \right), \tag{8}$$

where a phase shift is applied to correspond with the sample position within the frame.
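The per-bin least-squares problem of (6) and (7) can be set up and solved as in the sketch below. It is illustrative only: the helper names are hypothetical, the frame range follows the $l_{\min}$ and $l_{\max}$ expressions in the text, and the desired response `h_target` (here simply the direct-path tap) is an assumption about how the direct-path response would be supplied.

```python
import numpy as np

def frame_spectrum(x, l, w, R):
    """STFT frame l of x as in Eq. (2); x is taken as zero outside its support."""
    N = len(w)
    seg = np.zeros(N)
    start = l * R
    lo, hi = max(start, 0), min(start + N, len(x))
    if lo < hi:
        seg[lo - start:hi - start] = x[lo:hi]
    return np.fft.fft(seg * w)

def build_system(h, h_target, w, R, A, B):
    """Stack Eq. (6) over every impulse shift lambda and output frame l.

    Returns X of shape (rows, A + B + 1, bins) and t of shape (rows, bins).
    """
    N, M = len(w), len(h)
    Q = N // R
    X, t = [], []
    for lam in range(R):
        x = np.concatenate([np.zeros(lam), h])          # channel output for an impulse at lam
        d = np.concatenate([np.zeros(lam), h_target])   # desired (direct-path) output
        l_min = 1 - Q - A
        l_max = 1 + B + (M + lam - 1) // R
        for l in range(l_min, l_max + 1):
            X.append([frame_spectrum(x, l - r, w, R) for r in range(-A, B + 1)])
            t.append(frame_spectrum(d, l, w, R))
    return np.array(X), np.array(t)

def solve_coefficients(X, t):
    """Eq. (7): an independent complex least-squares solve for each bin k."""
    n_taps, K = X.shape[1], t.shape[1]
    G = np.empty((n_taps, K), dtype=complex)
    for k in range(K):
        G[:, k] = np.linalg.lstsq(X[:, :, k], t[:, k], rcond=None)[0]
    return G

def residual(X, t, G):
    """Total squared error of Eq. (6) summed over shifts, frames and bins."""
    return float(np.sum(np.abs(np.einsum('ruk,uk->rk', X, G) - t) ** 2))

# Small synthetic example (all values assumed for illustration).
rng = np.random.default_rng(0)
Q, R, A, B = 4, 8, 2, 2
N = Q * R
w = np.sqrt((0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)) / 2)
h = np.zeros(24)
h[0] = 1.0                                               # direct path
h[1:] = 0.3 * rng.standard_normal(23) * np.exp(-np.arange(1, 24) / 8)
h_target = np.zeros(24)
h_target[0] = h[0]
X, t = build_system(h, h_target, w, R, A, B)
G = solve_coefficients(X, t)
```

Because each bin is solved separately, the cost is one small least-squares problem with $A+B+1$ unknowns per frequency bin.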
The above minimisation problem minimises the reverberation present in the enhanced signal. Let us define the error in the impulse responses in both the time domain and the STFT domain. In the time domain:

$$h_e[n] = \tilde{h}[n] - \hat{h}[n].$$

The error in a single frame in the STFT domain is:

$$H_e[l,k] = \tilde{H}^{(\lambda)}[l,k] - \hat{H}^{(\lambda)}[l,k],$$

where the dependence on $\lambda$ is left implicit. The total power of the error in the STFT domain across all frames, frequencies and shifts is denoted:

$$P_f = \frac{1}{QR} \sum_{k=0}^{QR-1} \sum_{\lambda=0}^{R-1} \sum_{l=l_{\min}}^{l_{\max}} \left| H_e[l,k] \right|^2.$$

Using Parseval's theorem on each frame, with $h_e[l,n]$ denoting sample $n$ of error frame $l$, the power of the error in the time domain is given as:

$$\sum_{\lambda=0}^{R-1} \sum_{l=l_{\min}}^{l_{\max}} \sum_{n=0}^{QR-1} \left| h_e[l,n] \right|^2 = \frac{1}{QR} \sum_{k=0}^{QR-1} \sum_{\lambda=0}^{R-1} \sum_{l=l_{\min}}^{l_{\max}} \left| H_e[l,k] \right|^2. \tag{9}$$

Alternatively, we can express the error in the time domain as the sum of the frames weighted by the window function:

$$h_e[lR+n] = \sum_{q=0}^{Q-1} w[qR+n]\, h_e[l-q, qR+n].$$

We sum the squared error over all time samples to give the total error:

$$\sum_{l=0}^{N-1} \sum_{n=0}^{R-1} h_e^2[lR+n] = \sum_{l=0}^{N-1} \sum_{n=0}^{R-1} \left( \sum_{q=0}^{Q-1} w[qR+n]\, h_e[l-q, qR+n] \right)^2. \tag{10}$$

Thus, applying the Cauchy-Schwarz inequality to (9) and (10), we can show that the error in the STFT domain is an upper bound for the time-domain error:

$$\sum_{l=0}^{N-1} \sum_{n=0}^{R-1} \left( \sum_{q=0}^{Q-1} w[qR+n]\, h_e[l-q, qR+n] \right)^2 \le \frac{1}{QR} \sum_{k=0}^{QR-1} \sum_{\lambda=0}^{R-1} \sum_{l=0}^{N-1} \left| H_e[l,k] \right|^2.$$

Therefore solving the related problem in the STFT domain places an upper bound on the amount of reverberation in our output signal.
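One way to fill in the Cauchy-Schwarz step is sketched below for a single shift $\lambda$. It assumes that the window satisfies $\sum_{q} w^2[qR+n] = 1$, writes $h_e[l, m]$ for sample $m$ of error frame $l$, and takes error frames outside the summed range to be zero so that the re-indexing $l' = l - q$, $m = qR + n$ is exact.

```latex
% Assumption: \sum_{q=0}^{Q-1} w^2[qR+n] = 1 for every n in [0, R-1].
\begin{align*}
\Bigl(\sum_{q=0}^{Q-1} w[qR+n]\, h_e[l-q,\, qR+n]\Bigr)^{2}
  &\le \Bigl(\sum_{q=0}^{Q-1} w^{2}[qR+n]\Bigr)
       \Bigl(\sum_{q=0}^{Q-1} h_e^{2}[l-q,\, qR+n]\Bigr)
   = \sum_{q=0}^{Q-1} h_e^{2}[l-q,\, qR+n], \\
\sum_{l}\sum_{n=0}^{R-1}\sum_{q=0}^{Q-1} h_e^{2}[l-q,\, qR+n]
  &= \sum_{l'}\sum_{m=0}^{QR-1} h_e^{2}[l',\, m]
   = \frac{1}{QR}\sum_{l'}\sum_{k=0}^{QR-1} \bigl|H_e[l',k]\bigr|^{2}.
\end{align*}
```

The first line is Cauchy-Schwarz applied to the $Q$ overlapping frames that contribute to one output sample; the last equality is Parseval's theorem for a length-$QR$ DFT.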

4. EVALUATION

To evaluate the reduction in reverberation, we use two metrics: the Direct-to-Reverberant Ratio (DRR) [10] and the Signal-to-Reverberation Ratio (SRR) [11]. To evaluate the perceptual quality of the enhanced signals, Perceptual Evaluation of Speech Quality (PESQ) [12] is used. The DRR [dB] is defined as follows:

$$\mathrm{DRR} = \frac{10}{R} \sum_{\lambda=0}^{R-1} \log_{10} \left\{ \frac{E_d(\lambda)}{\left( \sum_n h_\lambda^2[n] \right) - E_d(\lambda)} \right\}, \tag{11}$$

where $E_d$ is the direct path energy. The direct path in the impulse response may occur in between samples, so the path energy will be spread across the nearby samples with a sinc function. Thus the direct path energy is computed using a convolution with a sinc function with a varying offset until a maximum is found:

$$E_d(\lambda) = \max_\sigma \sum_{n=-\eta}^{\eta} \left( \frac{\sin(\pi(n+\sigma))}{\pi(n+\sigma)}\, h_\lambda[n+n_d] \right)^2,$$

where $n_d$ is the nearest index of the direct path in the impulse response, $\eta = 8$ is the number of sidelobes of the sinc function to use in the summation and $\sigma \in [...]$ is the offset that finds the maximum power.

The SRR [dB] is defined on a frame-by-frame basis and then averaged across the whole signal:

$$\mathrm{SRR}_{\mathrm{seg}} = \frac{10}{M} \sum_{k=0}^{M-1} \log_{10} \left\{ \frac{\sum_{n=kR}^{kR+QR-1} s_d^2[n]}{\sum_{n=kR}^{kR+QR-1} \left( s_d[n] - \hat{s}[n] \right)^2} \right\}, \tag{12}$$

where $M$ is the total number of frames, $s_d[n]$ represents the original direct path signal and $\hat{s}[n]$ is the enhanced signal. It gives a measure of the reverberation power in relation to the useful direct path. It is a similar measure to the DRR but uses speech signals rather than the channel response.

The optimal coefficients from Section 3 were calculated for a Room Impulse Response (RIR) and the corresponding channel response from (8) was found. A total of 600 RIRs were used to test the system. These correspond to a single source and microphone in 40 different rooms and 15 different position combinations in each. The impulse responses were generated using the Room Impulse Response Generator from [13], which is based on the image method [14]. In all cases $Q = 4$, $R = 64$, $A = 9$, $B = 9$.

As both the SRR and PESQ work on speech samples, the TIMIT core test set [15] was chosen. Each speech sample was convolved with each $h[n]$ before undergoing enhancement as described in (3).
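The two reverberation metrics can be sketched as follows. This is a simplified illustration, not the evaluation code used for the results: `srr_seg` follows (12) with a small `eps` guard against empty frames that is not part of the definition, and `drr_on_sample` evaluates the log-ratio of (11) for a single impulse response under the simplifying assumption that the direct path falls exactly on sample `n_d`, so no sinc interpolation is performed. Both function names are hypothetical.

```python
import numpy as np

def srr_seg(s_d, s_hat, Q, R, eps=1e-12):
    """Segmental SRR in dB, Eq. (12): mean over frames of length Q*R hopped by R.

    eps is an assumed guard against division by zero, not part of Eq. (12).
    """
    N = Q * R
    n_frames = (len(s_d) - N) // R + 1
    ratios = []
    for k in range(n_frames):
        seg = slice(k * R, k * R + N)
        signal = np.sum(s_d[seg] ** 2)
        error = np.sum((s_d[seg] - s_hat[seg]) ** 2)
        ratios.append(10 * np.log10((signal + eps) / (error + eps)))
    return float(np.mean(ratios))

def drr_on_sample(h, n_d):
    """DRR in dB for one impulse response whose direct path sits exactly on n_d."""
    e_direct = h[n_d] ** 2
    return float(10 * np.log10(e_direct / (np.sum(h ** 2) - e_direct)))

# Usage with synthetic data (assumed values).
rng = np.random.default_rng(0)
s_d = rng.standard_normal(2048)
print(srr_seg(s_d, 0.9 * s_d, Q=4, R=64))
print(drr_on_sample(np.array([1.0, 0.5, 0.5]), n_d=0))
```

Averaging the per-frame values in dB, rather than taking one ratio over the whole signal, stops loud passages from dominating the score.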
The before and after signals, $y[n]$ and $\hat{s}[n]$, were then used with the SRR and PESQ metrics to gauge any improvement.

The performance of the proposed algorithm has been compared to the time-domain inverse filter proposed by Widrow [6]. The method designs an inverse filter, $g[n]$, through least squares to best invert the system response, $h[n]$ [7]:

$$\begin{bmatrix} \vdots \\ \vdots \end{bmatrix} = \begin{bmatrix} h[0] & 0 & \cdots & 0 \\ \vdots & h[0] & \ddots & \vdots \\ h[N_h-1] & \vdots & \ddots & 0 \\ 0 & h[N_h-1] & & h[0] \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & h[N_h-1] \end{bmatrix} \times \begin{bmatrix} g[0] & g[1] & \cdots & g[M-1] \end{bmatrix}^T,$$

where the left-hand side is the desired overall response and $N_h = 1024$ in our case.

The DRR was computed for both $h[n]$ and $\hat{h}[n]$ across all 600 RIRs. The results comparing the DRR before and after the algorithm are shown in Fig. 2. The DRR improved for all the impulse responses tested except those where the original DRR exceeded [...] dB. The resulting performance is independent of the amount of reverberation in the initial signal and hovers close to [...], giving an improvement of up to 34 dB. Thus the algorithm is able to reduce reverberation to the same level regardless of how reverberant the original channel is.
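The time-domain least-squares inverse filter used as the comparison baseline can be sketched as below. The target response is assumed here to be a unit impulse with an optional `delay` parameter; both that choice and the function names are assumptions for illustration, not details taken from [6] or [7].

```python
import numpy as np

def convolution_matrix(h, M):
    """(N_h + M - 1) x M matrix whose column j is h delayed by j samples."""
    Nh = len(h)
    H = np.zeros((Nh + M - 1, M))
    for j in range(M):
        H[j:j + Nh, j] = h
    return H

def ls_inverse_filter(h, M, delay=0):
    """Length-M filter g minimising ||d - H g||^2 for an impulse target d.

    delay (assumed parameter) positions the unit impulse in the target.
    """
    H = convolution_matrix(h, M)
    d = np.zeros(H.shape[0])
    d[delay] = 1.0
    g, *_ = np.linalg.lstsq(H, d, rcond=None)
    return g

# Usage on a short minimum-phase response (assumed toy example).
h_toy = np.array([1.0, 0.5])
g_toy = ls_inverse_filter(h_toy, M=32)
equalised = np.convolve(h_toy, g_toy)     # should be close to a unit impulse
print(np.round(equalised[:4], 6))
```

For a non-minimum-phase room response a single-channel filter of finite length can only approximate the inverse, which is why the solution is taken in the least-squares sense.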

[Figure 2 plot: "DRR Performance", DRR [dB] against RIR index, for the proposed method, the inverse filter and the reverberant signal]

Fig. 2. The DRR after the algorithm for 600 RIRs shows a sizeable improvement over the initial reverberant signal, with a mean improvement of [...].

The averaged SRR for each RIR is shown in Fig. 3. It follows a similar pattern to the DRR. The enhanced signals hover around [...]. When the original SRR surpassed [...], the algorithm was unable to make any further improvement and caused slight degradation to these non-reverberant signals.

The averaged PESQ results are shown in Fig. 4. The enhancement gave a small gain in perceptual quality which, whilst it does not demonstrate the removal of reverberation, does show that the algorithm does not introduce significant distortion. Because the improvement in perceived speech quality is limited, the algorithm is best suited to applications that require signals without reverberation, rather than to end-user perceptual improvement.

Samples of the reverberant and processed speech are available on the internet [16].

5. CONCLUSIONS

We have described a novel approach to dereverberation using a linear filter in the STFT domain. Using knowledge of the channel impulse response, we can find an optimal combination of frames to reduce the effects of reverberation. The algorithm gives clear performance gains in dereverberation. Both the DRR and the SRR show that, regardless of the amount of initial reverberation present, the enhanced signal has a similar low level of reverberation, whilst not introducing distortion.

We have shown that the proposed STFT-domain algorithm is as good as the time-domain inverse filter, allowing us to apply dereverberation in the more appropriate domain without loss of performance.

We have overcome the time-variance of the STFT by considering all the possible impulse positions within a single frame.

By working in the STFT domain we can solve each frequency band, $k$, independently. The above gives a useful framework that suits many applications already processing in this domain.

[Figure 3 plots: "SRR Performance", SRR seg [dB] against RIR index, for the proposed method, the inverse filter and the reverberant signal; "SRR Improvement Proposed vs Inverse Filter", occurrences against SRR difference in dB]

Fig. 3. The speech signals after enhancement show a much improved SRR compared to the reverberant signals, with a mean improvement of [...].

[Figure 4 plot: "PESQ Performance", PESQ against RIR index, for the proposed method, the inverse filter and the reverberant signal]

Fig. 4. PESQ is shown for 600 different RIRs before and after enhancement. Each point is the average of 240 utterances for that RIR, with a mean improvement of [...].08 PESQ.

REFERENCES

[1] P. A. Naylor and N. D. Gaubitch, “Speech dereverberation,” in Proc. Intl. Workshop Acoust. Echo and Noise Control (IWAENC), Eindhoven, The Netherlands, Sept. 2005.

[2] Brian E. D. Kingsbury, Nelson Morgan, and Steven Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech Communication, vol. 25, no. 1, pp. 117–132, 1998.

[3] A. Sehr, M. Zeller, and W. Kellermann, “Hands-free speech recognition using a reverberation model in the feature domain,” in Proc. European Signal Processing Conf. (EUSIPCO), Florence, Italy, Sept. 2006.

[4] K. Lebart, J. M. Boucher, and P. N. Denbigh, “A new method based on spectral subtraction for speech dereverberation,” Acta Acoustica, vol. 87, pp. 359–366, 2001.

[5] E. A. P. Habets, “Single-channel speech dereverberation based on spectral subtraction,” in Proc. Workshop Circuits, Systems and Signal Processing (ProRISC), Veldhoven, The Netherlands, Nov. 2004, pp. 250–254.

[6] Bernard Widrow and Eugene Walach, “Adaptive signal processing for adaptive control,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1984, vol. 9, pp. 191–194.

[7] M. Miyoshi and Y. Kaneda, “Inverse filtering of room acoustics,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2, pp. 145–152, Feb. 1988.

[8] J. Allen and L. Rabiner, “A unified approach to short-time Fourier analysis and synthesis,” Proc. IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.

[9] Charles L. Lawson and Richard J. Hanson, Solving Least Squares Problems, vol. 161, SIAM, 1987.

[10] Adelbert W. Bronkhorst and Tammo Houtgast, “Auditory distance perception in rooms,” Nature, vol. 397, no. 6719, pp. 517–520, 1999.

[11] Schuyler R. Quackenbush, Thomas P. Barnwell, III, and Mark A. Clements, Objective Measures of Speech Quality, Prentice Hall, Jan. 1988.

[12] ITU-T, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” Recommendation P.862, International Telecommunications Union (ITU-T), Feb. 2001.

[13] E. A. P. Habets, “Room impulse response generator,” Tech. Rep., Technische Universiteit Eindhoven, 2006.

[14] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, Apr. 1979.

[15] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue, “TIMIT acoustic-phonetic continuous speech corpus,” Corpus LDC93S1, Linguistic Data Consortium, Philadelphia, 1993.

[16] Richard Stanton, “Speech dereverberation in the STFT domain,”