Dereverberation using joint estimation of dry speech signal and acoustic system
Sanna Wager∗, Keunwoo Choi, Simon Durand
Spotify, S.A., New York
Indiana University Bloomington, Luddy School of Informatics, Computing, and Engineering
[email protected], [email protected], [email protected]

∗ The research work described in this report was supported by the internship program at Spotify, S.A., in collaboration with the music intelligence team.
Abstract
The purpose of speech dereverberation is to remove the quality-degrading effects of a time-invariant impulse response filter from the signal. In this report, we describe an approach to speech dereverberation that involves joint estimation of the dry speech signal and of the room impulse response. We explore deep learning models that apply to each task separately, and how these can be combined in a joint model with shared parameters.
Index Terms: dereverberation, deep learning, room impulse response estimation
1. Introduction
Room acoustics can cause degradation of audio quality in recordings produced outside of professional recording studios. Acoustics in everyday environments such as the home commonly produce undesirable effects, including excessive coloration from early reflections and masking from late reflections [1]. One may wish to mitigate these effects by dereverberating the speech signals. This type of speech enhancement can be applied in the post-processing stage, where the full signal has been recorded in advance, or in real time, if low latency is required. This technical report addresses the former case.

We focus on the single-channel case, where only one microphone was used to record the speech signal. The single-channel case is challenging because, unlike in the multi-channel case, a reverberation reduction program cannot utilize the differences in time of arrival of reflections across microphones. In the recent REVERB challenge [2], only one among eleven submitted single-channel systems both reduced the perceived amount of reverberation and improved overall quality.

The task of reverberation reduction has been framed in multiple ways over the past five decades. Habets [3] describes three trends. The first involves modeling the acoustic system and using this information to equalize or filter the reverberant signal. The second approach is similar to denoising, treating the source and reverberation signals as independent. The third directly estimates the dry signal without trying to model the room acoustics.

The techniques in the first trend, which involves modeling the acoustic system, provide clean results under specific conditions [4] and can provide the ability to control the Room Impulse Response (RIR) [3]. However, they can be challenging to develop even when the RIR is known. Though the RIR function is a linear transformation, in realistic room acoustics an exact inverse is often either unstable or acausal. Exact equalization requires very long filters, causing increased computational complexity and numerical instability. Sometimes, computing the inverse can be intractable [4, 5]. Another issue is that imperfections in the RIR measurement will affect the result. Faller describes how inversion-based filtering only works in the "sweet spot" [1]: movement from the sweet spot to other positions in the room degrades the quality. Furthermore, changes in object positions in the room, temperature, or humidity will change the RIR, which can be described as a weakly non-stationary process [5]. The RIR estimate then needs to be updated across the signal instead of only being computed once. For these reasons, techniques that remove only the most problematic components of the reverberation while leaving the rest intact have often been chosen for practical uses [5].

Approaches involving denoising or source separation do not necessarily generalize to the dereverberation task. The generalization is most problematic in the case of models where the target and noise signals are assumed independent, for example, time-frequency bin masking [6]. Masking works best when bins belong either to the signal or to the noise, with little overlap between the two. A reverberant signal, though, is a sum of delayed and filtered versions of the original signal, which are highly correlated and likely present in the same time-frequency bins.
Another assumption commonly made for denoising that does not apply to dereverberation is that every time frame can be treated separately. This assumption does not generalize well because RIRs often last longer than one Short-Time Fourier Transform (STFT) window.

Direct estimation approaches, e.g., [7], focus instead on estimating a clean signal given a reverberant input. They do not directly estimate room acoustics, or separate the dry and reverberant signals. Direct estimation can produce good results when the model is able to represent the distribution of the target outputs, i.e., speech signals. However, the nature of the direct estimation approach means that some useful information about the room acoustics is not utilized, which could negatively affect the model's ability, for example, to adapt to unseen room conditions.

In this report, we conduct preparatory work for developing a joint model, which combines the direct estimation approach with estimation of the RIR. A joint model, which involves shared parameters for these two tasks, can produce higher accuracy than separate models. We utilize the fact that we have access both to pairs of dry and reverberant signals and to the RIRs used to generate these pairs via convolution. Access to the target dry speech and RIR signals allows us to train both components of the model in a supervised manner. In this report, we explore separate models for each task, and provide an overview of how these models can be combined into a single one.

Section 2 covers related work. In Section 3, we provide an overview of the joint training technique, and we describe the models we use separately for the tasks of dry speech estimation and RIR estimation. In Section 4.1, we describe our experiments along with the dataset and pre-processing steps. Finally, Section 5 concludes the report.
2. Related work
Deep neural networks (DNNs) have been investigated for the reverberation reduction task. Fully connected models that learn and process the signal frame by frame are proposed in [8, 9]. The first model outputs a magnitude STFT, while the second outputs both real and imaginary components. Williamson and Wang [9] predict one frame at a time from an input of multiple consecutive frames, where each input frame includes a set of different time-frequency features extracted from the audio.

Ernst et al. [7] investigate a fully convolutional model, a U-net [10], to estimate dry speech directly from reverberant and noisy speech. The authors predict the magnitude spectrogram and use the reverberant phase for reconstruction of the audio signal. Results are state-of-the-art. In addition to supervised training, the authors experiment with using Generative Adversarial Networks (GANs) to improve over their supervised U-net model, but their performance decreases slightly.

GANs for the dereverberation task have also been proposed by Li et al. [11]. The authors experiment with supervised models and further train the one that performed best, a Long Short-Term Memory (LSTM) network [12], using adversarial training. The downstream task in that work, however, is automatic character recognition instead of audio signal enhancement.

Engel et al. [13] introduce a DSP-based DNN that includes differentiable oscillators, filters, and reverberation components, allowing for separate control over parameters such as pitch and loudness. The model has an encoder-decoder structure, and indicates potential for encoding the dry component of the signal while discarding the room reverberation, then reconstructing a dry signal, optionally convolving it with a different RIR.

We also mention relevant models designed for related tasks such as speech denoising. Tan and Wang [14] introduce the convolutional recurrent neural network (CRNN) model. The CRNN combines the feature extraction capability of convolutional filters with the sequential modeling of an LSTM in an encoder-decoder setup. Huang et al. address the speech separation task, using joint training [15] to improve accuracy by estimating both the mask and the target speech signal. Relevant work also includes non-negative matrix factorization approaches [16], which separately dereverberate magnitude STFTs over every frequency bin. This technique greatly reduces the number of parameters and inspires our approach to RIR prediction.

While prior work exists that involves estimation of RIR parameters such as the direct-to-reverberant ratio or T60 [17, 18], we are not aware of work that directly predicts the RIR from the reverberant speech.
3. Models
Figure 1 illustrates our joint training approach, which involves estimating both the dry signal and the RIR. Our assumption is that sharing some of the parameters used for the two tasks can improve accuracy compared to what the two models would reach when trained separately. The model would learn to extract the structure of the RIR from the reverberant signal, which would also make it better able to predict the dry signal. In addition to the dry speech and RIR training targets, we introduce a third target: the RIR estimate output by the model is convolved with the dry signal to produce a reconstruction of the reverberant signal. Given that we synthesize reverberant speech by convolving the dry signal and the RIR, all three targets are known, and we can train the joint model in a supervised manner with three weighted loss components. In the following sections, we describe the models we use separately for the two tasks. Once the separate models produce sufficiently accurate results, they can be combined into a joint model, with the individual loss terms summed in a weighted manner.

Figure 1: Overview of the training technique. The DNN is trained to jointly estimate the dry signal and the RIR. Parameters are shared across the two tasks. The estimated RIR is convolved with the known target dry signal to produce a reverberant signal estimate.
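To make the objective concrete, the following is a minimal PyTorch sketch of the three-term loss, assuming magnitude-STFT representations, L1 distances, and scalar weights; the report specifies neither the loss functions nor the weighting scheme, so these are placeholder choices. The third term convolves the estimated RIR with the known dry signal along time, independently per frequency bin, which approximates the reverberant magnitude STFT while ignoring phase.

import torch
import torch.nn.functional as F

def joint_loss(dry_est, rir_est, dry_true, rir_true, rev_true,
               w_dry=1.0, w_rir=1.0, w_rev=1.0):
    # dry_est, dry_true, rev_true: (batch, freq, frames) magnitude STFTs.
    # rir_est, rir_true: (batch, freq, rir_frames) RIR magnitude STFTs.
    loss_dry = F.l1_loss(dry_est, dry_true)
    loss_rir = F.l1_loss(rir_est, rir_true)
    # Third target: convolve the estimated RIR with the known dry signal
    # per frequency bin (linear convolution via zero-padded FFTs) to
    # reconstruct the reverberant magnitude STFT.
    t = dry_true.shape[-1]
    n = t + rir_est.shape[-1] - 1
    rev_rec = torch.fft.irfft(
        torch.fft.rfft(dry_true, n=n) * torch.fft.rfft(rir_est, n=n), n=n
    )[..., :t]                       # truncate to the input length
    loss_rev = F.l1_loss(rev_rec, rev_true)
    return w_dry * loss_dry + w_rir * loss_rir + w_rev * loss_rev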
3.1. Dry speech estimation

We first compare two existing models for estimating the dry speech directly from the reverberant signal. We modify the LSTM proposed in [19] by using a Bi-directional Gated Recurrent Unit (Bi-GRU) [20] instead. The model includes residual connections between layers. Given that we use a bi-directional model, we use half the hidden dimension used by Wang et al., which amounts to 380 but produces the original 760 dimensions when combining the two directions. We do not use exponential decay like the authors and rely solely on the learning rate adjustment of the Adam optimizer [21]. The target is also different, as we estimate the log-STFT instead of recognizing characters. We empirically find that the training parameters used by Wang et al. produce the best results for our model. We also implement the U-net introduced in [7] for comparison.
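For reference, below is a minimal PyTorch sketch of the residual Bi-GRU described above. The hidden size of 380 per direction (760 when the two directions are combined) and the log-STFT target follow the text; the number of layers and the linear input/output projections are our assumptions, as the report does not specify them.

import torch
import torch.nn as nn

class ResidualBiGRU(nn.Module):
    """Bi-GRU dry-speech estimator with residual connections between
    layers (sketch). Maps a reverberant log-STFT to a dry log-STFT."""
    def __init__(self, n_freq=257, hidden=380, n_layers=3):
        super().__init__()
        # 257 bins correspond to the 32 ms frames (512-point FFT at
        # 16 kHz) used in Section 4; n_layers is an assumption.
        self.proj_in = nn.Linear(n_freq, 2 * hidden)
        self.grus = nn.ModuleList(
            nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
            for _ in range(n_layers)
        )
        self.proj_out = nn.Linear(2 * hidden, n_freq)

    def forward(self, x):              # x: (batch, frames, freq)
        h = self.proj_in(x)
        for gru in self.grus:
            out, _ = gru(h)            # out: (batch, frames, 760)
            h = h + out                # residual connection between layers
        return self.proj_out(h)        # estimated dry log-STFT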
3.2. RIR estimation

We then explore using a fully convolutional model to estimate the RIR from the reverberant signal. We again choose to operate in the magnitude-frequency domain. We introduce a convolutional model designed to capture the time-invariant filter structure across a few seconds of the speech signal. It includes filters that span multiple time frames but only one frequency bin at a time, assuming the reverberation can be isolated per frequency because the reverberant signal is a linear sum of delayed and filtered versions of the dry signal. The structure is as follows, with the first two values indicating the filter size (time and frequency axes) and the third the number of filters: (9 × 1, 16), (14 × 1, 32), (27 × 1, 64), (27 × 1, 32), (27 × 1, 16), (28 × 1, 4), (187 × 1, 126). Each convolution is followed by an exponential linear unit activation [22] except for the final one, which is followed by a rectified linear unit activation [23]. After the final layer, the filter axis with 126 dimensions becomes the time axis of the RIR. We apply this modification because the time-varying structures of the speech and RIR signals are unrelated.
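To make the layer list concrete, below is a minimal PyTorch sketch of this convolutional stack, assuming a single-channel magnitude-STFT input and valid (unpadded) convolutions; the final channel-to-time reinterpretation follows our reading of the text. Under these assumptions, roughly 313 input frames (about 5 seconds at a 16 ms hop) collapse the time axis to length 1, and the 126 output channels map to about 2 seconds of RIR frames, consistent with the RIR padding described in Section 4.

import torch
import torch.nn as nn

class RIREstimator(nn.Module):
    """Fully convolutional RIR estimator (sketch). Each filter spans
    several time frames but a single frequency bin, so every bin is
    processed by the same time-invariant filters."""
    def __init__(self):
        super().__init__()
        channels = [1, 16, 32, 64, 32, 16, 4, 126]   # from the report
        kernels = [9, 14, 27, 27, 27, 28, 187]       # time extents
        layers = []
        for i, k in enumerate(kernels):
            layers.append(nn.Conv2d(channels[i], channels[i + 1], (k, 1)))
            # ELU after every layer except the last, which uses ReLU.
            layers.append(nn.ReLU() if i == len(kernels) - 1 else nn.ELU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, 1, frames, freq) reverberant magnitude STFT.
        # With ~313 input frames, the valid convolutions shrink the
        # time axis to 1 (total reduction of 312 frames).
        y = self.net(x)        # (batch, 126, 1, freq)
        # Reinterpret the 126 filter channels as the RIR time axis.
        return y.squeeze(2)    # (batch, 126, freq) RIR magnitude STFT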
4. Experimental results
4.1. Dataset and pre-processing

For RIRs, we use publicly available recordings downloaded using Kaldi scripts (https://github.com/kaldi-asr/kaldi/blob/master/egs/aspire/s5/local/multi_condition/rirs/). Data sources consist of the Aachen Impulse Response Database [24], the C4DM Room Impulse Response Dataset [25], the RWCP Sound Scene Database in Real Acoustical Environments [26], the REVERB Challenge dataset's RIRs (omitting the noise signals) [27], and the Pori Concert Hall Impulse Responses [28]. These combined provide a total of 1069 RIRs. Additionally, Steinmetz kindly provided RIRs used for research on NeuralReverberator. This set includes 693 RIRs from a variety of sources.

We split the RIRs into 1362, 200, and 200 samples for training, validation, and testing, respectively. We group together signals that are too similar, e.g., those involving the same source in the same room, the same receiver in the same room, or the same RIR recorded with different microphones or microphone rotations. When the configurations were unknown, we group RIRs by the codes found in the file names. We assign groups larger than 20 to the training set for more balanced validation and test sets. We also limit the size of each group to 100 and discard the remaining RIRs. This data preparation results in a total of 823 groups distributed across the training, validation, and test sets.

We process the input signals using a sample rate of 16000 Hz. We convolve each dry speech signal with multiple, randomly selected RIRs. We align the dry and reverberant speech in time by applying the RIR delay to the dry speech, and we remove the near-silent part of the signal at the beginning. Finally, we truncate or pad the speech signals so that they last 5 seconds. We also zero-pad the RIRs to a minimum length of 2 seconds but do not truncate longer ones. We compute the STFT with a frame size of 32 ms and a hop length of 16 ms. We normalize the audio by dividing each STFT by its maximum.
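The pipeline below is a minimal numpy/librosa sketch of this data preparation, under stated assumptions: the RIR delay is measured as the time of the first sample exceeding 1% of the RIR's peak (the report does not give a threshold), and each signal's magnitude STFT is normalized by its own maximum as described above.

import numpy as np
import librosa
from scipy.signal import fftconvolve

SR = 16000
N_FFT, HOP = 512, 256              # 32 ms frames, 16 ms hop at 16 kHz
TARGET_LEN = 5 * SR                # 5-second training examples

def make_example(dry, rir, delay_threshold=0.01):
    """Create an aligned (dry, reverberant) magnitude-STFT pair from a
    dry speech signal and an RIR, both 16 kHz float arrays."""
    rev = fftconvolve(dry, rir)
    # Estimate the RIR delay as everything before its first significant
    # sample, and shift the dry signal by the same amount so that the
    # dry and reverberant signals stay time-aligned.
    delay = int(np.argmax(np.abs(rir) > delay_threshold * np.max(np.abs(rir))))
    dry = np.pad(dry, (delay, 0))
    # Truncate or zero-pad both signals to exactly 5 seconds.
    dry = librosa.util.fix_length(dry, size=TARGET_LEN)
    rev = librosa.util.fix_length(rev, size=TARGET_LEN)
    # Magnitude STFTs, each normalized by its own maximum.
    D = np.abs(librosa.stft(dry, n_fft=N_FFT, hop_length=HOP))
    R = np.abs(librosa.stft(rev, n_fft=N_FFT, hop_length=HOP))
    return D / D.max(), R / R.max()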
4.2. Results and discussion

The U-net introduced by Ernst et al. produced the best results for the dry speech direct estimation task. The Bi-GRU produced too many artifacts to be used in practice. We also explored fully connected models, but found that they were not as accurate as the U-net. For the RIR estimation task, the convolutional model output impulses that had the desired decay but were not precise enough. The convolutional model might not be powerful enough to capture the time-invariant filter from the speech signal. RIRs have an unusual structure, which might be difficult for a DNN to predict: a short and loud impulse followed by a rapid decay. Signals often estimated using DNNs, including speech signals or other time series, have a distribution that remains relatively stable over time. To properly estimate an RIR, the DNN needs to concentrate most of the output energy into a small area.

We believe a time-domain approach can be used to improve the precision of the U-net by including phase in the estimation. We also expect to improve RIR estimation results by using a sequential model, such as an LSTM in the time-frequency domain or a CRNN in the time domain. The time-alignment of the dry and reverberant speech, along with the removal of any silence at the beginning of the RIR signal, can also be further improved for better precision.
5. Conclusion
In this technical report, we describe a joint modeling approach to estimating dry speech and the room impulse response from a reverberant speech input. We provide an overview of separate deep learning models for each task and how these can be combined in future work. We operate on magnitude-STFT data. We find that a magnitude-STFT U-net [7] provides good results for direct speech estimation. For room impulse response estimation, we introduce a convolutional model and plan to improve it further by modeling the sequential nature of the data. Though there has been some prior work using representations that include phase, namely, time-domain signals [29] and complex STFTs [9], we leave representations that include phase for future work.
6. References

[1] C. Faller, "Modifying audio signals for reproduction with reduced room effect," 2019.
[2] K. Kinoshita, M. Delcroix, S. Gannot, E. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj et al., "A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research," EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 7, 2016.
[3] E. Habets, "Fifty years of reverberation reduction: From analog signal processing to machine learning," in AES 60th Conf. on DREAMS, 2016.
[4] S. Neely and J. Allen, "Invertibility of a room impulse response," The Journal of the Acoustical Society of America, vol. 66, no. 1, pp. 165–169, 1979.
[5] S. Cecchi, A. Carini, and S. Spors, "Room response equalization: A review," Applied Sciences, vol. 8, no. 1, p. 16, 2018.
[6] J. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
[7] O. Ernst, S. Chazan, S. Gannot, and J. Goldberger, "Speech dereverberation using fully convolutional networks," in European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 390–394.
[8] K. Han, Y. Wang, D. Wang, W. Woods, I. Merks, and T. Zhang, "Learning spectral mapping for speech dereverberation and denoising," Trans. Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982–992, 2015.
[9] D. Williamson and D. Wang, "Speech dereverberation and denoising using complex ratio masks," in Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5590–5594.
[10] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Int. Conf. Medical Image Computing and Computer-assisted Intervention. Springer, 2015, pp. 234–241.
[11] C. Li, T. Wang, S. Xu, and B. Xu, "Single-channel speech dereverberation via generative adversarial training," arXiv preprint arXiv:1806.09325, 2018.
[12] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[13] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, "DDSP: Differentiable digital signal processing," arXiv preprint arXiv:2001.04643, 2020.
[14] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Interspeech, 2018, pp. 3229–3233.
[15] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," Trans. Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
[16] N. Mohammadiha, P. Smaragdis, and S. Doclo, "Joint acoustic and spectral modeling for speech dereverberation using non-negative representations," in Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4410–4414.
[17] J. Eaton, N. Gaubitch, A. Moore, and P. Naylor, "Estimation of room acoustic parameters: The ACE challenge," Trans. Audio, Speech, and Language Processing, vol. 24, no. 10, pp. 1681–1693, 2016.
[18] N. Bryan, "Data augmentation and deep convolutional neural networks for blind room acoustic parameter estimation," arXiv preprint arXiv:1909.03642, 2019.
[19] K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, and L. Xie, "Investigating generative adversarial networks based speech dereverberation for robust speech recognition," arXiv preprint arXiv:1803.10132, 2018.
[20] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[22] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[23] V. Nair and G. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Machine Learning (ICML), 2010, pp. 807–814.
[24] M. Jeub, M. Schafer, and P. Vary, "A binaural room impulse response database for the evaluation of dereverberation algorithms," in Int. Conf. Digital Signal Processing. IEEE, 2009, pp. 1–5.
[25] R. Stewart and M. Sandler, "Database of omnidirectional and B-format room impulse responses," in Int. Conf. Acoustics, Speech and Signal Processing. IEEE, 2010, pp. 165–168.
[26] S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, "Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition," 2000.
[27] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in Workshop Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2013, pp. 1–4.
[28] J. Merimaa, T. Peltonen, and T. Lokki, "Concert hall impulse responses, Pori, Finland: Reference," Tech. Rep., 2005.
[29] Y. Luo and N. Mesgarani, "Real-time single-channel dereverberation and separation with time-domain audio separation network," in Interspeech, 2018.