Neural Kalman Filtering for Speech Enhancement
Wei Xue, Gang Quan, Chao Zhang, Guohong Ding, Xiaodong He, Bowen Zhou
JD AI Research
{xuewei27, quangang, chao.zhang, dingguohong, xiaodong.he, bowen.zhou}@jd.com

Abstract
Statistical signal processing based speech enhancement methods adopt expert knowledge to design the statistical models and linear filters, which is complementary to the deep neural network (DNN) based methods, which are data-driven. In this paper, by using expert knowledge from statistical signal processing for network design and optimization, we extend the conventional Kalman filtering (KF) to the supervised learning scheme, and propose neural Kalman filtering (NKF) for speech enhancement. Two intermediate clean speech estimates are first produced from recurrent neural networks (RNN) and linear Wiener filtering (WF) separately, and are then linearly combined by a learned NKF gain to yield the NKF output. Supervised joint training is applied to NKF to learn to automatically trade off between the instantaneous linear estimation made by the WF and the long-term non-linear estimation made by the RNN. The NKF method can be seen as using expert knowledge from the WF to regularize the RNN estimates to improve its generalization to noise conditions unseen in training. Experiments in different noisy conditions show that the proposed method outperforms the baseline methods both in terms of objective evaluation metrics and automatic speech recognition (ASR) word error rates (WERs).
Index Terms: Speech Enhancement, Neural Kalman Filtering, Deep Neural Network
1. Introduction
Speech enhancement aims to suppress the environmental noise without distorting the target speech, and its importance has been well recognized with wide applications in communication, hearing aids and automatic speech recognition (ASR). Although multichannel speech enhancement using a microphone array has shown promising advantages over single-channel processing, single-channel speech enhancement is still of great interest due to its simple setup in most practical scenarios.

Speech enhancement is traditionally treated within the framework of statistical signal processing. Generally, the target is to design a linear filter based on carefully-chosen statistical models of speech and noise. Typical algorithms include Wiener filtering (WF) [1–3], minimum mean-squared error (MMSE) amplitude estimators [4–6], and maximum a posteriori (MAP) [7, 8] based methods. As prior expert knowledge has been adopted to design the statistical models, these methods can operate fully unsupervised, which eliminates the need for training data, and the filter coefficients are computed adaptively according to the derived analytical formulas and the updated statistics of speech and noise. However, since the model assumptions are not always valid, the performance of the statistical signal processing based methods is limited in realistic cases, especially when the noise is non-stationary.

The performance of speech enhancement has been dramatically improved by using deep neural networks (DNNs). With the capability of modelling complex non-linear transformations, instead of relying on expert knowledge, the DNN learns the mapping from the noisy speech to the clean target directly in a supervised, data-driven manner.
Different DNN structures have been explored, including the feed-forward network (FNN) [9–11], the convolutional neural network (CNN) [12–14], as well as the recurrent neural network (RNN) [15–17].

Generalization to unseen noise types, speakers and transmission channels is an essential issue for DNN based methods. Since the statistical signal processing based methods are unsupervised and knowledge-driven, some approaches have been proposed to exploit this expert knowledge for speech enhancement [18–20]. Generally, the statistics needed for the filter design are estimated from a DNN, and linear filtering is then applied to obtain the enhanced speech. However, as the strategy is basically two-stage, the expert knowledge is not well exploited in the DNN design and optimization, and the performance of linear filtering largely depends on the accuracy of the estimated statistics.

In this paper, we propose a new speech enhancement method by fully integrating statistical signal processing into the DNN. A neural Kalman filter (NKF) is developed, which extends the conventional Kalman filtering (KF) to the supervised learning scheme, and implicitly exploits both the modelling capability of DNNs and the expert knowledge used in signal processing for speech enhancement. Clean speech estimates from recurrent neural networks (RNN) and linear WF are obtained, and are linearly combined by an NKF gain to yield the NKF output. By integrating the signal processing components, a network structure dedicated to speech enhancement can be designed, and such components can serve as regularization for the network. In addition, the proposed method also overcomes the problem of unrealistic model assumptions in KF. We conduct experiments in different noisy conditions, and evaluations on both objective speech quality and ASR demonstrate the effectiveness of the proposed method.
2. Signal Model and Conventional KF
We first describe the conventional statistical signal processing based KF, which will facilitate the introduction of the proposed NKF in the following sections. Compared with other signal processing based methods, KF additionally considers the temporal evolution of speech in the optimal filter design, such that the artifacts caused by inaccurate noise level estimation are suppressed. The algorithm generally works in the modulation domain [21–24], which regards the signal amplitude in each frequency bin as a time-varying sequence. The clean speech is first predicted according to the linear prediction (LP) model of speech, and then updated by incorporating the instantaneous noisy observation.

Figure 1: Diagram of KF based speech enhancement.

Assuming the speech and noise are uncorrelated, the amplitude of the noisy speech is expressed by:

|Y(t, f)| = |X(t, f)| + |V(t, f)|,   (1)

where Y(t, f), X(t, f), and V(t, f) represent the short-time Fourier transform (STFT)-domain signals of the noisy speech, clean speech and noise, respectively. By modelling the clean speech amplitude as an auto-regressive (AR) process, |X(t, f)| can be further expressed by a P-order LP model, and in matrix form we have:

x(t, f) = A(f) x(t − 1, f) + u W(t, f),   (2)

where x(t, f) = [|X(t, f)|, |X(t − 1, f)|, ..., |X(t − P + 1, f)|]^T is the hidden state vector of the KF, A(f) is the state transition matrix defined in [21] according to the LP coefficients, u = [1, 0, ..., 0]^T is a P × 1 vector, and W(t, f) is the LP residual. In practice, the unknown LP coefficients are estimated via LP analysis on the output of the WF.

As shown in Fig. 1, the KF based speech enhancement has two stages: predicting and updating. Given the hidden state x̂(t − 1|t − 1, f) that consists of the clean speech estimates in the previous frame, the LP estimate of the clean speech x̂(t|t − 1, f) is first obtained using the LP model (2), as:

x̂(t|t − 1, f) = A(f) x̂(t − 1|t − 1, f).   (3)

The estimate is then updated by incorporating the noisy observation in the current frame, using a Kalman gain G(t, f), as

x̂(t|t, f) = [I − G(t, f) u^T] x̂(t|t − 1, f) + G(t, f) |Y(t, f)|,   (4)

where G(t, f) is determined by comparing the noise variance σ_v² = E{V(t, f) V*(t, f)} with the covariance matrix of the LP residual R_ee(t|t − 1, f), as

G(t, f) = R_ee(t|t − 1, f) u / (σ_v² + u^T R_ee(t|t − 1, f) u),   (5)

such that the KF output x̂(t|t, f) approximates the LP estimate in strong noise cases and relies more on the observation otherwise.
The statistics of the LP residual can be updated according to the LP model and the KF output; the details, which can be found in [21–24], are omitted here for simplicity. It has been shown by Xue et al. [25–27] that the KF becomes the WF if the LP information is not exploited, thus the KF can be seen as a combination of temporal LP and instantaneous linear filtering.
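For illustration only, the predict-update recursion of (2)–(5) for a single frequency bin can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the LP coefficients, the noise variance, and the residual covariance R_ee are assumed given here, rather than estimated as in [21–24].

```python
import numpy as np

def kf_update(x_prev, R_ee, a, sigma_v2, y_abs):
    """One modulation-domain KF step for a single frequency bin.

    x_prev  : (P,) previous clean-amplitude state estimate x(t-1|t-1)
    R_ee    : (P, P) covariance of the LP residual (assumed given here)
    a       : (P,) LP coefficients forming the first row of A(f)
    sigma_v2: noise variance estimate
    y_abs   : |Y(t, f)|, the noisy amplitude observation
    """
    P = len(x_prev)
    # State transition matrix: first row holds the LP coefficients,
    # the sub-diagonal shifts past amplitudes down by one frame.
    A = np.zeros((P, P))
    A[0, :] = a
    A[1:, :-1] = np.eye(P - 1)
    u = np.zeros(P)
    u[0] = 1.0

    # Predict step, Eq. (3).
    x_pred = A @ x_prev
    # Kalman gain, Eq. (5).
    G = (R_ee @ u) / (sigma_v2 + u @ R_ee @ u)
    # Update step, Eq. (4).
    x_new = (np.eye(P) - np.outer(G, u)) @ x_pred + G * y_abs
    return x_new, G

# Strong noise (large sigma_v2) drives the gain toward zero, so the
# output falls back on the LP prediction rather than the observation.
x, G = kf_update(np.array([0.5, 0.4]), np.eye(2) * 0.1,
                 np.array([0.9, 0.05]), sigma_v2=10.0, y_abs=2.0)
```

With a large σ_v², the gain is close to zero and the update barely moves the LP prediction, matching the behaviour described after (5).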
3. Proposed Method
Although the speech evolution is exploited in the design of the conventional signal processing based KF, the statistical models, for instance the modulation-domain additive signal model in (1) and the LP model of speech evolution in (2), may not be appropriate to represent different realistic noisy conditions. An external estimator is also required to provide the noise level estimation in Fig. 1, which, in the context of statistical signal processing, is usually based on the unrealistic stationary noise assumption.

The above shortcomings of the conventional KF can be overcome by exploiting DNNs to learn the complex models from data. A novel NKF is proposed by extending the concept of the conventional KF to the supervised learning scheme. Different from conventional DNN-based methods, which typically learn end-to-end non-linear mappings from the noisy signal to the clean target, we integrate the statistical signal processing into the network, and use the KF's "predict-update" scheme to control the behaviour of the network more effectively. The statistical signal processing components can be seen as providing prior expert knowledge to the network, and serve as a regularization for network optimization.
Figure 2:
Diagram of the proposed NKF.
The diagram of the proposed NKF is shown in Fig. 2, which is similar to the KF, and consists of a) long short-term memory (LSTM) prediction, b) linear WF, and c) linear weighting between the LSTM prediction and the WF output based on a learned NKF gain. Details will be described in the following subsections.
3.1. LSTM Prediction Network

The LSTM has a strong capability for sequential modelling and has shown superior performance in noise reduction [15–17]. An LSTM prediction network is constructed as in Fig. 3, which not only predicts the clean speech amplitude, but also estimates the prediction residual from the noisy input, in accordance with the KF framework.

Figure 3: Structure of the LSTM prediction network.

With the notations in Section 2, in each frame t, an F × 1 feature vector is formed using the noisy amplitudes in different frequency bins as

y(t) = [|Y(t, 1)|, |Y(t, 2)|, ..., |Y(t, F)|]^T,   (6)

where F is the number of frequency bins. A sequence of feature vectors is first fed into the LSTM layers to model the temporal evolution of speech. Then, in each time step, two separate fully-connected output layers transform the LSTM outputs into the clean amplitude prediction and the prediction residual, respectively.

We note that unlike the conventional KF, which relies solely on the previous filter output to perform LP, the LSTM prediction network uses the noisy amplitude spectrum as input. This is because the speech evolution has already been modelled by the hidden state propagation in the LSTM, and the additional noisy observation in each frame can help the LSTM achieve a more accurate prediction.

3.2. Linear Wiener Filtering

The LSTM prediction will be updated by combining it with the output of linear WF. Based on the signal model in (1), in each TF bin, under the MMSE criterion the optimal Wiener filter H_Wiener(t, f) is given by

H_Wiener(t, f) = σ_x²(t, f) / (σ_x²(t, f) + σ_v²(t, f)) = (σ_y²(t, f) − σ_v²(t, f)) / σ_y²(t, f),   (7)

where σ_y²(t, f) = E{|Y(t, f)|²} is the variance of the noisy speech, and σ_x²(t, f) and σ_v²(t, f) are defined similarly on the clean speech and noise, respectively. The clean speech amplitude is then obtained simply as

|X̂(t, f)|_Wiener = H_Wiener(t, f) |Y(t, f)|.   (8)

In practice σ_v²(t, f) is unknown and can conventionally be estimated by algorithms such as [28–30], while σ_y²(t, f) can be computed by averaging the noisy power spectrum over the past few frames.

Here we integrate the above linear filtering process into the proposed NKF framework.
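The per-bin Wiener gain of (7)–(8) is straightforward once the two variances are available. The sketch below assumes both variances are already estimated; the clamping of the gain to [0, 1] is a practical safeguard of this sketch, not part of Eq. (7):

```python
import numpy as np

def wiener_amplitude(y_abs, sigma_y2, sigma_v2):
    """Eqs. (7)-(8): per-TF-bin Wiener gain applied to noisy amplitudes.

    y_abs   : noisy amplitude spectrum |Y(t, f)|, any shape
    sigma_y2: noisy-speech variance estimates, same shape
    sigma_v2: noise variance estimates, same shape
    """
    # H = (sigma_y^2 - sigma_v^2) / sigma_y^2, clamped to [0, 1] so an
    # over-estimated noise variance cannot yield a negative amplitude.
    H = np.clip((sigma_y2 - sigma_v2) / np.maximum(sigma_y2, 1e-12), 0.0, 1.0)
    return H * y_abs

# A bin where noise dominates is attenuated heavily; a clean bin passes.
out = wiener_amplitude(np.array([1.0, 1.0]),
                       np.array([1.0, 1.0]),
                       np.array([0.9, 0.01]))
```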
Since the noise variance σ_v²(t, f) is unknown, a noise estimation network is first constructed, which takes the amplitude vector y(t) and the variance vector Γ_y(t) = [σ_y²(t, 1), ..., σ_y²(t, F)]^T as inputs, and outputs the noise variances in the different frequency bins. The noise estimation network is a ReLU-activated FNN, and uses a left-side context window to concatenate the input features in order to capture short-term dependencies. As shown in Fig. 2, once the noise variance is known, the clean speech amplitude is calculated by (7) and (8).

3.3. NKF Gain and Linear Weighting

The clean amplitude estimates from the LSTM prediction network, |X̂(t, f)|_LSTM, and from the WF, |X̂(t, f)|_Wiener, are finally combined by linear weighting to yield the NKF output |X̂(t, f)|_NKF. Similar to (4), the weighting is controlled by an NKF gain G_NKF(t, f):

|X̂|_NKF = G_NKF · |X̂|_Wiener + (1 − G_NKF) · |X̂|_LSTM,   (9)

where the TF bin index "(t, f)" is omitted for simplicity.

The Kalman gain in (5) trades off between the LSTM prediction and the WF output, and can actually be designed more flexibly according to the speech dominance in each TF bin [27]. In the proposed method, the NKF gain is computed from an NKF gain network. The network takes the concatenation of the LSTM prediction residual and the noise variance as input, and outputs the linear weight within the range [0, 1]. The structure of the network is an FNN with ReLU activations in the hidden layers, and a Sigmoid activation in the output layer.

3.4. Network Training

Since the integrated linear filtering components are all differentiable, the proposed NKF can be directly optimized through back-propagation (BP). We choose the amplitude spectrum of the clean speech as the target and train the network under the mean-squared error (MSE) criterion.
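The final combination in (9) is a convex per-bin interpolation. A minimal sketch follows, with the gain supplied directly rather than produced by the NKF gain network:

```python
import numpy as np

def nkf_combine(x_wiener, x_lstm, g_nkf):
    """Eq. (9): per-TF-bin convex combination of the two estimates.

    x_wiener: WF amplitude estimate |X|_Wiener
    x_lstm  : LSTM amplitude prediction |X|_LSTM
    g_nkf   : NKF gain in [0, 1]; in the paper this comes from the
              Sigmoid-activated gain network, here it is passed in.
    """
    g = np.clip(g_nkf, 0.0, 1.0)  # Sigmoid output already lies in [0, 1]
    return g * x_wiener + (1.0 - g) * x_lstm

# g -> 1 trusts the instantaneous linear WF estimate;
# g -> 0 trusts the long-term non-linear LSTM estimate.
out = nkf_combine(np.array([2.0]), np.array([4.0]), np.array([0.25]))
```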
The network is trained in a sequence-to-sequence manner, and as a whole the noisy features consist of the noisy amplitude spectrum in the different frames, together with the left-side contexts and the variances of the amplitude spectrum in each frame.

Once the clean amplitude spectrum is obtained, the time-domain speech is recovered by the inverse STFT, which uses the phase spectrum of the noisy speech.
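Reusing the noisy phase, as described above, amounts to pairing the enhanced magnitude with the angle of the noisy STFT before the inverse transform. A sketch of that per-bin step (the inverse STFT itself is omitted):

```python
import numpy as np

def enhanced_stft(mag_hat, stft_noisy):
    """Combine the estimated clean magnitude with the noisy phase.

    mag_hat   : enhanced amplitude spectrum |X_hat(t, f)|
    stft_noisy: complex noisy STFT Y(t, f), same shape
    The result would then be passed to an inverse STFT to recover
    the time-domain waveform.
    """
    return mag_hat * np.exp(1j * np.angle(stft_noisy))

# The reconstructed bin keeps the noisy phase but the enhanced magnitude.
Y = np.array([3.0 + 4.0j])            # |Y| = 5
X = enhanced_stft(np.array([2.5]), Y)
```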
4. Experiment
The experiments are conducted using the Librispeech [31] corpus, which contains clean speech, the PNL-100Nonspeech-Sounds (PNL) [32] corpus, which contains 100 types of mostly non-stationary noise, as well as the noise subset of the MUSAN corpus (MUSAN-Noise) [33].

We create unmatched noisy conditions for training and testing. A 300-hour training set is prepared by repeatedly first picking a pair of clean speech and noise signals from the "CLEAN-360" subset of Librispeech and the PNL dataset, and then mixing them at a speech-to-noise ratio (SNR) level randomly chosen from {− } dB. The test set contains all utterances of the "TEST-CLEAN" and "TEST-OTHER" subsets of Librispeech, corrupted by 20 types of noise signals randomly selected from MUSAN-Noise at SNR levels of {− } dB.

The sample rate of all speech and noise signals is kHz, and the analysis window for STFT and feature extraction is samples with overlap. The LSTM prediction network of the proposed method has two 1024-node hidden LSTM layers, and both the noise estimation network and the NKF gain network have three layers with 1024 nodes in the hidden layer. The left-side context window for the noise estimation network is , and the σ_y²(t, f) in (8) is computed using the previous frames.

We compare the proposed method with a) LSTM-2L, which has the same structure as the LSTM prediction network of the proposed method, except that the prediction residual output is discarded; b) LSTM-4L, a variation of LSTM-2L with four hidden LSTM layers, so that a larger LSTM network is also compared against, since additional components are integrated into NKF; c) WF; and d) KF. For a fair comparison, the WF and KF also use the noise estimation from a separately-trained DNN which has the same structure as the noise estimation network of the proposed method. During training, the batch size of all methods is , and the sequence length of the proposed method and the LSTM-based baselines is frames. All networks are trained for epochs.

The speech enhancement performances of the different methods are evaluated in terms of objective speech quality measures and the word error rate (WER) of ASR. We use the Itakura-Saito distance (ISD) [34], frequency-weighted segmental SNR (FwSegSNR) [35] and perceptual evaluation of speech quality (PESQ) [36] as objective metrics. The ASR is evaluated using ESPNET [37] with an officially-released pretrained model (large Transformer with SpecAug and Transformer language model) on the Librispeech dataset.

Figure 4: Objective evaluation of speech enhancement performances on the TEST-CLEAN subset of Librispeech.

Figure 5: Objective evaluation of speech enhancement performances on the TEST-OTHER subset of Librispeech.

Table 1: ASR performances in WER (%) on the TEST-CLEAN subset of Librispeech (rows: Noisy, WF, KF, LSTM-2L, LSTM-4L, NKF; columns: test SNR in dB).

Table 2: ASR performances in WER (%) on the TEST-OTHER subset of Librispeech (rows: Noisy, WF, KF, LSTM-2L, LSTM-4L, NKF; columns: test SNR in dB).

The objective evaluation results of the different methods at different SNRs are depicted in Fig. 4 and Fig. 5. It can be seen that the proposed NKF consistently yields the lowest ISD and the highest FwSegSNR across the different scenarios, indicating that the proposed method has the best performance in suppressing noise while controlling speech distortion. In almost all cases, the proposed method achieves PESQ comparable with the LSTM-2L. By comparing LSTM-2L and LSTM-4L, we can notice that simply using a larger model does not necessarily improve the performance, and may instead cause over-fitting, which degrades the performance in unmatched noise conditions.

The superiority of the proposed NKF is also demonstrated by the ASR performances, which are summarized in Tab. 1 and Tab. 2. In all SNR conditions of the TEST-CLEAN subset, the utterances after NKF noise reduction have the lowest WER. On the TEST-CLEAN subset, the proposed NKF improves on the LSTM-2L by a 12.14% relative WER reduction when the SNR is − dB, and reduces the WER by 53.45% compared with the raw noisy signal in the dB condition. Similar conclusions can be drawn from the performances on the TEST-OTHER subset.
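The training mixtures described above are built by scaling a noise clip so that the mixture reaches a chosen SNR. A common recipe for this is sketched below; this is an illustration of the standard power-ratio scaling, not necessarily the authors' exact procedure:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then mix."""
    noise = noise[:len(speech)]  # assume the noise clip is at least as long
    p_s = np.mean(speech ** 2)   # speech power
    p_n = np.mean(noise ** 2)    # noise power before scaling
    # Choose the scale so that p_s / (scale^2 * p_n) = 10^(snr_db / 10).
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
n = rng.standard_normal(16000)
mix = mix_at_snr(s, n, snr_db=0.0)
```

At 0 dB the scaled noise carries exactly the same average power as the speech, which is what the measured SNR of the mixture confirms.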
5. Conclusion
An NKF based speech enhancement method is proposed by integrating the DNN with statistical signal processing, following the framework of the conventional KF. The statistical signal processing components can be seen as providing prior expert knowledge, and serve as a regularization for the network. Experimental results in different noisy conditions show the effectiveness of the proposed method in both objective speech quality evaluation and ASR.

6. References

[1] N. Wiener,
The Extrapolation, Interpolation and Smoothing of Stationary Time Series. New York, NY, USA: John Wiley & Sons, Inc., 1949.
[2] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, no. 12, pp. 1586–1604, Dec. 1979.
[3] J. Chen, J. Benesty, Y. Huang, and S. Doclo, "New insights into the noise reduction Wiener filter," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1218–1234, Jul. 2006.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[5] ——, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443–445, 1985.
[6] R. Martin, "Speech enhancement using MMSE short time spectral estimation with gamma distributed speech priors," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2002.
[7] T. Lotter and P. Vary, "Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model," EURASIP J. on Applied Signal Processing, pp. 1110–1126, Jan. 2005.
[8] P. J. Wolfe and S. J. Godsill, "Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement," EURASIP J. on Applied Signal Processing, vol. 2003, no. 10, pp. 1043–1051, Sep. 2003.
[9] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, Jan. 2014.
[10] C. Schüldt and P. Händel, "Decay rate estimators and their performance for blind reverberation time estimation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 8, pp. 1274–1284, Aug. 2014.
[11] M. Tu and X. Zhang, "Speech enhancement based on deep neural networks with skip connections," in . IEEE, Mar. 2017.
[12] N. Mamun, S. Khorram, and J. H. Hansen, "Convolutional neural network-based speech enhancement for cochlear implant recipients," in Proc. Conf. of Intl. Speech Commun. Assoc. (INTERSPEECH), 2019.
[13] Z. Ouyang, H. Yu, W.-P. Zhu, and B. Champagne, "A fully convolutional neural network for complex spectrogram processing in speech enhancement," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, May 2019.
[14] S. R. Park and J. W. Lee, "A fully convolutional neural network for speech enhancement," in Interspeech 2017. ISCA, Aug. 2017.
[15] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Latent Variable Analysis and Signal Separation. Springer International Publishing, 2015, pp. 91–99.
[16] T. Gao, J. Du, L.-R. Dai, and C.-H. Lee, "Densely connected progressive learning for LSTM-based speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Apr. 2018.
[17] L. Sun, J. Du, L.-R. Dai, and C.-H. Lee, "Multiple-target deep learning for LSTM-RNN based speech enhancement," in . IEEE, 2017.
[18] X. Wang, Z. Wang, X. Li, Y. Na, Q. Fu, and Y. Yan, "LSTM network supported linear filtering for the CHiME 2016 challenge," in CHiME-4 workshop, 2016.
[19] B. Xia and C. Bao, "Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification," Speech Communication, vol. 60, pp. 13–29, May 2014.
[20] Y. Yang and C. Bao, "DNN-based AR-Wiener filtering for speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Apr. 2018.
[21] S. So and K. K. Paliwal, "Modulation-domain Kalman filtering for single-channel speech enhancement," Speech Communication, vol. 53, no. 6, pp. 818–829, Jul. 2011.
[22] Y. Wang and M. Brookes, "Speech enhancement using a robust Kalman filter post-processor in the modulation domain," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, May 2013, pp. 7457–7461.
[23] N. Dionelis and M. Brookes, "Speech enhancement using modulation-domain Kalman filtering with active speech level normalized log-spectrum global priors," in Proc. European Signal Processing Conf. (EUSIPCO), 2017.
[24] Y. Wang and M. Brookes, "Speech enhancement using a modulation domain Kalman filter post-processor with a Gaussian mixture noise model," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7024–7028.
[25] W. Xue, A. H. Moore, M. Brookes, and P. A. Naylor, "Multichannel Kalman filtering for speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, Apr. 2018.
[26] ——, "Modulation-domain multichannel Kalman filtering for speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1833–1847, 2018.
[27] ——, "Modulation-domain parametric multichannel Kalman filtering for speech enhancement," in Proc. European Signal Processing Conf. (EUSIPCO), Roma, Italy, Sep. 2018.
[28] S. Rangachari and P. C. Loizou, "A noise-estimation algorithm for highly non-stationary environments," Speech Communication, vol. 48, no. 2, pp. 220–231, Feb. 2006.
[29] I. Cohen and B. Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Process. Lett., vol. 9, no. 1, pp. 12–15, Jan. 2002.
[30] T. Gerkmann and R. C. Hendriks, "Noise power estimation based on the probability of speech presence," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2011, pp. 145–148.
[31] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Apr. 2015.
[32] G. Hu and D. Wang, "A tandem algorithm for pitch estimation and voiced speech segregation," vol. 18, no. 8, pp. 2067–2079, Nov. 2010.
[33] D. Snyder, G. Chen, and D. Povey, "MUSAN: A Music, Speech, and Noise Corpus," 2015, arXiv:1510.08484v1.
[34] S. R. Quackenbush, T. P. Barnwell, III, and M. A. Clements, Objective Measures of Speech Quality. Prentice Hall, Jan. 1988.
[35] Y. Hu and P. C. Loizou, "Evaluation of objective measures for speech enhancement," in Proc. Conf. of Intl. Speech Commun. Assoc. (INTERSPEECH), 2006, pp. 1447–1450.
[36] ITU-T, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," Intl. Telecommunications Union (ITU-T), Recommendation P.862, Feb. 2001.
[37] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "Espnet: End-to-end speech processing toolkit,"