Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks
Felix Grezes, Zhaoheng Ni, Viet Anh Trinh, Michael Mandel

The Graduate Center, City University of New York
Brooklyn College, City University of New York

{fgrezes, zni, vtrinh}@gradcenter.cuny.edu, [email protected]

ABSTRACT
Recent works have shown that Deep Recurrent Neural Networks using the LSTM architecture can achieve strong single-channel speech enhancement by estimating time-frequency masks. However, these models do not naturally generalize to multi-channel inputs from varying microphone configurations. In contrast, spatial clustering techniques can achieve such generalization but lack a strong signal model. Our work proposes a combination of the two approaches. By using LSTMs to enhance spatial clustering-based time-frequency masks, we achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance and generality of multi-channel spatial clustering. We compare our proposed system to several baselines on the CHiME-3 dataset. We evaluate the quality of the audio from each system using SDR from the BSS Eval toolkit and PESQ. We evaluate the intelligibility of the output of each system using word error rate from a Kaldi automatic speech recognizer.
Index Terms— Speech Enhancement, Microphone Array, LSTM, Spatial Clustering, Beamforming
1. INTRODUCTION
With speech recognition techniques approaching human performance on noise-free audio with a close-talking microphone [1], recent research has focused on the more difficult task of speech recognition in far-field, noisy environments. This task requires robust speech enhancement capabilities.

One approach to speech enhancement is spatial clustering, which groups together spectrogram points coming from the same spatial location [2]. This information can be used to drive beamforming, which linearly combines multiple microphone channels into an estimate of the original signal that is optimal under some test-time criterion [3]. This optimality is typically based on properties of the signals or the spatial configuration of the recordings at test time, with no training ahead of time.

Another approach is to use signal models trained using neural networks. Recent work on deep recurrent neural networks using the LSTM architecture [4] has achieved significant single-channel noise reduction [5, 6], and so there is interest in using trainable deep-learning models to perform beamforming. This is especially useful for optimizing beamformers directly for automatic speech recognition [7, 8], although such optimization must happen at training time on a large corpus of training data. Such models have difficulty generalizing across microphone arrays, including differences in the number of microphones and array geometries, such as occur between the AMI corpus [9, 10] and the CHiME challenge [11].

In contrast to deep learning-based beamforming, spatial clustering is an unsupervised method for performing source separation, so it easily adapts across microphone arrays [12, 13, 14]. Such methods group spectrogram points based on similarities in spatial properties, but are typically not able to take advantage of signal models, such as models of speech or noise.

Developed by Mandel et al. [12], Model-based EM Source Separation and Localization (MESSL) is a system that computes time-frequency spectrogram masks for source separation as a byproduct of estimating the spatial location of the sources. It does so using the expectation maximization (EM) algorithm, iteratively refining the estimates of the spatial parameters of the audio sources and the spectrogram regions dominated by each source.

While MESSL utilizes spatial information to separate multiple sources, it does not model the content of the original signals. This is an advantage when separating unknown sources, but performance can be improved when a model of the target source is available. The goal of this paper is to augment the capabilities of MESSL by adding a speech signal model based on neural networks trained to enhance the masks produced by MESSL.

In this paper we describe a novel method of combining single-channel LSTM-RNN-based speech enhancement with MESSL. We train a distinct LSTM model that uses the single-channel noisy audio to enhance the masks produced by MESSL. To show how these methods enhance the speech of the CHiME-3 outdoor 6-channel audio compared to baselines, we report the enhancement performance measured by the PESQ score [15]; the SDR, SIR, and SAR scores from the BSS Eval toolkit [16]; and the WER reported by the Kaldi toolkit [17] trained on a separate corpus, the indoor 8-channel AMI corpus.
2. RELATED WORK
Recently, Nugraha et al. [18] also studied multi-channel source separation using deep feedforward neural networks, using a multi-channel Gaussian model to combine the source spectrograms and take advantage of the spatial information present in the microphone array. They explore the efficacy of different loss functions and other model hyper-parameters. One of their findings is that the standard mean-squared error loss function performed close to the best. In contrast to our work, they do not use the spatial information that beamforming can provide.

Pfeifenberger et al. [19] proposed an optimal multi-channel filter which relies solely on speech presence probability. This speech-noise mask is predicted using a 2-layer feedforward neural network with features based on the leading eigenvector of the spatial covariance matrix of short time segments.
Using a single eigenvector makes the input to the DNN independent of the number of microphones, and thus adaptable to new microphone configurations. It is trained on the simulated noisy data portion of CHiME-3. They show that this filter improves the PESQ score of the audio. This approach uses an early fusion of the microphone channels before they are processed by the DNN, as opposed to our late fusion after the DNN processes each channel.

Heymann et al. [20, 21] also study the combination of multi-channel beamforming with a single-channel neural network model. Similar to ours, their proposed model consists of a bidirectional LSTM layer followed by feedforward layers, in their case three. Of particular note is the companion paper by Boeddeker et al. [22], which derives the derivative of an eigenvalue problem involving complex-valued eigenvectors, allowing their system to propagate errors in the final SNR through the beamforming and back to the single-channel DNNs. While we do not optimize our system in this end-to-end manner, the combination of MESSL with the per-channel DNNs may provide advantages in modeling the spatial information.

This paper builds upon previous work by the authors in [23] and [24], in which we propose two other methods of improving MESSL: a naive combination of the MESSL mask with the masks produced by an LSTM network trained to enhance noisy spectrograms, and using the LSTM-based masks to initialize the EM algorithm of MESSL. This previous work also describes a novel supervised MVDR beamforming technique to obtain cleaner references for the CHiME-3 dataset.
3. METHODS

3.1. Training the Networks to Enhance the MESSL Masks
To improve the quality of the binary masks produced by MESSL, we trained LSTM neural networks to enhance a MESSL mask by passing this mask and its associated noisy spectrogram through the network. We tested four different training targets: ideal amplitude (IA) masks, phase-sensitive (PS) masks, magnitude spectrum (MA) approximations, and phase-sensitive spectrum (PA) approximations, based on work by Erdogan et al. [25], as shown in Table 1.

Table 1. Training targets and their associated loss functions.

    Training Target                              Definition                                     Loss Function
    Ideal Amplitude (IA) Mask                    m_ia(ω,t) = |s(ω,t)| / |y(ω,t)|                Binary Cross-Entropy
    Phase-Sensitive (PS) Mask                    m_ps(ω,t) = cos(θ_{ω,t}) |s(ω,t)| / |y(ω,t)|   Binary Cross-Entropy
    Magnitude Spectrum (MA) Approximation        m_ma(ω,t) = |s(ω,t)|                           Mean-Squared Error
    Phase-Sensitive Spectrum (PA) Approximation  m_pa(ω,t) = cos(θ_{ω,t}) |s(ω,t)|              Mean-Squared Error

The LSTM operates on single-channel recordings. Each channel in the multi-channel recording is processed independently and in parallel by the LSTM, following [26]. In the single-channel setting, the short-time Fourier transform of the recorded noisy signal, y(ω,t), is assumed to be

    y(ω,t) = s(ω,t) + n(ω,t)    (1)

where s(ω,t) is the (possibly reverberant) target speech and n(ω,t) is non-stationary additive noise. For the purposes of defining the targets and cost functions in Table 1, let

    θ_{ω,t} = ∠s(ω,t) − ∠y(ω,t)    (2)

i.e., the phase difference between the target clean spectrogram and the input noisy spectrogram. In each case the network was configured to output a [0, 1]-valued mask m̂(ω,t) for each frame of the input noisy spectrogram.

For the mask targets, the network was trained to minimize the binary cross-entropy loss, while for the spectrum approximation targets the network was trained to minimize the mean-squared error. While in theory phase-sensitive masks may have negative values, causing problems with the cross-entropy loss function, in practice these were rare enough that we simply clipped them to 0.

For each training target type, we explored various hyper-parameter combinations: single or double bi-directional LSTM layers of size 256, 512, 1024, or 2048; merging of the bi-directional forward and backward outputs by summing, multiplying, averaging, or concatenating; and using a sigmoid or hard sigmoid (a piece-wise linear approximation of the sigmoid that is faster to compute) as the output layer activation function. The exploration was done by randomly generating a network from the above 64 combinations and training it until the loss on the development set no longer improved. For each training target type, we report our best configuration in Table 2.

The spectrogram inputs were converted from a linear to a decibel scale, and normalized to mean 0 and variance 1 at each frequency bin. The MESSL binary masks were passed through the logit function. To perform the computation and training of our LSTM neural networks, we used the Keras python library [27], built upon the TensorFlow library [28].

[Fig. 2. An example of the MA-trained Mask-Enhancer network using the noisy spectrogram from channel 1 of utterance F01_050C0103_BUS in the development set and its MESSL mask to produce an enhanced time-frequency mask.]

Figure 2 gives an example of how one of our networks has learned to use the noisy spectrogram to refine a mask produced by MESSL.
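To make the above concrete, here is a minimal sketch of the Table 1 targets and of one mask-enhancer configuration (the IA row of Table 2: a single bi-directional LSTM of size 512 with averaged directions, a hard-sigmoid output, and binary cross-entropy loss). It is written against the current tf.keras API rather than the 2017-era Keras used for the paper, and the feature stacking and helper names are our assumptions, not the exact training code.

```python
# Sketch: Table 1 training targets and one mask-enhancer network (the IA
# configuration from Table 2). Feature stacking and names are illustrative
# assumptions, not the paper's exact code.
import numpy as np
from tensorflow.keras import layers, Model

def training_targets(s, y, eps=1e-8):
    """Compute the Table 1 targets from clean (s) and noisy (y) STFTs."""
    theta = np.angle(s) - np.angle(y)            # phase difference, Eq. (2)
    ia = np.abs(s) / (np.abs(y) + eps)           # ideal amplitude mask
    ps = np.clip(np.cos(theta) * ia, 0.0, None)  # phase-sensitive mask, clipped at 0
    ma = np.abs(s)                               # magnitude spectrum target
    pa = np.cos(theta) * np.abs(s)               # phase-sensitive spectrum target
    return ia, ps, ma, pa

def build_mask_enhancer(n_freq=513, lstm_size=512):
    """Single BLSTM of size 512, directions averaged, hard-sigmoid output."""
    # Input per frame: normalized dB spectrogram stacked with logit(MESSL mask)
    inp = layers.Input(shape=(None, 2 * n_freq))
    x = layers.Bidirectional(
        layers.LSTM(lstm_size, return_sequences=True), merge_mode="ave")(inp)
    out = layers.TimeDistributed(
        layers.Dense(n_freq, activation="hard_sigmoid"))(x)
    model = Model(inp, out)
    # Binary cross-entropy for the mask targets (IA, PS); "mse" for MA, PA
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

def features(noisy_db, messl_mask, eps=1e-6):
    """Stack the normalized dB spectrogram with the logit of the MESSL mask."""
    logit = np.log(messl_mask + eps) - np.log(1.0 - messl_mask + eps)
    return np.concatenate([noisy_db, logit], axis=-1)
```

Training would then pair feature sequences from each channel with the chosen Table 1 target, switching the loss to mean-squared error for the MA and PA targets.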
3.2. Combining the Enhanced Masks

[Fig. 1. Multi-channel spatial clustering-based time-frequency mask enhancement system: the six channels pass through the STFT and the model-based source separation (MESSL), the LSTM mask enhancer refines the per-channel masks, the combined mask drives MVDR beamforming, and the same mask is applied as a post filter to the beamformed spectrogram to produce the enhanced speech.]

A flowchart illustrating the framework of our methods is shown in Figure 1. We extract six spectrograms from the six-channel audio files using the short-time Fourier transform (STFT). The window size is 1024 samples (64 ms at 16 kHz). We then use one of the models described above to enhance the mask produced by MESSL, using the six different channel spectrograms. Those six enhanced masks are combined into one by taking the maximum. We tried different ways of combining the MESSL mask and the LSTM-enhanced mask (average, maximum, minimum, or LSTM output only) into a final mask; a comparison of these combination methods is given in Table 3. We then use this final mask to estimate the noise spatial covariance and perform mask-driven MVDR beamforming. We apply the same mask as a post filter to the corresponding beamformed spectrogram and obtain the enhanced audio using the inverse short-time Fourier transform. This audio is then used to evaluate the quality of the model using the PESQ, SDR, and WER metrics.
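The paper does not spell out the covariance estimation inside the beamformer, so the following numpy sketch fills in one standard realization under stated assumptions: mask-weighted spatial covariance estimates, a steering vector taken as the principal eigenvector of the speech covariance, and the usual MVDR weight formula. The helper names and regularization constants are illustrative.

```python
# Sketch of the mask-driven MVDR + post-filter stage. Y: complex STFTs of
# shape (channels, freq, time); channel_masks: per-channel enhanced masks
# (channels, freq, time); messl_mask: (freq, time).
import numpy as np

def combine_masks(channel_masks, messl_mask, mode="avg"):
    """Max over channels, then combine with the MESSL mask."""
    lstm_mask = channel_masks.max(axis=0)
    if mode == "avg":
        return 0.5 * (lstm_mask + messl_mask)
    if mode == "min":
        return np.minimum(lstm_mask, messl_mask)
    if mode == "max":
        return np.maximum(lstm_mask, messl_mask)
    return lstm_mask  # "lstm": use the LSTM output only

def mvdr(Y, mask, eps=1e-8):
    """Mask-driven MVDR: one weight vector per frequency bin."""
    C, F, T = Y.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[:, f, :]                                    # (C, T)
        m = mask[f]                                        # (T,)
        # Mask-weighted spatial covariances of speech and noise
        Phi_s = (m * Yf) @ Yf.conj().T / (m.sum() + eps)
        Phi_n = ((1 - m) * Yf) @ Yf.conj().T / ((1 - m).sum() + eps)
        # Steering vector: principal eigenvector of the speech covariance
        _, v = np.linalg.eigh(Phi_s)
        d = v[:, -1]
        num = np.linalg.solve(Phi_n + eps * np.eye(C), d)
        w = num / (d.conj() @ num + eps)                   # MVDR weights
        out[f] = w.conj() @ Yf
    return out

def enhance(Y, channel_masks, messl_mask, mode="avg"):
    """Combined mask drives the beamformer and is reused as a post filter."""
    mask = combine_masks(channel_masks, messl_mask, mode)
    return mask * mvdr(Y, mask)  # inverse STFT of this yields the audio
```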
4. EXPERIMENTS AND RESULTS

4.1. The CHiME-3 Corpus
The CHiME-3 corpus features both live and simulated 6-channel single-speaker recordings from 12 different speakers (6 male, 6 female) in 4 different noisy environments: café, street junction, public transport, and pedestrian area. In our work, we used the official data split, with 1600 real noisy utterances in the training set for training and 1640 real noisy utterances in the development set for validation. We did not use the simulated data to train our models. We tested our models on the 2640 utterances in the test set, which contains audio from both real and simulated noisy recordings. In order to perform speech recognition, we used the Kaldi toolkit trained on the AMI corpus, which features 8 microphones recording overlapping speech in meeting rooms. These differences provide an additional challenge, but are essential to evaluating the generalization abilities of our model.
Because the real subset of the CHiME-3 recordings was spoken in a noisy environment, it is not possible to provide a true clean reference signal for it. Instead, an additional microphone was placed close to the talker's mouth to serve as a reference. While this reference has a higher signal-to-noise ratio than the main microphones, it is not noise free. In addition, because it is mounted close to the mouth, it contains sounds that are not desired in a clean output and could actually hurt ASR performance, namely pops, lip smacks, and other mouth noises. In order to obtain a cleaner reference signal, we use the close-mic signal as a frequency-dependent voice activity detector to control the MVDR beamforming of the signals from the array microphones, as described in [24].
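The full supervised MVDR procedure is given in [24]; as a rough sketch of the idea, a frequency-dependent voice-activity mask can be derived from the close mic by thresholding each frequency bin against its own noise floor, and then used as the speech mask in an MVDR routine like the one sketched in Section 3.2. The percentile and margin below are illustrative assumptions, not values from [24].

```python
# Sketch: frequency-dependent voice-activity mask from the close mic.
# The noise-floor percentile and dB margin are assumptions for illustration.
import numpy as np

def close_mic_vad_mask(close_mic_stft, floor_percentile=10, margin_db=6.0):
    power_db = 20 * np.log10(np.abs(close_mic_stft) + 1e-8)
    # Per-frequency noise floor estimated from the quietest frames
    floor = np.percentile(power_db, floor_percentile, axis=1, keepdims=True)
    # Speech is "present" in bins sufficiently above that floor
    return (power_db > floor + margin_db).astype(float)
```

This mask would then play the role of the speech mask driving the MVDR beamformer over the array channels to produce the cleaner reference.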
Table 2. Best hyper-parameter settings found for each training target. Multiple layer sizes indicate multiple layers.

    Training Target Type   Size of LSTM Layer(s)   Bi-direction Merge Mode   Output Activation
    IA                     512                     average                   hard sigmoid
    PS                     (512, 1024)             concatenation             sigmoid
    MA                     512                     average                   hard sigmoid
    PA                     (512, 2048)             concatenation             hard sigmoid

We evaluate the performance of our enhancement system in terms of both speech quality and intelligibility to a speech recognizer. For quality, we use the Signal-to-Distortion Ratio (SDR) from the BSS Eval toolkit [16] and the Perceptual Evaluation of Speech Quality (PESQ) score [29]. PESQ is measured in units of mean opinion score (MOS) between 0 and 5, higher being better. For SDR, we used the source-based (not spatial image-based) scoring. For the simulated data, the reference signals were given by the booth recordings of CHiME-3. For the real data, the reference signals were given by the supervised MVDR for the target speech, and the approximation of the noise signals of the individual microphone channels was computed by subtracting the reference. Since we have no true ground-truth audio for the real dataset, the SDR scores reported on it should be taken with a grain of salt. SDR is measured in decibels, with higher values being better. PESQ is fairly accurate at predicting subjective quality scores for speech enhancement, and has the advantage for CHiME-3 of not requiring a reference for the noise sources. The supervised MVDR signal served as the speech reference for PESQ.

We also evaluate the enhanced speech by Word Error Rate (WER) using a Kaldi automatic speech recognizer trained on the AMI corpus. The training and test sets differ significantly in the number of microphones, array geometry, amount of reverberation, microphone array distance, amount and type of noise, speaking style, and vocabulary [30]. This setup evaluates the performance of our enhancement system in reducing the mismatch between training and testing data. WER is measured in percent, with lower values being better.
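As a sketch of how one utterance might be scored, the following assumes the mir_eval, pesq, and jiwer Python packages as stand-ins for the BSS Eval toolkit, the ITU-T PESQ implementation, and Kaldi's WER scoring; the noise-reference handling here is a simplification of the procedure described above.

```python
# Sketch of scoring one utterance; all signals are 16 kHz waveforms of equal
# length. mir_eval/pesq/jiwer stand in for the tools used in the paper.
import numpy as np
import mir_eval.separation
from pesq import pesq
import jiwer

def score(ref_speech, ref_noise, noisy, estimate, ref_text, hyp_text, fs=16000):
    # Source-based BSS Eval: speech and noise references as the two sources;
    # the residual (noisy - estimate) serves as a crude noise estimate.
    refs = np.stack([ref_speech, ref_noise])
    ests = np.stack([estimate, noisy - estimate])
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(refs, ests)
    mos = pesq(fs, ref_speech, estimate, "wb")  # wideband PESQ, MOS scale
    wer = 100 * jiwer.wer(ref_text, hyp_text)   # WER in percent
    return sdr[0], sir[0], sar[0], mos, wer
```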
As a baseline, we use our own implementation of the method ofSouden et al. [31]. This approach generalizes improved minima-controlled recursive averaging [32] to multichannel signals to esti-mate the speech presence probability. This speech presence prob-ability is then used to estimate the spatial covariance matrix of thenoise, which is used to compute an MVDR beamformer.
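The core of this baseline can be sketched as a recursive, speech-presence-weighted noise covariance tracker that then feeds an MVDR computation like the one in Section 3.2; the smoothing scheme below is our assumption based on the general shape of [31, 32], not their exact estimator.

```python
# Sketch of SPP-driven recursive noise covariance tracking (Souden-style
# baseline). Y: (channels, freq, time) STFT; spp: (freq, time) speech
# presence probability. The smoothing factor is an illustrative assumption.
import numpy as np

def track_noise_covariance(Y, spp, alpha=0.95):
    C, F, T = Y.shape
    Phi_n = np.zeros((F, C, C), dtype=complex)
    for t in range(T):
        y = Y[:, :, t].T                                 # (F, C)
        outer = y[:, :, None] * y[:, None, :].conj()     # per-bin y y^H
        # Update strongly where speech is likely absent (low SPP),
        # freeze where speech is likely present (SPP near 1)
        a = alpha + (1 - alpha) * spp[:, t]
        Phi_n = a[:, None, None] * Phi_n + (1 - a)[:, None, None] * outer
    return Phi_n  # feeds the MVDR weight computation as before
```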
As detailed in section 3.1, we explored various architectures for each training target. We report the best architecture configurations in Table 2, as measured by the loss on the CHiME-3 dev set.

As detailed in section 3.2, we then tried various methods of combining the enhanced masks from the best network for each training target with the MESSL masks. As shown in Table 3, we found that averaging the enhanced masks given by the model trained on ideal-amplitude targets with the MESSL mask produced the best results on the dev set.

Finally, we fully evaluated our best model (IA, Avg) using the PESQ, SDR, and WER metrics. The comparison to the baselines is shown in Tables 4 and 5. Compared to the method of Souden et al. [31], our mask-enhancer method achieves better scores across all three metrics on both the dev and test sets, over both the real and simulated data, with one exception: the SDR score on the simulated test set. Compared to MESSL, our method improves the PESQ scores on the dev and test sets, while achieving similar WER scores.
Table 3. Comparison of WER for different methods of combining the enhanced mask with the MESSL mask, using the best performing model for each training target (see Table 2).

              Avg    Min    Max    LSTM
    IA   Dev
Table 4. Results of PESQ and SDR comparing the best performing system from Table 3 with several baselines, over the simulated portion of the CHiME-3 dev and test data.

    (SIMU data)       PESQ (MOS)       SDR (dB)
                      Dev     Test     Dev     Test
    MESSL [12]        3.18    3.10     6.00    2.38
    Souden [31]       2.31    2.44     3.92    4.35
    Mask Enhancer
Table 5. Results of all metrics comparing the best performing system from Table 3 with several baselines, over the real portion of the CHiME-3 dev and test data.

    (REAL data)       PESQ (MOS)       SDR (dB)         WER (%)
                      Dev     Test     Dev     Test     Dev     Test
    MESSL [12]        2.65    2.37     6.42    5.38     19.7    32.3
    Souden [31]       2.14    2.05     3.21    2.37     37.4    52.3
    Mask Enhancer
5. CONCLUSION AND FUTURE WORK
In this paper we propose a novel method to adapt parallel single-channel LSTM-based enhancement to multi-channel audio, combining the speech-signal modeling power of the LSTM neural network with the spatial clustering power of MESSL, further enhancing the audio. We show that this method can help MESSL improve the quality of the audio, with similar intelligibility.

Our future work will continue to explore different ways of integrating the LSTM speech-signal model with MESSL. Preliminary results show that the spatial information is more valuable than the single-channel speech information with respect to WER. To further test this hypothesis, the next step is to integrate a mask-cleaning LSTM model into each loop of MESSL's EM algorithm, i.e., to use the mask enhancer model to clean the MESSL masks before the estimation of the spatial parameters.
6. ACKNOWLEDGEMENTS
This material is based upon work supported by the Alfred P. Sloan Foundation and the National Science Foundation (NSF) under Grant No. IIS-1409431. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

7. REFERENCES

[1] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, "Achieving human parity in conversational speech recognition," Tech. Rep., February 2017.
[2] Michael I. Mandel, Shoko Araki, and Tomohiro Nakatani, "Multichannel clustering and classification approaches," in Audio Source Separation and Speech Enhancement, Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot, Eds., chapter 12. Wiley, 2017. To appear.
[3] Michael Brandstein and Darren Ward, Microphone Arrays: Signal Processing Techniques and Applications, Springer Science & Business Media, 2013.
[4] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, May 2013, pp. 6645–6649.
[6] Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.
[7] Xiong Xiao, Shinji Watanabe, Hakan Erdogan, Liang Lu, John Hershey, Michael L. Seltzer, Guoguo Chen, Yu Zhang, Michael Mandel, and Dong Yu, "Deep beamforming networks for multi-channel speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Mar. 2016, pp. 5745–5749.
[8] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," 2015.
[9] Jean Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus," Language Resources and Evaluation, vol. 41, no. 2, pp. 181–190, 2007.
[10] S. Renals, T. Hain, and H. Bourlard, "Recognition and understanding of meetings: the AMI and AMIDA projects," in IEEE Workshop on Automatic Speech Recognition and Understanding, Dec. 2007, pp. 238–247.
[11] Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 504–511.
[12] M. I. Mandel, R. J. Weiss, and D. P. W. Ellis, "Model-based expectation-maximization source separation and localization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 382–394, Feb. 2010.
[13] Hiroshi Sawada, Shoko Araki, and Shoji Makino, "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516–527, 2011.
[14] Deblin Bagchi, Michael I. Mandel, Zhongqiu Wang, Yanzhang He, Andrew Plummer, and Eric Fosler-Lussier, "Combining spectral feature mapping and multi-channel model-based source separation for noise-robust automatic speech recognition," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 2015.
[15] Antony W. Rix, John G. Beerends, Michael P. Hollier, and Andries P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Acoustics, Speech, and Signal Processing (ICASSP), 2001 IEEE International Conference on. IEEE, 2001, vol. 2, pp. 749–752.
[16] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[17] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011. IEEE Catalog No.: CFP11SRW-USB.
[18] Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652–1664, 2016.
[19] Lukas Pfeifenberger, Matthias Zöhrer, and Franz Pernkopf, "DNN-based speech mask estimation for eigenvector beamforming," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 2017.
[20] Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 196–200.
[21] Jahn Heymann, Lukas Drude, Christoph Boeddeker, Patrick Hanebrink, and Reinhold Haeb-Umbach, "BeamNet: End-to-end training of a beamformer-supported multi-channel ASR system," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 2017.
[22] Christoph Boeddeker, Patrick Hanebrink, Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach, "Optimizing neural-network supported acoustic beamforming by algorithmic differentiation," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 2017.
[23] Michael I. Mandel and Jon P. Barker, "Multichannel spatial clustering for robust far-field automatic speech recognition in mismatched conditions," in Proceedings of Interspeech, 2016, pp. 1991–1995.
[24] Felix Grezes, Zhaoheng Ni, Viet Anh Trinh, and Michael Mandel, "Combining spatial clustering with LSTM speech models for multichannel speech enhancement," in Interspeech, 2017. Submitted.
[25] Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 708–712.
[26] Hakan Erdogan, John R. Hershey, Shinji Watanabe, Michael Mandel, and Jonathan Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," in Proc. Interspeech, 2016.
[27] François Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[28] Martín Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
[29] Philipos C. Loizou, Speech Enhancement: Theory and Practice (Signal Processing and Communications), CRC Press, 2007.
[30] Michael I. Mandel and Jon P. Barker, "Multichannel spatial clustering for robust far-field automatic speech recognition in mismatched conditions," Interspeech 2016, pp. 1991–1995, 2016.
[31] Mehrez Souden, Jingdong Chen, Jacob Benesty, and Sofiène Affes, "An integrated solution for online multichannel noise tracking and reduction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2159–2169, 2011.
[32] Israel Cohen and Baruch Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12–15, 2002.