Improved Speaker-Dependent Separation for CHiME-5 Challenge
Jian Wu∗, Yong Xu, Shi-Xiong Zhang, Lian-Wu Chen, Meng Yu, Lei Xie, Dong Yu

School of Computer Science, Northwestern Polytechnical University, Xi'an, China
Tencent AI Lab, Shenzhen, China
Tencent AI Lab, Bellevue, USA
{jianwu,lxie}@nwpu-aslp.org, {lucayongxu,auszhang,lianwuchen,raymondmyu,dyu}@tencent.com

∗ This work was done when the first author was an intern in Tencent AI Lab.

Abstract
This paper summarizes several follow-up contributions for improving our submitted NWPU speaker-dependent system for the CHiME-5 challenge, which aims to solve the problem of multi-channel, highly overlapped conversational speech recognition in a dinner party scenario with reverberation and non-stationary noises. We adopt a speaker-aware training method by using the i-vector as the target speaker information for multi-talker speech separation. With only one unified separation model for all speakers, we achieve a 10% absolute improvement in terms of word error rate (WER) over the previous baseline of 80.28% on the development set by leveraging our newly proposed data processing techniques and beamforming approach. With our improved back-end acoustic model, we further reduce the WER to 60.15%, which surpasses the result of our submitted CHiME-5 challenge system without applying any fusion techniques.
Index Terms: CHiME-5 challenge, speaker-dependent speech separation, robust speech recognition, speech enhancement, beamforming
1. Introduction
With the recent progress in front-end audio processing and acoustic and language modeling, automatic speech recognition (ASR) techniques are widely deployed in our daily life. However, the performance of ASR degrades severely in challenging acoustic environments (e.g., overlapped, noisy, reverberated speech), mainly due to complicated acoustic conditions unseen in training. Many previous works on acoustic robustness focused on one aspect, e.g., speech separation [1, 2, 3, 4], enhancement [5, 6, 7, 8, 9], dereverberation [10, 11, 12], etc. Those experiments were conducted on simulated data, which is not realistic for real applications. The recently released CHiME-5 challenge [13] provides a large-scale multi-speaker conversational corpus recorded via Microsoft Kinect in real home environments and targets the problem of distant multi-microphone conversational speech recognition. As the recordings are heavily overlapped among multiple speakers and corrupted by reverberation and background noises, WERs reported on the dataset are fairly high. In this paper, we make several efforts based on our previously submitted speaker-dependent system [14], which ranked 3rd under the unconstrained LM and 5th under the constrained LM for the single-device track, respectively.

The difficulties of CHiME-5 are three-fold. First, the natural conversation contains casual content, sometimes occupied by laughing and coughing. Speaker interference is common in conversational speech as well, which causes degradation in speech recognition. Second, hardware devices, far-field wave propagation and ambient noises cause audio clipping, signal attenuation and noise corruption, respectively. Furthermore, the lack of clean speech for supervised training greatly limits the algorithm design, and external datasets are not allowed according to the rules of CHiME-5. Considering these aspects, robust front-end processing for target speaker enhancement is critical for improving the ASR performance.

Recent studies have made great efforts in multi-channel speech enhancement [7, 8, 9, 15], and most of them estimate time-frequency (TF) masks that encode the speech or noise dominance in each TF unit. Deep learning based beamforming has become the most popular approach since the CHiME-3 and CHiME-4 challenges [16], depending on the accurate estimation of speech covariance matrices. However, in the CHiME-5 challenge, it is difficult to train a speech enhancement mask estimator and obtain accurate predictions due to the lack of the oracle clean data required for supervised training. On the other hand, there are many limitations on applying recently proposed monaural blind speech separation methods, e.g., DPCL [1] and uPIT [2], because speaker tracking is necessary due to the permutation issue. The number of speakers is also a prerequisite for monaural speech separation approaches, which is infeasible in the CHiME-5 challenge. However, considering that the target speaker ID is given for each utterance, we tried speaker-dependent (SD) separation in [14], and Du et al. used a speaker-dependent system along with a two-stage separation method in [17].

In this paper, we focus on the single-array track and achieve significant improvement with the following contributions. First, we process the data with GWPE [18], CGMM [8, 19] and OMLSA [20] to further remove the interference in the non-overlapped data segments, which are used as the training targets of the SD models.
In [14], suffering from low-quality training targets, the system achieved only a 2% absolute reduction in WER. Second, inspired by [21, 22, 23], we incorporate i-vectors as auxiliary features, which aim at extracting the target speaker. With this speaker-aware training technique, we achieve much better results using only one mask estimation model. Third, we investigate the beamforming performance and observe that, with more accurate speaker masks, the generalized eigenvalue (GEV) [24] beamformer performs better than the minimum variance distortionless response (MVDR) [25] beamformer. Finally, we report a 10% absolute WER reduction on the development set and 20% with our improved acoustic model, which is based on the factored form of the time-delay neural network (TDNN-F) [26]. Compared with the single systems submitted for CHiME-5, our proposed system outperforms most of them. And compared to [17], where a set of separation models were trained and a two-stage separation is performed, our SD method has apparently lower computational complexity.

Figure 1: Flow chart of data processing and simulation
2. Proposed System
In this section, we will discuss the data processing, speaker-aware training and beamforming used in our new system and indicate how we boost the previously submitted speaker-dependent front-end.
2.1. Data Processing

In order to simulate training data for the speaker-dependent models, we use non-overlapped utterances as references, which can be segmented according to the provided annotations. However, those segments are not guaranteed to have a high signal-to-noise ratio (SNR) and may contain strong background noise, especially in the kitchen. These issues can lead to inaccurate training targets (e.g., IRM), which may result in slow convergence and poor performance of the separation model. In order to further remove noise in those segments and improve the quality of the training targets, we utilize a complex Gaussian mixture model (CGMM) to estimate speech masks in an unsupervised manner and perform MVDR beamforming to suppress the background noise. Following the suggestions from [27], GWPE is applied on the multi-channel signals to reduce potential reverberation before beamforming, which is also shown to benefit ASR performance in the following experiments.

We use a two-component CGMM, i.e., speech and noise, and TF-masks are computed as the posterior

$$\lambda^{k}_{t,f} = \frac{p(\mathbf{y}_{t,f} \mid \Theta^{k})}{\sum_{c} p(\mathbf{y}_{t,f} \mid \Theta^{c})}, \quad k \in \{n, s\}, \qquad (1)$$

where $p(\mathbf{y}_{t,f} \mid \Theta^{k}) = \mathcal{N}(\mathbf{y}_{t,f} \mid \mathbf{0}, \phi^{k}_{t,f} \mathbf{R}^{k}_{f})$. Following [7, 19], the speech and noise covariance matrices are estimated via

$$\Phi^{k}_{f} = \frac{1}{\sum_{t} \lambda^{k}_{t,f}} \sum_{t} \lambda^{k}_{t,f} \, \mathbf{y}_{t,f} \mathbf{y}^{\mathsf{H}}_{t,f}, \quad k \in \{n, s\}, \qquad (2)$$

where $(\cdot)^{\mathsf{H}}$ denotes the conjugate transpose. For MVDR beamforming, the steering vector $\mathbf{d}_{f}$ at each frequency is required, and the principal eigenvector of $\Phi^{s}_{f}$ is an ideal estimate based on the fact that the covariance matrix of the directional target is close to a rank-one matrix. With $\Phi^{n}_{f}$ and $\mathbf{d}_{f}$, the weights of MVDR are computed as

$$\mathbf{w}^{\mathrm{MVDR}}_{f} = \frac{(\Phi^{n}_{f})^{-1} \mathbf{d}_{f}}{\mathbf{d}^{\mathsf{H}}_{f} (\Phi^{n}_{f})^{-1} \mathbf{d}_{f}}. \qquad (3)$$

Considering that the enhanced speech obtained by beamforming always contains residual noise, we continue with single-channel denoising. One typical statistical method is OMLSA [20], which was proposed for single-channel robust speech enhancement. Although it may introduce speech distortion, it reduces the background noise and keeps the TF regions of speech with higher energy, further improving the accuracy of the target mask computation, especially in noise-dominant TF bins. As shown in Fig. 1, with those processed non-overlapped segments as reference (clean) data, we perform data simulation, mask computation, network training, etc., in the following steps.
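As a concrete illustration of Eqs. (1)–(3), the NumPy sketch below builds the spatial covariance matrices from given TF-masks and derives the MVDR weights; the masks themselves would come from the CGMM EM iterations (or any other estimator). Function and variable names are ours, and the diagonal loading is a numerical safeguard rather than something specified in the paper.

```python
import numpy as np

def covariance(masks, Y):
    """Eq. (2): mask-weighted spatial covariance.
    masks: (T, F) TF-mask, Y: (T, F, M) multi-channel STFT."""
    # weighted outer products y y^H, summed over time
    num = np.einsum('tf,tfm,tfn->fmn', masks, Y, Y.conj())
    den = masks.sum(axis=0)[:, None, None] + 1e-8
    return num / den                                    # (F, M, M)

def mvdr_weights(phi_s, phi_n):
    """Eq. (3): MVDR weights, with the steering vector taken as the
    principal eigenvector of the speech covariance matrix."""
    F, M, _ = phi_s.shape
    w = np.zeros((F, M), dtype=np.complex128)
    for f in range(F):
        # principal eigenvector of Phi^s_f as steering vector d_f
        _, eigvec = np.linalg.eigh(phi_s[f])
        d = eigvec[:, -1]
        phi_n_inv_d = np.linalg.solve(phi_n[f] + 1e-6 * np.eye(M), d)
        w[f] = phi_n_inv_d / (d.conj() @ phi_n_inv_d)
    return w                                            # (F, M)

def apply_beamformer(w, Y):
    """Apply the filter w_f^H y_{t,f} to obtain the single-channel output."""
    return np.einsum('fm,tfm->tf', w.conj(), Y)
```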
2.2. Speaker-Aware Training

Some recent blind speech separation methods need to know the number of speakers in the mixture and cannot assign the outputs to specific speakers properly, so they are not suitable for the CHiME-5 challenge, which requires recognizing the speech of the target speaker in each given utterance. Under such circumstances, there are two optional methods for the front-end separation system. One is to make use of speaker information to condition the speech separation, similar to [22]. Another is to train a separate model for each known speaker, like the one we used in [14] and also in [17]. In fact, the first one is more applicable to real scenarios because it can generalize to unseen speakers if the model is well trained, and it avoids the permutation problem at the same time.

Our motivation is to use i-vectors as speaker features to bias the prediction of the target masks. We tried two typical TF-masks, i.e., IRM and PSM, which are defined as

$$m^{\mathrm{IRM}} = \frac{|s_t|}{|s_t| + |n|}, \quad m^{\mathrm{PSM}} = \frac{|s_t| \cos(\angle y - \angle s_t)}{|y|}, \qquad (4)$$

where $y$, $s_t$, $n$ are the short-time Fourier transforms (STFT) of the mixture, the target speaker and the noise component, respectively, which satisfy $y = s_t + n$. When simulating the training data, we mix the target speaker with background noise as well as one or two interfering speakers at various SNRs. Considering that the PSM is unbounded and may be negative, we truncate its value between 0 and 1. The neural networks are trained by minimizing the mean square error

$$\mathcal{L}_{\mathrm{MSE}} = \| \hat{m} - m_t \|^{2}. \qquad (5)$$

In the training stage, for a given noisy utterance and a specific speaker, the i-vector is computed on a randomly selected segment from the target speaker's non-overlapped set, similar to [22]. During testing, we use the average over those segments instead of a random one to get a robust and stable prediction.

2.3. Beamforming

A beamformer is a linear spatial filter applied to the microphone signals, which suppresses energy from non-target directions and produces an enhanced output. In the frequency domain it can be described as

$$s_{t,f} = \mathbf{w}^{\mathsf{H}}_{f} \mathbf{y}_{t,f}, \qquad (6)$$

where $\mathbf{w}_{f}$ is a complex-valued vector at frequency $f$. In Section 2.1 we introduced MVDR beamforming, which is a special case of the parameterized multi-channel Wiener filter (PMWF)

$$\mathbf{w}^{\mathrm{PMWF}\text{-}\beta}_{f} = \frac{(\Phi^{n}_{f})^{-1} \Phi^{s}_{f}}{\beta + \mathrm{tr}\big[(\Phi^{n}_{f})^{-1} \Phi^{s}_{f}\big]} \mathbf{u}_{r} \qquad (7)$$

with $\beta = 0$. $\mathbf{u}_{r}$ is a one-hot vector indicating the reference microphone, which can be manually specified or chosen by estimating the posterior SNR [28]. When $\beta = 1$, it equals the multi-channel Wiener filter (MCWF), another widely used beamformer in signal processing.

In [7], the GEV beamformer, which is obtained with the Max-SNR criterion and avoids matrix inversion in the computation, provides better results than MVDR. The beamforming filter is designed to maximize the expected SNR at each frequency:

$$\mathbf{w}^{\mathrm{GEV}}_{f} = \arg\max_{\mathbf{w}} \frac{\mathbf{w}^{\mathsf{H}} \Phi^{s}_{f} \mathbf{w}}{\mathbf{w}^{\mathsf{H}} \Phi^{n}_{f} \mathbf{w}}, \qquad (8)$$

which can be solved by forming a generalized eigenvalue problem with $\Phi^{s}_{f}$ and $\Phi^{n}_{f}$. To produce a distortionless speech signal at the beamformer output, [24] also provides several post-filtering algorithms to normalize the GEV coefficients. In our experiments, we adopt Blind Analytical Normalization (BAN) by default.
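Under the same assumptions (covariance matrices per Eq. (2) already available), a minimal sketch of the PMWF of Eq. (7) and the GEV beamformer of Eq. (8) could look as follows; reference-channel selection and the BAN post-filter are omitted, and the helper names are ours.

```python
import numpy as np
from scipy.linalg import eigh

def pmwf_weights(phi_s, phi_n, beta=0.0, ref=0):
    """Eq. (7): parameterized multi-channel Wiener filter.
    beta=0 gives MVDR, beta=1 gives the multi-channel Wiener filter."""
    F, M, _ = phi_s.shape
    w = np.zeros((F, M), dtype=np.complex128)
    u = np.zeros(M)
    u[ref] = 1.0                              # one-hot reference microphone u_r
    for f in range(F):
        num = np.linalg.solve(phi_n[f] + 1e-6 * np.eye(M), phi_s[f])
        w[f] = (num @ u) / (beta + np.trace(num))
    return w

def gev_weights(phi_s, phi_n):
    """Eq. (8): Max-SNR (GEV) beamformer, i.e. the principal generalized
    eigenvector of the pencil (Phi_s, Phi_n) at each frequency."""
    F, M, _ = phi_s.shape
    w = np.zeros((F, M), dtype=np.complex128)
    for f in range(F):
        # scipy solves the generalized problem Phi_s v = lambda Phi_n v
        _, vecs = eigh(phi_s[f], phi_n[f] + 1e-6 * np.eye(M))
        w[f] = vecs[:, -1]                    # eigenvector of the largest eigenvalue
    return w
```

As the text notes, the GEV solution is only defined up to a per-frequency scaling, which is why a post-filter such as BAN is applied in [24] before using it.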
Table 1: The description of the training data

Data ID   Description                       Duration
2         100k far-field (cleaned + sp)     39h ×3
3         reverberate on 1                  64h ×3
4         100k far-field (cgmm + mvdr)      35h ×3
5         100k far-field (gwpe, ch1)        35h

Table 2: Performance of different acoustic models
Structure                   Data       WER%
baseline 9-TDNN             1+2        80.28
baseline 9-TDNN             1+2+3      79.13
9-TDNN + 1 BLSTM            1+2+3      77.15
12-TDNN-F                   1+2+3      70.02
5 CNN + 9-TDNN-F            1+2+3      68.72
5 CNN + 9-TDNN-F            1+2+3+4
Original submission [14]    -          70.49
3. Experiments
The performance of the acoustic models we tuned on the development data is given in Table 2, with the description of the training data listed in Table 1. All models are trained with the lattice-free maximum mutual information (LF-MMI) criterion [29] via the Kaldi [30] toolkit. Mel-frequency cepstral coefficients (MFCCs) and online i-vectors are adopted as input features. In addition to the training data used in the official baseline (1+2), we include reverberated data (3) and enhanced data (4) processed by GWPE (https://github.com/funcwj/setk/blob/master/scripts/run_gwpe.sh) and CGMM-MVDR (https://github.com/funcwj/setk/blob/master/scripts/run_cgmm.sh). To simulate the reverberated audio samples in 3, we take the room impulse response (RIR) dataset released in [31] but only use the small-room portion because it has a room size similar to CHiME-5.

Our best configuration follows the successful practice of the CNN-TDNN-F structure in our original system [14]. As can be seen in Table 2, our boosted version of the TDNN-F acoustic model brings a 12% absolute WER reduction compared to the official TDNN, which also surpasses our previously submitted result. In the following sections, we will mainly focus on the performance of the front-end and evaluate the results with our own acoustic model (see Table 4).

The non-overlapped segments of each speaker we used are listed in Table 3, and short segments (less than 2 s) are discarded. The noise files come from non-speech intervals, and an energy-based VAD is used to filter out possible silence segments. Based on those processed segments and the background noise files, we simulate the data for speaker-dependent model training as depicted in Fig. 1.
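The energy-based VAD used to drop silent portions of the noise recordings can be very simple; the sketch below is one plausible variant (frame sizes and the threshold are our own choices, not values reported in the paper).

```python
import numpy as np

def energy_vad(wave, sr, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Flag frames whose energy is within `threshold_db` of the loudest frame;
    used here to discard silent parts of the background-noise recordings."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(wave) - frame) // hop)
    energy = np.array([np.mean(wave[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)]) + 1e-12
    log_e = 10.0 * np.log10(energy)
    return log_e > (log_e.max() + threshold_db)   # boolean per-frame decision
```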
Figure 2: A beamforming example and predicted target speaker masks (from top to bottom: the original CH1 spectrogram, the estimated mask, and the enhancement result; vertical axes in kHz). Although the interference speaker occupies most of the time in the utterance, the estimation of the target TF-mask is accurate. The last row plots the spectrogram of the enhancement output, where the interference speech is well suppressed.
Table 3: The number of non-overlapped segments per speaker used in data simulation on the development set
P05 P06 P07 P08 P25 P26 P27 P28
161  251  132  108  121  78   92   169

To demonstrate the effectiveness of the data processing discussed in Section 2.1, we first evaluate the ASR performance of GWPE followed by CGMM-MVDR. As can be seen from Sys-5 in Table 4, compared to the original CH1 (Sys-1), the data processing step brings a 4% absolute WER reduction.

To evaluate the necessity of conducting single-channel denoising for each speaker, we simulate two sets of data, denoted SD_A (without) and SD_B (with), respectively. For each set, we mix the target speaker with 1 or 2 interfering speakers as well as background noise randomly, with SDR between 0 and 10 dB and SNR between -5 and 10 dB. We adopt a 2×TDNN-3×BLSTM structure with a sigmoid output layer to estimate the speaker masks and use the IRM as training targets. 513-dimensional log power spectrogram features are extracted as input, with utterance-level CMVN applied.

In Table 4, we can see that with the processing steps GWPE and CGMM-MVDR, SD_A gives a 4% absolute WER reduction compared to the official baseline (Sys-3), and including OMLSA as a further step yields better results, as shown in Sys-8. Both models surpass our previous results without the data processing steps. To illustrate the necessity of speaker separation, we also train a denoising network (DN) in Sys-6 for comparison, which only predicts masks of speech instead of the target speaker. Without the target information in the estimated masks, DN only produces a result similar to CGMM, which inspires us to focus on separation rather than enhancement or denoising in the CHiME-5 challenge.
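To make the mixture simulation described above concrete, here is a rough sketch of how one overlapped training example and its mask targets (Eq. (4)) could be generated; the SDR/SNR ranges follow the text, while the scaling helper, array handling and epsilon values are our own simplifications.

```python
import numpy as np

def scale_to_ratio(target, source, ratio_db):
    """Scale `source` so that 10*log10(P_target / P_source) equals ratio_db."""
    p_t = np.mean(target ** 2)
    p_s = np.mean(source ** 2) + 1e-12
    return source * np.sqrt(p_t / (p_s * 10 ** (ratio_db / 10)))

def simulate_mixture(target, interferers, noise, rng):
    """Mix a target segment with 1-2 interfering speakers (SDR in [0, 10] dB)
    and background noise (SNR in [-5, 10] dB). All signals are assumed to be
    trimmed to the same length."""
    mix = target.copy()
    for spk in interferers:
        mix += scale_to_ratio(target, spk, rng.uniform(0, 10))
    mix += scale_to_ratio(target, noise, rng.uniform(-5, 10))
    return mix

def mask_targets(stft_target, stft_mix):
    """Eq. (4): IRM and truncated PSM for the target speaker."""
    stft_other = stft_mix - stft_target            # interference + noise component
    irm = np.abs(stft_target) / (np.abs(stft_target) + np.abs(stft_other) + 1e-8)
    psm = np.abs(stft_target) * np.cos(np.angle(stft_mix) - np.angle(stft_target)) \
          / (np.abs(stft_mix) + 1e-8)
    return irm, np.clip(psm, 0.0, 1.0)             # PSM truncated to [0, 1]
```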
Table 4: WER (%) of each speaker on the development set with the CNN-TDNN-F acoustic model

Sys  Input  Mask   #Models  Beamformer  per-speaker WER                                                  Avg
3    -      -      -        WDS         72.13  66.09  69.20  78.10  67.65  79.93  68.02  50.78           68.72
4    CH1-4  CGMM   -        MVDR        70.74  61.08  67.36  78.76  66.03  79.88  67.50  49.88           66.91
5    GWPE   CGMM   -        MVDR        70.59  61.15  67.08  78.25  63.94  78.85  66.34  49.69           66.36
6    GWPE   DN     1        MVDR        70.74  63.15  68.12  78.63  63.89  79.34  64.63  49.06           66.81
12   GWPE   SA++   1        MVDR        66.07  58.89  63.60  69.69  60.20  74.79  63.45  46.18           62.41
13   GWPE   SA++   1        PMWF-1      65.12  58.03  63.46  69.66  60.97  76.12  64.28  46.09           62.31
14   GWPE   SA++   1        GEV         62.45  58.11  61.60  61.64  57.99  69.90  65.70  46.54
Our motivation was to train a speaker-independent target separation network, which takes the target speaker's embedding as an auxiliary input and outputs the mask estimate for that speaker. Unfortunately, the model trained on the training set could not exceed the results mentioned above. We therefore apply the idea of speaker-aware training to our speaker-dependent models, as discussed in Section 2.2. In our experiments, we adopt the same network structure as the SD models, but concatenate i-vectors to the second TDNN layer and the following BLSTM layer to bias the prediction of the target masks (see the sketch below). The i-vectors used here are extracted from the non-overlapped segments described above. During the test stage, we average the i-vectors over those utterances and get one fixed embedding for each speaker. From Table 4, SA gives a similar result to SD_B with MVDR beamforming, but yields a significant improvement with the GEV beamformer, which brings a notable WER reduction on speakers P25-P28. Compared to MVDR, the GEV beamformer is more sensitive to the TF-masks and may distort the target speech and degrade ASR performance seriously if the mask is estimated inaccurately.

Based on SA, we utilize two strategies to further improve the performance of the speaker-aware separation and denote the result as SA++ in the table. The first is to initialize the network with a model pre-trained on the training data, considering that the number of speakers and non-overlapped segments on the development set is quite limited. The other is to replace the IRM with the truncated PSM, which has been proved effective in monaural speech enhancement. We give an example in Fig. 2. Row 2 shows the output masks of SA++ given the log power spectrogram of the mixture in row 1, which masks out the interfering speakers very well (more enhancement samples are available at https://funcwj.github.io/online-demo/page/chime5). We also compared GEV with other forms of beamforming (e.g., MCWF and MVDR in [28]), but no better results were achieved.

Table 5 compares the proposed system with the performance of other teams, under the condition that no system combination is used. We get a 20% absolute WER reduction in total compared to the official result and outperform most of the other teams. Although it is inferior to USTC-iFlytek's, our system performs separation only once and has apparently lower computational complexity and model size. The details for each session and location with the official AM and ours are given in Table 6. Even based on the official back-end, our SD separation front-end contributes a 10% WER reduction, which is a significant improvement on this challenging task.
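The speaker-aware mask estimator referenced above could look roughly like the following PyTorch sketch: TDNN-style 1-D convolutions and a BLSTM stack, with the utterance-level i-vector concatenated to the frame-level features before the second TDNN layer and the BLSTM input, and a sigmoid output per frequency bin. Layer sizes, the i-vector dimension and all names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SpeakerAwareMaskNet(nn.Module):
    """2 x TDNN + 3 x BLSTM mask estimator biased by a speaker i-vector."""
    def __init__(self, num_bins=513, ivector_dim=100, hidden=512):
        super().__init__()
        # TDNN layers realized as 1-D convolutions over time
        self.tdnn1 = nn.Conv1d(num_bins, hidden, kernel_size=5, padding=2)
        # i-vector is concatenated to every frame before the second TDNN layer
        self.tdnn2 = nn.Conv1d(hidden + ivector_dim, hidden, kernel_size=3, padding=1)
        self.blstm = nn.LSTM(hidden + ivector_dim, hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_bins)

    def forward(self, log_spec, ivector):
        # log_spec: (B, T, num_bins), ivector: (B, ivector_dim)
        x = torch.relu(self.tdnn1(log_spec.transpose(1, 2)))     # (B, H, T)
        spk = ivector.unsqueeze(-1).expand(-1, -1, x.size(-1))   # (B, D, T)
        x = torch.relu(self.tdnn2(torch.cat([x, spk], dim=1)))   # (B, H, T)
        x = torch.cat([x, spk], dim=1).transpose(1, 2)           # (B, T, H+D)
        x, _ = self.blstm(x)
        return torch.sigmoid(self.out(x))                        # mask in (0, 1)
```

Such a network would be trained with the MSE loss of Eq. (5) against the IRM or truncated PSM targets.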
Table 5: Single system comparison with other teams
Team                   WER (%)
USTC-iFlytek [17]      57.10
Ours                   60.16
JHU [32]               62.09
Toshiba [33]           63.30
STC [23]               63.30
RWTH-Paderborn [34]    68.40
Official [13]          80.28

Table 6: WER (%) summary on the official & our AM
AM Sess Din Kit Liv Avg Total
Baseline  S02   70.82  79.79  62.11  70.26  70.46
          S09   74.58  71.29  67.38  68.06
Ours      S02   61.58  70.70  52.58  60.61
          S09   63.24  59.21  56.22  59.21
4. Conclusions
In this work, we continue to optimize the performance of our speaker-dependent separation system submitted to the CHiME-5 challenge. We utilize multi-channel dereverberation and enhancement algorithms, followed by single-channel denoising, to improve the quality of the training targets. To tackle the data scarcity problem in CHiME-5, we apply the idea of speaker-aware training to our speaker-dependent models and reduce the number of front-end models to one, while bringing significant ASR improvement. Experiments show that with well-tuned beamforming, our system improves the ASR performance from the 80.28% official baseline to 70.46% in terms of WER. And with our own acoustic back-end, our system achieves 60.16% WER on the development set, without using any fusion techniques.

5. References

[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in ICASSP. IEEE, 2016, pp. 31–35.
[2] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.
[3] Y. Luo, Z. Chen, and N. Mesgarani, "Speaker-independent speech separation with deep attractor network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, 2018.
[4] Z.-Q. Wang and D. Wang, "Combining spectral and spatial features for deep learning based blind speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 457–468, 2019.
[5] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[6] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 3, pp. 483–492, 2016.
[7] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP. IEEE, 2016, pp. 196–200.
[8] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, "Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780–793, 2017.
[9] Z.-Q. Wang and D. Wang, "On spatial features for supervised speech separation and its application to beamforming and robust ASR," in ICASSP. IEEE, 2018, pp. 5709–5713.
[10] K. Kinoshita, M. Delcroix, S. Gannot, E. A. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj et al., "A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research," EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 7, 2016.
[11] B. Wu, K. Li, M. Yang, and C.-H. Lee, "A reverberation-time-aware approach to speech dereverberation based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 102–111, 2017.
[12] D. S. Williamson and D. Wang, "Time-frequency masking in the complex domain for speech dereverberation and denoising," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1492–1501, 2017.
[13] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," arXiv preprint arXiv:1803.10609, 2018.
[14] Z. Zhao, J. Wu, and L. Xie, "The NWPU system for CHiME-5 challenge," in Proc. CHiME 2018 Workshop on Speech Processing in Everyday Environments, 2018, pp. 16–18.
[15] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, "A consolidated perspective on multimicrophone speech enhancement and source separation," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 4, pp. 692–730, 2017.
[16] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Analysis and outcomes," Computer Speech & Language, vol. 46, pp. 605–626, 2017.
[17] J. Du, Y.-H. Tu, L. Sun, F. Ma, H.-K. Wang, J. Pan, C. Liu, J.-D. Chen, and C.-H. Lee, "The USTC-iFlytek system for CHiME-4 challenge," pp. 36–38, 2016.
[18] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707–2720, 2012.
[19] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," in ICASSP. IEEE, 2016, pp. 5210–5214.
[20] I. Cohen and B. Berdugo, "Speech enhancement for non-stationary noise environments," Signal Processing, vol. 81, no. 11, pp. 2403–2418, 2001.
[21] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, "Speaker-aware neural network based beamformer for speaker extraction in speech mixtures," in Interspeech, 2017, pp. 2655–2659.
[22] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," arXiv preprint arXiv:1810.04826, 2018.
[23] I. Medennikov, I. Sorokin, A. Romanenko, D. Popov, Y. Khokhlov, T. Prisyach, N. Malkovskii, V. Bataev, S. Astapov, M. Korenevsky et al., "The STC system for the CHiME 2018 challenge," in The 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), Interspeech, 2018.
[24] E. Warsitz and R. Haeb-Umbach, "Blind acoustic beamforming based on generalized eigenvalue decomposition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1529–1539, 2007.
[25] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Springer Science & Business Media, 2008, vol. 1.
[26] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohamadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Interspeech, 2018.
[27] L. Drude, C. Boeddeker, J. Heymann, R. Haeb-Umbach, K. Kinoshita, M. Delcroix, and T. Nakatani, "Integrating neural network based beamforming and weighted prediction error dereverberation," in Interspeech, 2018.
[28] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," in Interspeech, 2016, pp. 1981–1985.
[29] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar et al., "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, 2016, pp. 2751–2755.
[30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," IEEE Signal Processing Society, Tech. Rep., 2011.
[31] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in ICASSP. IEEE, 2017, pp. 5220–5224.
[32] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu et al., "The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays," in The 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), Interspeech, 2018.
[33] R. Doddipatla, T. Kagoshima, C.-T. Do, P. Petkov, C. Zorila et al., "The Toshiba entry to the CHiME 2018 challenge," in The 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), Interspeech, 2018.
[34] M. Kitza, W. Michel, C. Boeddeker, J. Heitkaemper, T. Menne, R. Schlüter, H. Ney et al., "The RWTH/UPB system combination for the CHiME 2018 workshop," in The 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), Interspeech, 2018.