Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition
Bo Wu, Meng Yu, Lianwu Chen, Yong Xu, Chao Weng, Dan Su, and Dong Yu
Tencent AI Lab, Shenzhen, China; Tencent AI Lab, Bellevue, WA, USA
{lambowu, raymondmyu, lianwuchen, lucayongxu, cweng, dansu, dyu}@tencent.com

Abstract
Speech enhancement techniques based on deep learning have brought significant improvements in speech quality and intelligibility. Nevertheless, a large gain in speech quality measured by objective metrics, such as perceptual evaluation of speech quality (PESQ), does not necessarily lead to improved speech recognition performance, due to speech distortion introduced in the enhancement stage. In this paper, a multi-channel dilated convolutional network based on frequency domain modeling is presented to enhance the target speaker in far-field, noisy and multi-talker conditions. We study three approaches towards distortionless waveforms for overlapped speech recognition: estimating a complex ideal ratio mask with an unbounded range, incorporating an fbank loss in a multi-objective learning framework, and fine-tuning the enhancement model with an acoustic model. Experimental results prove the effectiveness of all three approaches in reducing speech distortions and improving recognition accuracy. In particular, the jointly tuned enhancement model works well with a standalone acoustic model on real test data.
Index Terms: multi-channel enhancement, overlapped speech recognition, complex mask, multi-objective, joint training
1. Introduction
In the presence of interfering speakers, the intelligibility of target speech is usually degraded in the mixed signal. Such deterioration can severely affect automatic speech recognition (ASR). Although many techniques have been developed for speech enhancement/separation [1-9] and recognition [10-12] under these circumstances, this still remains one of the most challenging problems in ASR. A large gain in speech quality can translate to a negligible improvement in recognition accuracy [13, 14]. The discrepancy in objectives between speech enhancement and recognition results in this performance inconsistency. Compared to aggressive noise reduction, the subtle speech distortion introduced in the enhancement stage does not affect the enhancement loss or evaluation metrics, such as PESQ and signal-to-distortion ratio (SDR), very much. Determined by the training loss and the non-linear activations in the neural network, speech distortion is even worse in low signal-to-noise ratio (SNR) conditions. Such distortion is harmful to ASR.

To reduce speech distortion in the front-end processing, the study in [15] proposes a progressive learning framework that explicitly guides each hidden layer of the deep neural network to learn an intermediate target with gradual signal-to-noise ratio gains. The work presented in [16] imposes additional continuity constraints to alleviate over-estimation or under-estimation problems in the reconstructed signal. Wang et al. overcome the distortion problem with a distortion-independent back-end acoustic model [17]. Moreover, jointly modeling the front-end enhancement and the back-end acoustic model is another desirable solution for improving recognition accuracy in noisy and multi-talker environments. For example, Chang et al. design a neural sequence-to-sequence architecture for end-to-end multi-channel multi-speaker speech recognition in [18]. Other researchers in [19-21] propose to jointly train a neural beamformer and an acoustic model for noise-robust ASR. Nevertheless, most of these studies fail to answer what happens to the enhanced waveforms through joint training [12, 22], or whether the enhancement model fine-tuned by the jointly trained acoustic model still helps recognition on a standalone ASR system [18, 23].

In this paper, based on our previous work on end-to-end multi-channel convolutional TasNet with a short-time Fourier transform (STFT) kernel for target speech enhancement [13], we study and compare three main approaches towards distortionless waveforms for recognition, regarding mask types, target domains and loss functions, respectively. The contribution of this paper is three-fold. First, we compare various distortionless approaches in the same setup, whereas they are often discussed separately in different studies. Second, with joint optimization of magnitude and phase, we find that the uncompressed complex ideal ratio mask (cIRM) leads to significant protection of the target speech signal [24, 25]. Third, thanks to error back-propagation from the loss of the acoustic model [12, 26], the speech enhancement model produces distortionless signals, resulting in effective ASR performance in this particular end-to-end joint training setup. Furthermore, we show that such enhanced signals work well with a standalone ASR system on large recorded test sets as well.

The rest of the paper is organized as follows. In Section 2, we recap our direction-aware multi-channel enhancement network.
In Section 3, we present three distortionless methods. We describe our experimental setups and evaluate the effectiveness of the presented approaches in Section 4. We conclude this work in Section 5.
2. Multi-Channel Enhancement
Figure 1 shows our previous work on a direction-aware multi-channel target speech enhancement framework [13], which recovers the target speaker's voice from a reverberant, noisy and multi-talker mixed signal. It consists of three major parts.

Figure 1: Block diagram of the multi-channel target speech enhancement network.

(1) An encoder (a fixed STFT 1-D convolution layer) transforms the input waveform to the STFT domain. A reference channel, the first-channel waveform $y$ without loss of generality, is transformed to the spectral magnitude $Y$, which is used to compute the log-power spectra (LPS) as $\log(Y^2)$. The LPS feature vector is then concatenated with inter-channel phase differences (IPDs) and a target speaker-dependent angle feature (AF) [27]. The IPD feature represents spatial location information [28] and is calculated as the phase difference between two channels of the complex spectrogram:

$\mathrm{IPD}_k(t,f) = \angle\frac{Y_{k_1}(t,f)}{Y_{k_2}(t,f)}$ (1)

where $k_1$ and $k_2$ are the two microphones of the $k$-th microphone pair and $K$ is the total number of selected microphone pairs. An angle feature is incorporated as a target speaker bias. This feature was originally introduced in [28]; it computes the averaged cosine distance between the target speaker steering vector and the IPD over all selected microphone pairs:

$\mathrm{AF}(t,f) = \sum_{k=1}^{K} \frac{e_k(f)\,\frac{Y_{k_1}(t,f)}{Y_{k_2}(t,f)}}{\left|e_k(f)\,\frac{Y_{k_1}(t,f)}{Y_{k_2}(t,f)}\right|}$ (2)

where $e_k(f)$ is the steering vector coefficient of the target speaker at frequency $f$ with respect to the $k$-th microphone pair. As a result, the AF indicates whether a speaker from the desired direction dominates each time-frequency bin, which drives the network to extract the target speaker from the mixture.

(2) An enhancement block estimates the target speaker's ideal ratio mask. A temporal fully-convolutional network (TCN) [9] is adopted in the enhancement network, which infers the target speaker's ideal ratio mask $\hat{M}_{\mathrm{IRM}}$, activated by a ReLU function, with $\theta$ denoting the model parameters:

$g(\mathrm{LPS}, \mathrm{IPDs}, \mathrm{AF}; \theta) = \hat{M}_{\mathrm{IRM}}$ (3)

(3) A decoder (a fixed iSTFT 1-D convolution layer) reconstructs the waveform. A single-channel enhanced waveform $\hat{s}$ is reconstructed from the multiplication between the mixture magnitude $Y$ and the target speaker mask $\hat{M}_{\mathrm{IRM}}$:

$\hat{s} = \mathrm{iSTFT}(\hat{M}_{\mathrm{IRM}} \otimes Y, \varphi)$ (4)

where $\otimes$ is the element-wise product of the two operands and $\varphi$ represents the first-channel mixture speech phase.

The scale-invariant signal-to-noise ratio (SI-SNR) is used as the objective function to optimize the enhancement network:

$\mathrm{SI\text{-}SNR} := 10\log_{10}\frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{noise}}\|^2}$ (5)

where $s_{\mathrm{target}} = (\langle\hat{s}, s\rangle\, s)/\|s\|^2$, $e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}$, and $\hat{s}$ and $s$ are the estimated and reverberant target speech waveforms, respectively. Zero-mean normalization is applied to $\hat{s}$ and $s$ for scale invariance. We refer the readers to [13] for more details about the implementation of the multi-channel target speech enhancement model.
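To make the feature definitions above concrete, the following minimal NumPy sketch computes the IPD and AF features of Eqs. (1)-(2) and the SI-SNR of Eq. (5). The array shapes, the microphone-pair list, and the real-part reading of the unit-normalized product in the AF are our own assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical shapes: Y is the multi-channel complex spectrogram (mics, T, F);
# steer holds per-pair target steering coefficients e_k(f) with shape (K, F).
PAIRS = [(0, 3), (1, 4), (2, 5), (0, 1), (2, 3), (4, 5)]  # 6 pairs, 0-indexed

def ipd(Y, pairs=PAIRS):
    """Eq. (1): phase difference between the two mics of each pair."""
    return np.stack([np.angle(Y[k1] / Y[k2]) for k1, k2 in pairs])  # (K, T, F)

def angle_feature(Y, steer, pairs=PAIRS):
    """Eq. (2): cosine affinity between steering vector and observed IPD,
    summed over the K selected pairs. Taking the real part of the
    unit-normalized product is one reading of the formula (an assumption)."""
    af = 0.0
    for k, (k1, k2) in enumerate(pairs):
        z = steer[k][None, :] * (Y[k1] / Y[k2])   # (T, F), complex
        af = af + (z / np.abs(z)).real            # cosine of the phase gap
    return af                                      # (T, F)

def si_snr(s_hat, s, eps=1e-8):
    """Eq. (5): scale-invariant SNR between estimate and reference."""
    s_hat = s_hat - s_hat.mean()                   # zero-mean normalization
    s = s - s.mean()
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_hat - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```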
3. Distortionless Methods
Although multi-channel target speech enhancement has been proved effective in terms of PESQ and SDR [13, 14], directly passing the enhanced signal to ASR systems does not achieve the expected improvements in recognition accuracy. Figure 2 (a) and (b) display spectrograms of a reverberant target speech and the overlapped speech, respectively. For overlapped speech separation, the speech distortion problem is mainly caused by an enhancement algorithm that performs too aggressively, especially when the interfering speakers are stronger than the target speaker. Figure 2 (c) presents the output spectrogram estimated by the IRM-based multi-channel model described in Section 2 using the SI-SNR loss. Due to its destructive interference suppression, many holes appear in the enhanced spectrogram when compared with the reverberant target speech in (a). The situation is particularly bad in the blue box, where an interfering speaker dominates the time-frequency bins and most of the spectrogram contents are removed in the output. The harm of processing artifacts introduced during target speech enhancement may outweigh the benefit brought by interference suppression. We next investigate three types of methods to reduce speech distortions.

Figure 2: Reverberant, mixture and enhanced spectrograms.
The range of $\hat{M}_{\mathrm{IRM}}$ in Section 2 is $[0, +\infty)$ with the ReLU activation, so the model outputs are easily trapped around 0 when the energy of the target speech is lower than that of the interfering speech, introducing holes in the enhanced spectrogram $\hat{M}_{\mathrm{IRM}} \otimes Y$. The problem becomes more severe with a sigmoid mask. One way to deal with this drawback is to predict a linear cIRM of unbounded range instead:

$\hat{M}_{\mathrm{cIRM}} = \hat{M}_r + i\hat{M}_i$ (6)

where $\hat{M}_r, \hat{M}_i \in (-\infty, +\infty)$. Figure 2 (d) illustrates the enhancement result estimated by a cIRM-based multi-channel model subject to an SI-SNR loss. The spectrogram holes largely disappear when compared with the IRM-based enhanced spectrogram, and most of the enhanced spectrogram at low and intermediate frequencies is restored in the blue box. Studies in [24, 25] conclude that the gain of using the cIRM comes from jointly estimating the magnitude and phase spectra in the complex domain. We observe that the unbounded range of the cIRM helps eliminate speech distortions in enhancement.

Another way to obtain a distortionless spectrogram is to train the enhancement model in a multitask manner with an extra training loss in the fbank domain:

$\mathcal{L} = \mathcal{L}_{\mathrm{SISNR}} + \alpha \cdot \mathcal{L}_{\mathrm{MSE}}\{\mathrm{LFB}(\hat{s}) - \mathrm{LFB}(s)\}$ (7)

where the LFB(·) operation extracts log filterbank (LFB) features of the estimated waveform $\hat{s}$ and the reverberant target speech $s$, respectively, and $\alpha$ is the weight applied to the loss in the LFB domain. By incorporating the loss in the LFB domain, the enhancement model tends to predict target speech whose LFB features better fit the acoustic model. The blue box in Figure 2 (e) shows that high-frequency contents of the enhanced spectrogram are restored. Due to the LFB's low resolution in frequency, the overall enhancement is smooth and the harmonics in the blue box become blurred.

An integrated end-to-end paradigm that jointly models the front-end enhancement and back-end acoustic model [12, 26] is a desirable solution for eliminating the impact of distortion on recognition, since the target speech enhancement front-end is directly optimized towards improved speech recognition accuracy. A hybrid deep learning framework is adopted to perform the joint training for multi-channel overlapped speech recognition. We directly stack the LFB extraction layer of a convolutional, long short-term memory and fully connected deep neural network (CLDNN) acoustic model on top of the enhancement network's decoder layer. The connectionist temporal classification (CTC) objective function used to train the acoustic model is utilized to fine-tune the weights of the enhancement and recognition models. The blue box of Figure 2 (f) highlights that the acoustic model fine-tuned enhanced speech retains spectral details and observable harmonics, a closer match to the original target speech spectrogram. The gain of the joint model is simply summarized as a joint optimization of the speech enhancement and recognition networks in [12, 26]. Our detailed observation shows that the acoustic model driven enhancement creates less distortion in the output spectrogram, which is beneficial for recognition. The effectiveness of the above distortionless methods on recognition is demonstrated in Section 4.
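To make Eqs. (6) and (7) concrete, here is a minimal PyTorch-style sketch of applying an unbounded complex mask to the mixture STFT and combining a negative SI-SNR term with an LFB mean-squared-error term. The framing parameters, the externally supplied mel filterbank matrix, and all tensor shapes are our own assumptions, not the paper's released code.

```python
import torch

def si_snr(s_hat, s, eps=1e-8):
    """Scale-invariant SNR of Eq. (5), computed over the last axis."""
    s_hat = s_hat - s_hat.mean(-1, keepdim=True)   # zero-mean normalization
    s = s - s.mean(-1, keepdim=True)
    s_t = (s_hat * s).sum(-1, keepdim=True) / ((s * s).sum(-1, keepdim=True) + eps) * s
    e = s_hat - s_t
    return 10 * torch.log10((s_t ** 2).sum(-1) / ((e ** 2).sum(-1) + eps))

def apply_cirm(mix_stft, m_real, m_imag):
    """Eq. (6): (M_r + i*M_i) * Y with unbounded real/imaginary masks."""
    return torch.complex(m_real, m_imag) * mix_stft  # no sigmoid/ReLU squashing

def lfb(wave, mel_fb, eps=1e-8):
    """Log mel-filterbank features: 25-ms window, 10-ms hop at 16 kHz,
    zero-padded to a 512-point FFT (framing details are assumptions)."""
    win = torch.hann_window(400, device=wave.device)
    spec = torch.stft(wave, n_fft=512, hop_length=160, win_length=400,
                      window=win, return_complex=True)
    return torch.log(mel_fb @ (spec.abs() ** 2) + eps)  # mel_fb: (n_mels, 257)

def multi_objective_loss(s_hat, s, mel_fb, alpha=1.0):
    """Eq. (7): L = -SI-SNR + alpha * MSE(LFB(s_hat), LFB(s))."""
    l_lfb = torch.mean((lfb(s_hat, mel_fb) - lfb(s, mel_fb)) ** 2)
    return -si_snr(s_hat, s).mean() + alpha * l_lfb
```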
4. Experiments
We simulated a multi-channel reverberant version of a two-speaker mixture data set from the AISHELL-1 corpus, a public data set for Mandarin speech recognition [29]. A 6-element uniform circular array with a radius of 0.035 m is used as the signal receiver. The target speaker is mixed with an interfering speaker randomly at a signal-to-interference ratio (SIR) of -6, 0 or 6 dB. The classic image method [30] is used to add multi-channel room impulse responses (RIRs) to each source in the mixture, with reverberation time (RT60) ranging from 0.05 to 0.5 s. The room configuration (length-width-height) is randomly sampled from 3-3-2.5 m to 8-10-6 m. The microphone array and speakers are at least 0.3 m away from the walls. The distance between the microphone array and the speakers ranges from 1 m to 5 m. The speaker's direction-of-arrival ranges from 0° to 360°, so that our data set contains samples with the angle difference of the two speakers ranging from 0° to 180°. Moreover, the train, validation and test sets consist of 340, 40 and 20 speakers, respectively. The speakers in the three sets do not overlap, which means the approaches are evaluated in a speaker-independent scenario. All data is sampled at 16 kHz.
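As a rough illustration of this simulation setup (not the authors' pipeline), the sketch below builds one reverberant 6-channel mixture with the image method via pyroomacoustics; the room size, the use of a single absorption coefficient in place of a sampled RT60, and the source placements are simplified assumptions.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
room_dim = [6.0, 7.0, 3.0]                        # within 3x3x2.5 .. 8x10x6 m
room = pra.ShoeBox(room_dim, fs=fs, max_order=17,
                   materials=pra.Material(0.35))  # absorption stands in for RT60

# 6-mic uniform circular array, radius 0.035 m, placed at 1.2 m height.
center = np.array([3.0, 3.5])
mic_xy = pra.circular_2D_array(center, M=6, phi0=0, radius=0.035)
mic_xyz = np.vstack([mic_xy, 1.2 * np.ones(6)])
room.add_microphone_array(pra.MicrophoneArray(mic_xyz, fs))

target = np.random.randn(fs * 4)                  # stand-ins for AISHELL-1 speech
interf = np.random.randn(fs * 4)
sir_db = 0.0                                      # -6, 0, or 6 dB in the paper
interf *= np.linalg.norm(target) / np.linalg.norm(interf) / (10 ** (sir_db / 20))

room.add_source([4.5, 5.0, 1.5], signal=target)   # >= 1 m from the array
room.add_source([1.5, 2.0, 1.5], signal=interf)
room.simulate()                                   # image-method RIRs + convolution
mixture = room.mic_array.signals                  # (6, num_samples) reverberant mix
```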
For the encoder and decoder settings, the kernel size and stride are 512 and 256 samples, respectively. The kernel weights are set according to the STFT/iSTFT operation. A 257-dimensional LPS feature is extracted based on the output of the STFT kernel from the first-channel mixture. 6 IPDs are extracted between microphone pairs (1, 4), (2, 5), (3, 6), (1, 2), (3, 4) and (5, 6). Note that, to eliminate the impact of direction-of-arrival estimation errors on our findings, the target speaker's direction is assumed to be known for computing the AF.
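One common way to realize a fixed STFT kernel as a 1-D convolution, consistent with the 512/256 settings above, is to freeze a Conv1d whose filters are the windowed DFT basis. This minimal sketch is our own construction (the window choice is an assumption), not the released model.

```python
import math
import torch
import torch.nn as nn

def stft_encoder(n_fft=512, stride=256):
    """Fixed Conv1d whose 2*(n_fft//2+1) filters form the real/imag DFT basis."""
    n_bins = n_fft // 2 + 1
    t = torch.arange(n_fft, dtype=torch.float32)
    freqs = torch.arange(n_bins, dtype=torch.float32).unsqueeze(1)
    window = torch.hann_window(n_fft)              # window choice is an assumption
    real = torch.cos(2 * math.pi * freqs * t / n_fft) * window
    imag = -torch.sin(2 * math.pi * freqs * t / n_fft) * window
    conv = nn.Conv1d(1, 2 * n_bins, kernel_size=n_fft, stride=stride, bias=False)
    conv.weight.data = torch.cat([real, imag], dim=0).unsqueeze(1)
    conv.weight.requires_grad_(False)              # frozen: behaves as an STFT
    return conv

enc = stft_encoder()
wave = torch.randn(1, 1, 16000)                    # (batch, 1, samples)
out = enc(wave)                                    # (1, 514, frames)
re, im = out.chunk(2, dim=1)
lps = torch.log(re ** 2 + im ** 2 + 1e-8)          # 257-dim log-power spectra
```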
A linear connection layer with a 257-dimensional input and a 40-dimensional output is used to extract LFB features from single-channel waveforms with a 25-ms window length and a 10-ms hop size. The CLDNN model starts with two convolutional layers, followed by four LSTM layers, each with 512 hidden units, and then two fully-connected linear layers plus a softmax layer. We use context-independent phonemes as the modeling units, which form 218 classes in our Chinese ASR system. A tri-gram language model (LM) estimated on the AISHELL-1 text is used.
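The following PyTorch skeleton mirrors the CLDNN description above (two conv layers, four 512-unit LSTM layers, two linear layers, softmax over 218 context-independent phoneme classes). Kernel sizes, channel counts, and the log-softmax output for the CTC loss are assumptions we fill in, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class CLDNN(nn.Module):
    """Conv -> LSTM -> DNN acoustic model, per the description in the text."""
    def __init__(self, feat_dim=40, n_classes=218):
        super().__init__()
        self.conv = nn.Sequential(                 # two conv layers (sizes assumed)
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32 * feat_dim, 512, num_layers=4, batch_first=True)
        self.fc = nn.Sequential(                   # two linear layers + softmax
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, n_classes),
        )

    def forward(self, lfb):                        # lfb: (B, T, 40) LFB features
        x = self.conv(lfb.unsqueeze(1))            # (B, 32, T, 40)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.lstm(x)
        return self.fc(x).log_softmax(-1)          # log-probs for the CTC loss

model = CLDNN()
logp = model(torch.randn(2, 100, 40))              # (2, 100, 218)
```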
First, to obtain the best recognition result on test data enhanced by the IRM-based network, we train acoustic models using different training data and evaluate them on 4 test sets in Table 1. "cln.", "rev." and "mix." denote the dry clean signal of the target speaker, the reverberant signal of the target speaker and the first channel of the input signal, respectively. The IRM-based enhancement network using the SI-SNR loss, denoted "base", infers a single-channel output signal "base-enh" from the 6-channel overlapped noisy speech. A competitive character error rate (CER) of 11.97% is attained on the clean set "cln." by acoustic model A1, trained on the clean set only. Initialized with A1, several multi-condition acoustic models are investigated. Trained on all multi-condition training sets, acoustic model A4 achieves the best performance with a CER of 29.15%.

Table 1: CER of clean and multi-condition acoustic models

We next provide the results of a cIRM-based multi-channel target speech enhancement network subject to an SI-SNR constraint, labeled "sept-1" in Table 2. Following the optimal training strategy in Section 4.3.1, the acoustic model is trained on "cln.", "rev.", "mix." and data enhanced by "sept-1".
Table 2: CER and PESQ of distortionless methods on different SIRs and angle differences

system   mask   loss function     SIR                      angle difference                       Avg.    PESQ
                                  -6 dB   0 dB    6 dB     0°-15°  15°-45°  45°-90°  90°-180°
base     IRM    SISNR             36.92   27.88   22.74    38.15   28.81    27.29    26.19       29.15   2.72
sept-1   cIRM   SISNR             33.92   25.55   20.46    35.93   25.91    24.92    23.70       26.62   2.86
sept-2   cIRM   SISNR+MSE(LFB)    28.60   22.20   18.58    31.88   22.46    21.20    20.61       23.31   3.25
joint    cIRM   CTC               27.49   21.03   17.25    31.52   21.17    19.84    19.17       21.90   2.88

Table 3: CER of distortionless methods on real recorded RIRs using a mismatched DFSMN-based ASR system

system   mask   loss function     SIR                      angle difference    Avg.
                                  -6 dB   0 dB    6 dB     90°     180°
cln.     NA     NA                1.13    1.07    1.10     1.04    1.19        1.10
rev.     NA     NA                1.43    1.35    1.41     1.26    1.60        1.40
mix.     NA     NA                96.76   75.37   42.81    69.28   75.16       71.64
base     IRM    SISNR             21.25   9.14    5.32     11.37   12.57       11.87
sept-1   cIRM   SISNR             19.11   8.12    4.80     10.35   11.04       10.65
sept-2   cIRM   SISNR+MSE(LFB)    17.31   7.59    4.23     9.28    10.20       9.68
joint    cIRM   CTC               14.08   6.20    3.63     7.86    8.04        7.95

We attain a lower CER of 26.62% on test data enhanced by "sept-1". Moreover, "sept-1" consistently outperforms "base" in all tested SIRs and angle differences, illustrating the effectiveness of using distortionless cIRM-based reconstructed waveforms for recognition. Besides, the cIRM-based enhancement also achieves better speech quality with a PESQ value of 2.86, compared to 2.72 for the IRM-based method.

"sept-2" is a multi-channel target speech enhancement model estimating the cIRM under SI-SNR and mean squared error of LFB constraints. $\alpha = 1$ achieves the best CER score in our experiments. On test data enhanced by "sept-2", with the added recognition feature constraint, the acoustic model trained on all multi-condition data including "cln.", "rev.", "mix." and "sept-2" enhanced data further improves the CER to 23.31% from 26.62% in "sept-1". Moreover, compared to "sept-1", "sept-2" achieves a better PESQ value of 3.25.

Finally, we compare the joint model with the above separately trained systems. "joint" is initialized with the well-trained front-end enhancement model "sept-1" and the back-end acoustic model trained on all multi-condition data including "cln.", "rev.", "mix." and "sept-1" enhanced speech. A lower CER of 21.90% is achieved. If we freeze the back-end acoustic model in "joint" while keeping the front-end enhancement model learnable during joint training, a worse CER of 22.27% is obtained, demonstrating the superiority of an end-to-end joint model with trainable enhancement and acoustic models. As shown in Table 2, "joint" exhibits stable performance and consistently outperforms all separately trained systems in all SIR and angle difference categories. Specifically, a significant CER reduction is achieved, from 29.15% in "base" to 21.90% using "joint", a relative improvement of about 25%. It should be noted that pretraining "joint" with the better enhancement model "sept-2", or training "joint" in a multitask manner with SISNR+CTC, can further boost the recognition performance in our supplementary experiments. With matched acoustic modeling, the acoustic model fine-tuned speech achieves the highest recognition accuracy compared with the other two kinds of distortionless waveforms. Although "joint" achieves a much lower PESQ of 2.88 relative to 3.25 in "sept-2", it improves the CER to 21.90% from 23.31% in "sept-2". This is consistent with our analysis that a large gain in speech quality measured by objective metrics does not necessarily lead to improved speech recognition performance, due to speech distortion in the enhancement stage.
It is important to evaluate the distortionless methods in real-world conditions. Considering that there is no public overlapped real data recorded with a uniform circular array, and that collecting mixed speech is very time-consuming, we choose to record real RIRs in 6 realistic rooms. The 6-channel RIRs were measured with various loudspeaker-microphone distances of 0.5, 1, 2, 3 and 5 meters and azimuth angles of 0°, 90°, 180° and 270°, so that our real test set contains samples with angle differences of 90° and 180°. Note that an angle difference of 0° cannot be handled by direction-aware enhancement algorithms, since the target and interfering speakers are in the same direction. The mixed test data is generated in the same manner as in Section 4.1. More importantly, if we conclude that the gains of using the cIRM, the recognition feature constraint and the acoustic model fine-tuned enhancement come from distortionless waveforms, it is necessary to prove that the enhanced distortionless speech consistently helps recognition on a mismatched ASR system that has never seen the datasets used to train the enhancement and acoustic models. We therefore directly evaluate the real recorded speech, enhanced by the "base", "sept-1", "sept-2" and acoustic model fine-tuned enhancement systems without retraining, on a well-trained deep feed-forward sequential memory network (DFSMN)-based ASR system [31] in Table 3. The standalone ASR system is trained on a 10K-hour mixed Mandarin dataset from application domains and we refer the readers to [31] for more details. Experimental results show that CERs of 1.10% and 1.40% are obtained on clean and reverberant speech, respectively, demonstrating the excellent performance of the DFSMN-based ASR system. Relative to "base", all separately and jointly trained systems perform better, and the acoustic model fine-tuned enhancement still attains the best CER in all tested SIRs and angle differences.
5. Conclusions
We assess three main approaches towards distortionless waveforms for overlapped speech recognition in this paper. We show that all the methods are effective in reducing speech distortions and can improve recognition. The acoustic model fine-tuned enhancement outperforms all separately trained systems, both on simulated test data with the jointly trained acoustic model and on real test data with a well-trained standalone acoustic model.

6. References

[1] DeLiang Wang and Jitong Chen, "Supervised speech separation based on deep learning: an overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702-1726, 2018.
[2] John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. ICASSP, 2016, pp. 31-35.
[3] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R. Hershey, "Single-channel multi-speaker separation using deep clustering," arXiv:1607.02173, 2016.
[4] Yi Luo, Zhuo Chen, and Nima Mesgarani, "Speaker-independent speech separation with deep attractor network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787-796, 2018.
[5] Zhuo Chen, Yi Luo, and Nima Mesgarani, "Deep attractor network for single-microphone speaker separation," in Proc. ICASSP, 2017, pp. 246-250.
[6] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in Proc. ICASSP, 2017, pp. 241-245.
[7] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901-1913, 2017.
[8] Yi Luo and Nima Mesgarani, "TasNet: time-domain audio separation network for real-time, single-channel speech separation," in Proc. ICASSP, 2018, pp. 696-700.
[9] Yi Luo and Nima Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.
[10] Chao Weng, Dong Yu, Michael L. Seltzer, and Jasha Droppo, "Deep neural networks for single-channel multi-talker speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 10, pp. 1670-1679, 2015.
[11] Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, "The third CHiME speech separation and recognition challenge: dataset, task and baselines," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 504-511.
[12] Bo Wu, Kehuang Li, Fengpei Ge, Zhen Huang, Minglei Yang, Sabato Marco Siniscalchi, and Chin-Hui Lee, "An end-to-end deep learning approach to simultaneous speech dereverberation and acoustic modeling for robust speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1289-1300, 2017.
[13] Fahimeh Bahmaninezhad, Jian Wu, Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Meng Yu, and Dong Yu, "A comprehensive study of speech separation: spectrogram vs waveform separation," arXiv:1905.07497, 2019.
[14] Z. Wang, J. Le Roux, and J. R. Hershey, "Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation," in Proc. ICASSP, 2018, pp. 1-5.
[15] Tian Gao, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "SNR-based progressive learning of deep neural network for speech enhancement," in Proc. INTERSPEECH, 2016, pp. 3713-3717.
[16] Yong Xu, Jun Du, Zhen Huang, Li-Rong Dai, and Chin-Hui Lee, "Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement," arXiv:1703.07172, 2017.
[17] Peidong Wang, Ke Tan, et al., "Bridging the gap between monaural speech enhancement and recognition with distortion-independent acoustic modeling," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 39-48, 2019.
[18] Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, and Shinji Watanabe, "MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition," arXiv:1910.06522, 2019.
[19] Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey, and Xiong Xiao, "Unified architecture for multichannel end-to-end speech recognition with neural beamforming," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1274-1288, 2017.
[20] Yong Xu, Chao Weng, Like Hui, Jianming Liu, Meng Yu, Dan Su, and Dong Yu, "Joint training of complex ratio mask based beamformer and acoustic model for noise robust ASR," in Proc. ICASSP, 2019, pp. 6745-6749.
[21] Jahn Heymann, Lukas Drude, Christoph Boeddecker, Patrick Hanebrink, and Reinhold Haeb-Umbach, "BEAMNET: End-to-end training of a beamformer-supported multi-channel ASR system," in Proc. ICASSP, 2017.
[22] Z. Wang and D. Wang, "A joint training framework for robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796-806, 2016.
[23] Arun Narayanan and DeLiang Wang, "Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 92-101, 2014.
[24] Donald S. Williamson, Yuxuan Wang, and DeLiang Wang, "Complex ratio masking for joint enhancement of magnitude and phase," in Proc. ICASSP, 2016.
[25] Donald S. Williamson, Yuxuan Wang, and DeLiang Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483-492, 2015.
[26] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP, 2016, pp. 4945-4949.
[27] Rongzhi Gu, Lianwu Chen, Shi-Xiong Zhang, Jimeng Zheng, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, and Dong Yu, "Neural spatial filter: target speaker speech separation assisted with directional information," in Proc. INTERSPEECH, 2019, pp. 4290-4294.
[28] Zhuo Chen, Xiong Xiao, Takuya Yoshioka, Hakan Erdogan, Jinyu Li, and Yifan Gong, "Multi-channel overlapped speech recognition with location guided speech extraction network," in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 558-565.
[29] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in Proc. O-COCOSDA, 2017, pp. 1-5.
[30] E. A. Lehmann and A. M. Johansson, "Prediction of energy decay in room impulse responses simulated with an image-source model," The Journal of the Acoustical Society of America, vol. 124, no. 1, pp. 269-277, 2008.
[31] Zhao You, Dan Su, Jie Chen, Chao Weng, and Dong Yu, "DFSMN-SAN with persistent memory model for automatic speech recognition," arXiv:1910.13282, 2019.