ADL-MVDR: All deep learning MVDR beamformer for target speech separation
Zhuohuang Zhang, Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Dong Yu
Indiana University, Bloomington, USA; Tencent AI Lab
ABSTRACT
Speech separation algorithms are often used to separate the target speech from other interfering sources. However, purely neural network based speech separation systems often cause nonlinear distortion that is harmful for ASR systems. The conventional mask-based minimum variance distortionless response (MVDR) beamformer can be used to minimize the distortion, but comes with a high level of residual noise. Furthermore, the matrix inversion and eigenvalue decomposition processes involved in the conventional MVDR solution are not stable when jointly trained with neural networks. In this paper, we propose a novel all deep learning MVDR framework, where the matrix inversion and eigenvalue decomposition are replaced by two recurrent neural networks (RNNs), to resolve both issues at the same time. The proposed method can greatly reduce the residual noise while keeping the target speech undistorted by leveraging the RNN-predicted frame-wise beamforming weights. The system is evaluated on a Mandarin audio-visual corpus and compared against several state-of-the-art (SOTA) speech separation systems. Experimental results demonstrate the superiority of the proposed method across several objective metrics and ASR accuracy.
Index Terms — Speech separation, speech enhancement, MVDR, ADL-MVDR, deep learning
1. INTRODUCTION
Environmental noises and adverse room acoustics can greatly affect the quality of the speech signal and therefore degrade the effectiveness of many speech communication systems (e.g., digital hearing-aid devices [1] and automatic speech recognition (ASR) systems [2, 3, 4]). Speech enhancement and speech separation algorithms are thus proposed to alleviate this problem. With the renaissance of neural networks, better objective performance can be achieved using deep learning methods [5, 6, 7]. However, they often introduce a greater amount of nonlinear distortion on the separated target speech [8, 9, 10], which harms the performance of ASR systems. The minimum variance distortionless response (MVDR) filters [11] aim to reduce the noise while keeping the target speech undistorted. More recently, MVDR systems with a neural network (NN) based time-frequency (T-F) mask estimator can greatly reduce the word error rate (WER) of ASR systems with less distortion [12, 13, 14], yet they still suffer from residual noise since chunk- or utterance-level beamforming weights [15, 16, 14, 8] are not optimal for noise reduction. Some frame-level MVDR weight estimation methods have been proposed; in [17], the authors estimate the covariance matrix in a recursive way. Nevertheless, the calculated frame-wise weights are not stable when jointly trained with NNs. Previous studies have indicated that it is feasible for a recurrent neural network (RNN) to learn matrix inversion efficiently [18, 19] and that RNNs can better stabilize the processes of matrix inversion and principal component analysis (PCA) when jointly trained with NNs.

There are three main contributions in this work. Firstly, we propose a novel all deep learning MVDR framework (denoted as ADL-MVDR) that can be stably and jointly trained with the front-end filter estimator for frame-level beamforming weight estimation. Secondly, we propose to use RNNs to learn the matrix inversion and PCA from the noise and target speech covariance matrices, instead of utilizing the traditional mathematical approach. Thirdly, instead of using the classical per T-F bin mask, we adopt a complex ratio filtering method [20] (denoted as cRF) to further stabilize the joint training process and estimate the covariance matrices of target speech and noise more accurately. The RNN components of the ADL-MVDR system help to recursively estimate the statistical variables (i.e., the inverse of the noise covariance matrix and the PCA of the steering vector) in an adaptive way.

(This work was done while Z. Zhang was a research intern at Tencent AI Lab, Bellevue, USA. [email protected])
Meanwhile, a Conv-TasNet variant [9, 10] is adopted as the front-end filter estimator to calculate the frame-level covariance matrices. The proposed cRF based ADL-MVDR system achieves the best performance in many objective metrics as well as in ASR accuracy. To the best of our knowledge, this is the first study that applies RNNs to derive the MVDR solution by replacing the matrix inversion and PCA. Note that Xiao et al. [21] once proposed a directly NN-learned beamforming weights method, which was not successful due to the lack of noise information, whereas our approach still follows the mask-based MVDR framework and explicitly utilizes the noise and speech covariance matrices with RNNs.

The rest of the paper is organized as follows: Section 2 introduces the conventional mask-based MVDR beamformer and Section 3 describes the proposed ADL-MVDR beamformer. We present the dataset and experimental setup in Section 4. Results are reported in Section 5. Finally, we draw conclusions in Section 6.
2. SIGNAL MODEL FOR MVDR BEAMFORMER
This section describes the conventional mask-based MVDR beamformer; the proposed ADL-MVDR beamformer will be introduced in the next section. Consider a noisy speech mixture $\mathbf{y} = [y_1, y_2, \ldots, y_M]^T$ recorded with an $M$-microphone array. Let $\mathbf{s}$ represent the clean speech and let $\mathbf{n}$ denote the interfering noise with $M$ channels; then we have

$$\mathbf{Y}(t,f) = \mathbf{S}(t,f) + \mathbf{N}(t,f), \quad (1)$$

where $(t,f)$ indicates the time and frequency indices of the acoustic signals in the T-F domain, and $\mathbf{Y}$, $\mathbf{S}$, $\mathbf{N}$ denote the corresponding variables in the T-F domain. The separated speech $\hat{s}_{\mathrm{MVDR}}(t,f)$ can be obtained as

$$\hat{s}_{\mathrm{MVDR}}(t,f) = \mathbf{h}^H(f)\,\mathbf{Y}(t,f), \quad (2)$$

where $\mathbf{h}(f) \in \mathbb{C}^M$ represents the MVDR weights at frequency index $f$ and $H$ stands for the Hermitian operator. The goal of the MVDR beamformer is to minimize the power of the noise while keeping the target speech undistorted, which can be formulated as

$$\mathbf{h}_{\mathrm{MVDR}} = \arg\min_{\mathbf{h}} \mathbf{h}^H \Phi_{\mathbf{NN}} \mathbf{h} \quad \text{s.t.} \quad \mathbf{h}^H \mathbf{v} = 1, \quad (3)$$

where $\Phi_{\mathbf{NN}}$ stands for the covariance matrix of the noise power spectral density (PSD) and $\mathbf{v}(f) \in \mathbb{C}^M$ denotes the steering vector of the target speech. Different solutions can be used to derive the MVDR beamforming weights. In our study, we mainly focus on the MVDR solution that is based on the steering vector [17, 22], which can be derived by applying PCA on the speech covariance matrix:

$$\mathbf{h}(f) = \frac{\Phi_{\mathbf{NN}}^{-1}(f)\,\mathbf{v}(f)}{\mathbf{v}^H(f)\,\Phi_{\mathbf{NN}}^{-1}(f)\,\mathbf{v}(f)}, \quad \mathbf{h}(f) \in \mathbb{C}^M. \quad (4)$$

Note that the matrix inversion and PCA in Eq. (4) are not stable, especially when jointly trained with neural networks. A complex ratio mask [23] (denoted as cRM) can be used to estimate the target speech accurately with less phase distortion, which benefits human listeners [23, 24]. In this case, the estimated speech $\hat{\mathbf{S}}_{\mathrm{cRM}}$ and the covariance matrix of the speech PSD $\Phi_{\mathbf{SS}}$ can be computed as

$$\hat{\mathbf{S}}_{\mathrm{cRM}}(t,f) = \mathrm{cRM}_S(t,f) * \mathbf{Y}(t,f), \qquad \Phi_{\mathbf{SS}}(f) = \frac{\sum_{t=1}^{T} \hat{\mathbf{S}}_{\mathrm{cRM}}(t,f)\,\hat{\mathbf{S}}^H_{\mathrm{cRM}}(t,f)}{\sum_{t=1}^{T} \mathrm{cRM}^H_S(t,f)\,\mathrm{cRM}_S(t,f)}, \quad (5)$$

where $*$ denotes complex multiplication and $\mathrm{cRM}_S$ represents the estimated cRM for the speech target. The noise covariance matrix $\Phi_{\mathbf{NN}}$ can be obtained in a similar way. However, the covariance matrix $\Phi$ derived here is at the utterance level, which is not optimal for each frame, resulting in a high level of residual noise.
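As a concrete reference, the chain of Eqs. (2)-(5) can be sketched in NumPy as below. This is an illustrative sketch, not the paper's implementation: the function names and the diagonal-loading term `eps` (added for numerical stability) are assumptions of the sketch.

```python
import numpy as np

def mvdr_weights(Y, M_s, M_n, eps=1e-6):
    """Utterance-level mask-based MVDR, following Eqs. (4)-(5).

    Y   : (M, T, F) complex STFT of the M-channel mixture
    M_s : (T, F) complex ratio mask for the target speech
    M_n : (T, F) complex ratio mask for the noise
    Returns (F, M) beamforming weights h(f).
    """
    M, T, F = Y.shape
    S_hat = M_s[None] * Y          # masked speech estimate, per channel
    N_hat = M_n[None] * Y          # masked noise estimate

    h = np.zeros((F, M), dtype=complex)
    for f in range(F):
        # utterance-level covariance matrices, normalized by mask energy (Eq. (5))
        norm_s = np.sum(np.abs(M_s[:, f]) ** 2) + eps
        norm_n = np.sum(np.abs(M_n[:, f]) ** 2) + eps
        Phi_ss = S_hat[:, :, f] @ S_hat[:, :, f].conj().T / norm_s
        Phi_nn = N_hat[:, :, f] @ N_hat[:, :, f].conj().T / norm_n

        # steering vector via PCA: principal eigenvector of Phi_ss
        w, V = np.linalg.eigh(Phi_ss)
        v = V[:, -1]

        # h(f) = Phi_nn^{-1} v / (v^H Phi_nn^{-1} v),  Eq. (4)
        Phi_inv_v = np.linalg.solve(Phi_nn + eps * np.eye(M), v)
        h[f] = Phi_inv_v / (v.conj() @ Phi_inv_v + eps)
    return h

def apply_beamformer(h, Y):
    # s_hat(t, f) = h^H(f) Y(t, f),  Eq. (2)
    return np.einsum('fm,mtf->tf', h.conj(), Y)
```

Note that the eigendecomposition and matrix inversion here are exactly the operations that become unstable under joint training, motivating the RNN replacements in Section 3.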
3. PROPOSED ADL-MVDR BEAMFORMER
In this work, we implement two gated recurrent unit (GRU) [25] based networks (denoted as GRU-Nets) to replace the matrix inversion and PCA in Eq. (4) for frame-level beamforming weight estimation. One advantage of using RNNs is that they utilize the weighted information from all previous frames and do not need any heuristic updating factors between consecutive frames, as needed in recursive approaches [17, 26].
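For contrast, the heuristic recursive update that such approaches [17, 26] rely on can be sketched as follows; the forgetting factor `alpha` is a hypothetical hand-tuned value standing in for the heuristic updating factor mentioned above, not a value from those papers.

```python
import numpy as np

def recursive_covariance(x, alpha=0.95):
    """Frame-recursive covariance estimate with a heuristic forgetting
    factor, in the spirit of recursive MVDR updates.

    x : (T, M) complex multi-channel frames for one frequency bin
    Returns (T, M, M) per-frame covariance estimates.
    """
    T, M = x.shape
    Phi = np.zeros((T, M, M), dtype=complex)
    acc = np.zeros((M, M), dtype=complex)
    for t in range(T):
        inst = np.outer(x[t], x[t].conj())        # instantaneous outer product
        acc = alpha * acc + (1.0 - alpha) * inst  # exponential smoothing
        Phi[t] = acc
    return Phi
```

A GRU plays the same recursive role but learns its own gating from data, which is precisely what removes the need to hand-tune `alpha`.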
To better utilize the nearby T-F information and stabilize the estimated statistical variables (namely, $\Phi_{\mathbf{SS}}(t,f)$ and $\Phi_{\mathbf{NN}}(t,f)$), we adopt a complex ratio filtering (denoted as cRF) method [20] to estimate the speech and noise components. For each T-F bin, the cRF is applied to its $(2K+1) \times (2L+1)$ nearby T-F bins as

$$\hat{\mathbf{S}}_{\mathrm{cRF}}(t,f) = \sum_{\tau_1=-K}^{K} \sum_{\tau_2=-L}^{L} \mathrm{cRF}(t+\tau_1, f+\tau_2) * \mathbf{Y}(t+\tau_1, f+\tau_2), \qquad \Phi_{\mathbf{SS}}(t,f) = \frac{\hat{\mathbf{S}}_{\mathrm{cRF}}(t,f)\,\hat{\mathbf{S}}^H_{\mathrm{cRF}}(t,f)}{\sum_{t=1}^{T} \mathrm{cRM}^H_S(t,f)\,\mathrm{cRM}_S(t,f)}, \quad (6)$$

where $\hat{\mathbf{S}}_{\mathrm{cRF}}$ indicates the estimated speech using the complex ratio filter. The cRF is equivalent to $(2K+1) \times (2L+1)$ cRMs, each applied to the correspondingly shifted version (i.e., along the time and frequency axes) of the noisy spectrogram. The frame-level speech covariance matrix is then computed, where the center mask of the cRF (i.e., $\mathrm{cRM}_S(t,f)$) is used for normalization. Note that we do not sum over the time dimension of $\Phi_{\mathbf{SS}}$ in order to preserve the frame-level temporal information. The frame-level noise covariance matrix $\Phi_{\mathbf{NN}}(t,f)$ can be obtained in a similar way.

Fig. 1. Network structure of the proposed ADL-MVDR beamformer. $\otimes$ and $\odot$ indicate the operations expressed in Eq. (6) and (9), respectively. The complex filter estimation (i.e., cRF) and ADL-MVDR (i.e., estimation of $\mathbf{v}(t,f)$ and $\Phi^{-1}_{\mathbf{NN}}(t,f)$) blocks are highlighted in the blue and red dashed boxes, respectively. The real and imaginary parts are reshaped and concatenated before being fed into the GRU networks, and then reshaped again as inputs for the MVDR weight calculation. The estimated frame-level MVDR weights are then applied to the multi-channel speech.

Here we propose to estimate the steering vector and the inverse of the noise covariance matrix with two GRU-Nets. The GRU-Nets can better utilize temporal information from previous frames for estimating the statistical terms than conventional frame-wise approaches that are based on heuristic updating factors [17, 26]. Additionally, replacing the matrix inversion with a GRU-Net resolves the instability issue during joint training with NNs. We hypothesize that these GRU-Nets will learn the steering vector and the matrix inversion through backpropagation, i.e.,

$$\hat{\mathbf{v}}(t,f) = \mathrm{GRU\text{-}Net}_v(\Phi_{\mathbf{SS}}(t,f)), \qquad \hat{\Phi}^{-1}_{\mathbf{NN}}(t,f) = \mathrm{GRU\text{-}Net}_{\mathbf{NN}}(\Phi_{\mathbf{NN}}(t,f)), \quad (7)$$

where the real and imaginary parts of the complex-valued covariance matrix $\Phi$ are concatenated together as input to the GRU-Net. Here we assume that the explicitly calculated speech and noise covariance matrices are important for RNNs to learn the spatial filtering, which is different from the directly NN-learned beamforming weights in [21]. Leveraging the temporal structure of RNNs, the model recursively accumulates and updates the covariance matrix for each frame. As shown in Fig. 1, the output of each GRU-Net is fed into a linear layer to obtain the final real and imaginary parts of the complex-valued covariance matrix or steering vector. Then, we compute the frame-level ADL-MVDR weights as

$$\mathbf{h}(t,f) = \frac{\hat{\Phi}^{-1}_{\mathbf{NN}}(t,f)\,\hat{\mathbf{v}}(t,f)}{\hat{\mathbf{v}}^H(t,f)\,\hat{\Phi}^{-1}_{\mathbf{NN}}(t,f)\,\hat{\mathbf{v}}(t,f)}, \quad \mathbf{h}(t,f) \in \mathbb{C}^M. \quad (8)$$

Here $\mathbf{h}(t,f)$ is frame-wise, different from the utterance-level weights of the conventional mask-based MVDR defined in Eq. (4). Finally, the ADL-MVDR enhanced speech is obtained as

$$\hat{\mathbf{S}}_{\mathrm{ADL\text{-}MVDR}}(t,f) = \mathbf{h}^H(t,f)\,\mathbf{Y}(t,f). \quad (9)$$
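The dataflow of Eqs. (7)-(8) can be sketched in PyTorch as below, for a single frequency bin. The class names, hidden sizes, and tensor layout are illustrative assumptions (the actual configuration, e.g. M = 15 channels and the FC sizes given in Section 4, differs); the point is only the pipeline: per-frame covariance matrices in, learned $\hat{\mathbf{v}}$ and $\hat{\Phi}^{-1}_{\mathbf{NN}}$ out, then the frame-level weights.

```python
import torch
import torch.nn as nn

class GRUNet(nn.Module):
    """Maps per-frame real/imag covariance entries (2*M*M features)
    to a complex-valued output (steering vector or 'inverse' matrix)."""
    def __init__(self, M, out_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRU(2 * M * M, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, 2 * out_dim)   # real + imaginary parts

    def forward(self, Phi):                        # Phi: (B, T, M, M) complex
        B, T, M, _ = Phi.shape
        # concatenate real and imaginary parts, flatten per frame
        x = torch.cat([Phi.real, Phi.imag], dim=-1).reshape(B, T, -1)
        y, _ = self.gru(x)                         # recurrence over frames
        re, im = self.fc(y).chunk(2, dim=-1)
        return torch.complex(re, im)               # (B, T, out_dim)

def adl_mvdr_weights(Phi_ss, Phi_nn, net_v, net_inv, eps=1e-6):
    """Frame-level ADL-MVDR weights, Eqs. (7)-(8). Phi_*: (B, T, M, M)."""
    B, T, M, _ = Phi_ss.shape
    v = net_v(Phi_ss).unsqueeze(-1)                # (B, T, M, 1) steering vec
    Phi_inv = net_inv(Phi_nn).reshape(B, T, M, M)  # learned "inverse"
    num = Phi_inv @ v                              # Phi^{-1} v
    den = v.conj().transpose(-1, -2) @ num         # v^H Phi^{-1} v, (B,T,1,1)
    return (num / (den + eps)).squeeze(-1)         # h(t): (B, T, M)
```

Applying `h.conj()` to the multi-channel STFT frames then realizes Eq. (9); in a full system `net_v = GRUNet(M, M)` and `net_inv = GRUNet(M, M * M)` would be trained jointly with the front-end filter estimator.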
4. DATASET AND EXPERIMENTAL SETUP

4.1. System overview and dataset
The proposed system is evaluated on our previously reported multi-modal multi-channel target speech separation platform [10, 27]. As shown in Fig. 1, we extract the interaural phase difference (IPD) and log-power spectra (LPS) features from the 15-channel microphone recorded mixture that is synchronized with the camera [10]. The direction of arrival (DOA) is roughly estimated using the location of the target speaker's face in the whole camera view [10], and the location-guided directional feature (DF) [28] is then extracted. The DF is merged with the IPD and LPS features before being fed into the audio encoding blocks [10, 27]. We use our previously reported Mandarin audio-visual dataset [8, 10, 27], collected from Tencent Video and YouTube, as the speech corpus (to be released soon [29]). Different from our previous works [10, 27], the lip movement feature is not fed into the model in this study, as we focus on beamforming. The corpus contains 205,500 audio clips (roughly 200 hours) with the sampling rate set to 16 kHz. The simulated multi-channel audio data contains sources from different speakers (either target or interfering sources). The audios are further mixed with random cuts of noise, and different reverberation conditions are applied [10].

A 512-point FFT is applied with a 32 ms Hann window and a 16 ms step size to extract the audio features. The size of the cRF is empirically set to 3 × 3. The objective is to maximize the time-domain scale-invariant source-to-noise ratio (Si-SNR) [9]. All models are trained for 60 epochs. In terms of the GRU-Nets, the v̂(t,f) estimation network consists of two GRU layers followed by a fully connected (FC) layer. The hidden sizes are set to 500 and 250 for the 2-layer GRU with tanh activation; a linear activation is used for the FC layer with a hidden size of 30.
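The Si-SNR training objective above can be sketched as follows. This is one common formulation following [9]; the zero-mean normalization and the `eps` terms are assumptions of the sketch, and the training loss would be the negative of this value.

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB (higher is better).

    est, ref : (B, N) time-domain estimated and reference signals.
    """
    est = est - est.mean(dim=-1, keepdim=True)   # remove DC offset
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference: the scale-invariant target
    s_target = (est * ref).sum(-1, keepdim=True) * ref \
               / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target                     # everything off-target
    ratio = s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)
```

Because the estimate is projected onto the reference before measuring the error, rescaling the output leaves the score unchanged, which is what makes the metric scale-invariant.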
As for the Φ̂⁻¹_NN(t,f) estimation, the corresponding GRU-Net has a similar structure, where each GRU layer contains 500 units, followed by a 450-unit FC layer.

Four baseline systems are considered, including a purely NN-based (i.e., a Conv-TasNet variant [9]) cRM system [8], a purely NN-based cRF system, a conventional MVDR-based cRM system [8], and another MVDR-based cRF system (denoted as NN with cRM, NN with cRF, MVDR with cRM, and MVDR with cRF, respectively). We further include two multi-tap (i.e., [t, t−1]) MVDR systems proposed in our previous work [8], trained with cRM and cRF (denoted as Multi-tap MVDR with cRM/cRF).

Table 1. Experimental results for different speech separation systems across objective evaluation metrics. PESQ (∈ [−0.5, 4.5]) is reported per angle between the target and the closest interfering speaker and per number of overlapping speakers, followed by the average; Si-SNR (dB), SDR (dB), and WER (%) are averages.

Systems/Metrics                        | 0-15° | 15-45° | 45-90° | 90-180° | 1spk | 2spk | 3spk | Avg. PESQ | Si-SNR | SDR   | WER
NN with cRF (3 × 3)                    | 2.75  | 2.95   | 3.12   | 3.09    | 3.98 | 3.06 | 2.76 | 3.10      | 12.50  | 13.01 | 22.07
MVDR with cRM [8]                      | 2.55  | 2.76   | 2.96   | 2.84    | 3.73 | 2.88 | 2.56 | 2.90      | 10.62  | 12.04 | 16.85
MVDR with cRF (3 × 3)                  | 2.55  | 2.77   | 2.96   | 2.89    | 3.82 | 2.90 | 2.55 | 2.92      | 11.31  | 12.58 | 15.91
Multi-tap MVDR with cRM (2-tap) [8]    | 2.70  | 2.96   | 3.18   | 3.09    | 3.80 | 3.07 | 2.74 | 3.08      | 12.56  | 14.11 | 13.67
Multi-tap MVDR with cRF (2-tap, 3 × 3) | 2.67  | 2.95   | 3.15   | 3.10    | 3.92 | 3.06 | 2.72 | 3.08      | 12.66  | 14.04 | 13.52
Proposed ADL-MVDR with cRF (3 × 3)     | 3.04  | 3.30   | 3.48   | 3.48    | 4.17 | 3.41 | 3.07 | 3.42      | 14.80  | 15.45 | 12.73
Fig. 2. Sample spectrograms of some evaluated systems: reverberant clean (reference), NN with cRM, NN with cRF, MVDR with cRF, Multi-tap MVDR with cRF (2-tap), and the proposed ADL-MVDR with cRF.
5. RESULTS AND DISCUSSIONS
The systems' performance is evaluated with several objective metrics, including PESQ, Si-SNR [9], and SDR. A Tencent commercial Mandarin speech recognition API [31] is used for measuring the WER in this study. The PESQ scores are further broken down by condition (i.e., the angle between the target speaker and the closest interfering source, and the number of speakers); average scores are presented for the other metrics. Note that the systems are only trained for speech separation and denoising, without dereverberation.

ADL-MVDR vs. NN: the proposed ADL-MVDR system achieves significantly better results across all metrics and ASR accuracy than purely NN-based systems. Our proposed ADL-MVDR system achieves around 42% relative improvement on WER (i.e., 12.73% vs. 22.07%) compared to NN with cRF. Significant improvements across objective metrics are also observed (i.e., PESQ: 3.42 vs. 3.10, Si-SNR: 14.80 dB vs. 12.50 dB, SDR: 15.45 dB vs. 13.01 dB). Although the purely NN-based systems perform reasonably well on the objective metrics, they perform poorly with the ASR system due to the large amount of distortion (also highlighted in Fig. 2).
ADL-MVDR vs. MVDR: our proposed ADL-MVDR system achieves about 17% PESQ improvement over the baseline MVDR system with cRF (i.e., 3.42 vs. 2.92). In terms of ASR accuracy, the proposed ADL-MVDR system outperforms MVDR with cRF by a large margin (i.e., 12.73% vs. 15.91%). Considering that the commercial ASR system is already robust to some mild noise, the differences in WER become smaller for multi-tap MVDR systems, yet large gaps remain in all other metrics (e.g., 0.34 absolute improvement on average PESQ). Although conventional MVDR systems can alleviate the distortion issue, a lot of residual noise remains. This can be observed in the objective scores and is also highlighted in Fig. 2. Again, our proposed ADL-MVDR system resolves this issue (i.e., Si-SNR: 14.80 dB vs. 12.66 dB and SDR: 15.45 dB vs. 14.04 dB) while also keeping the target speech undistorted. Specifically, under extreme conditions where interfering sources are very close to each other (e.g., angles between 0-15°), our proposed ADL-MVDR system improves the speech quality by nearly 62% (i.e., PESQ: 3.04 vs. 1.88). The experimental results presented here verify our claims that the proposed ADL-MVDR system not only keeps the target speech undistorted (i.e., lowest WER) but also eliminates the residual noise (i.e., highest scores across all objective metrics).

(Demos, including real-world recording evaluations, are available at https://zzhang68.github.io/adlmvdr/)

cRM vs. cRF: the NN with cRF achieves better performance on all metrics (e.g., Si-SNR: 12.50 dB vs. 12.23 dB) and ASR accuracy (i.e., 22.07% vs. 22.49%) than NN with cRM. Only slight improvements are found on the conventional MVDR systems due to their utterance-level weights. The cRF is more important for the ADL-MVDR system since ADL-MVDR recursively derives frame-level weights from the estimated covariance matrices.
This indicates that the benefits of introducing T-F filtering include further reducing the residual noise while not distorting the speech.
6. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a novel all deep learning MVDR method to recursively learn the spatio-temporal filtering for multi-channel target speech separation. The proposed system outperforms prior arts across several objective metrics and ASR accuracy. The proposed ADL-MVDR framework is promising and could be generalized to many other speech separation systems. We will further verify this idea on single-channel speech separation and move on to dereverberation tasks.

7. REFERENCES

[1] T. Van den Bogaert, S. Doclo, J. Wouters, and M. Moonen, "Speech enhancement with multichannel wiener filter techniques in multimicrophone binaural hearing aids," JASA, vol. 125, no. 1, pp. 360-371, 2009.
[2] J. Du, Q. Wang, T. Gao, Y. Xu, L.-R. Dai, and C.-H. Lee, "Robust speech recognition with speech enhanced deep neural networks," in Interspeech, 2014.
[3] F. Weninger, H. Erdogan, et al., "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in LVA/ICA. Springer, 2015, pp. 91-99.
[4] J. Yu, B. Wu, et al., "Audio-visual multi-channel recognition of overlapped speech," Interspeech, 2020.
[5] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE TASLP, vol. 22, no. 12, pp. 1849-1858, 2014.
[6] Y. Luo and N. Mesgarani, "TasNet: time-domain audio separation network for real-time, single-channel speech separation," in ICASSP, 2018, pp. 696-700.
[7] Z. Zhang, C. Deng, Y. Shen, D. S. Williamson, Y. Sha, Y. Zhang, H. Song, and X. Li, "On loss functions and recurrency training for GAN-based speech enhancement systems," arXiv preprint arXiv:2007.14974, 2020.
[8] Y. Xu, M. Yu, et al., "Neural spatio-temporal beamformer for target speech separation," arXiv preprint arXiv:2005.03889, 2020.
[9] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE TASLP, vol. 27, no. 8, pp. 1256-1266, 2019.
[10] K. Tan, Y. Xu, S.-X. Zhang, M. Yu, and D. Yu, "Audio-visual speech separation and dereverberation with a two-stage multimodal network," IEEE J-STSP, 2020.
[11] B. D. Van Veen and K. M. Buckley, "Beamforming: A versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, no. 2, pp. 4-24, 1988.
[12] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP, 2016, pp. 196-200.
[13] H. Erdogan, J. R. Hershey, et al., "Improved MVDR beamforming using single-channel mask prediction networks," in Interspeech, 2016, pp. 1981-1985.
[14] Y. Xu, C. Weng, L. Hui, J. Liu, M. Yu, D. Su, and D. Yu, "Joint training of complex ratio mask based beamformer and acoustic model for noise robust ASR," in ICASSP, 2019, pp. 6745-6749.
[15] X. Xiao, S. Zhao, D. L. Jones, E. S. Chng, and H. Li, "On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition," in ICASSP, 2017, pp. 3246-3250.
[16] C. Boeddeker, H. Erdogan, T. Yoshioka, and R. Haeb-Umbach, "Exploring practical aspects of neural mask-based beamforming for far-field speech recognition," in ICASSP, 2018, pp. 6697-6701.
[17] T. Higuchi, K. Kinoshita, N. Ito, S. Karita, and T. Nakatani, "Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming," in ICASSP, 2018, pp. 531-535.
[18] J. Wang, "A recurrent neural network for real-time matrix inversion," Applied Mathematics and Computation, vol. 55, no. 1, pp. 89-100, 1993.
[19] Y. Zhang and S. S. Ge, "Design and analysis of a general recurrent neural network model for time-varying matrix inversion," IEEE Trans. on Neural Networks, vol. 16, no. 6, pp. 1477-1490, 2005.
[20] W. Mack and E. A. Habets, "Deep filtering: Signal extraction and reconstruction using complex time-frequency filters," IEEE Signal Processing Letters, vol. 27, pp. 61-65, 2019.
[21] X. Xiao, C. Xu, et al., "A study of learning based beamforming methods for speech recognition," in CHiME 2016 workshop, 2016.
[22] K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, "Unsupervised beamforming based on multichannel nonnegative matrix factorization for noisy speech recognition," in ICASSP, 2018, pp. 5734-5738.
[23] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for monaural speech separation," IEEE TASLP, vol. 24, no. 3, pp. 483-492, 2015.
[24] Z. Zhang, D. S. Williamson, and Y. Shen, "Investigation of phase distortion on perceived speech quality for hearing-impaired listeners," arXiv preprint arXiv:2007.14986, 2020.
[25] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[26] M. Tammen, D. Fischer, and S. Doclo, "DNN-based multi-frame MVDR filtering for single-microphone speech enhancement," arXiv preprint arXiv:1905.08492, 2019.
[27] R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu, "Multi-modal multi-channel target speech separation," IEEE J-STSP, 2020.
[28] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, "Multi-channel overlapped speech recognition with location guided speech extraction network," in IEEE SLT, 2018, pp. 558-565.
[29] S.-X. Zhang, Y. Xu, M. Yu, L. Chen, and D. Yu, "Multi-modal multi-channel system and corpus for cocktail party problems," in preparation, 2020.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.