Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation
†∗ Yong Xu, ‡∗ Zhuohuang Zhang, † Meng Yu, † Shi-Xiong Zhang, § Lianwu Chen, † Dong Yu
† Tencent AI Lab, Bellevue, USA ‡ Indiana University, USA § Tencent AI Lab, Shenzhen, China
ABSTRACT
Recently we proposed an all-deep-learning minimum variance distortionless response (ADL-MVDR) method where the unstable matrix inverse and principal component analysis (PCA) operations in the MVDR were replaced by recurrent neural networks (RNNs). However, it is not clear whether the success of the ADL-MVDR comes from the calculated covariance matrices or from following the MVDR formula. In this work, we demonstrate the importance of the calculated covariance matrices and propose three types of generalized RNN beamformers (GRNN-BFs) whose beamforming solutions go beyond the MVDR. The GRNN-BFs predict the frame-wise beamforming weights by leveraging the temporal modeling capability of RNNs. The proposed GRNN-BF method obtains better performance than the state-of-the-art ADL-MVDR and the traditional mask-based MVDR methods in terms of speech quality (PESQ), signal-to-noise ratio (SNR), and word error rate (WER).
Index Terms — MVDR, generalized RNN beamformer, ADL-MVDR, GEV, speech separation
1. INTRODUCTION
The MVDR beamformer with NN-predicted masks [1, 2, 3, 4, 5] generally introduces less non-linear distortion than purely neural network (NN) based speech separation approaches [6, 7, 8, 9, 10]. However, the residual noise level of the mask-based MVDR method is high [5]. Most mask-based beamformers are optimized at the chunk level [1, 5]; the calculated beamforming weights are hence chunk-level and not optimal for each frame. Furthermore, the matrix inverse and PCA processes involved in traditional beamformers (e.g., MVDR, GEV, etc.) are not stable, especially when they are jointly trained with neural networks. Time-varying beamformers were investigated in [11, 12]; however, they still suffered from the instability problem.

(Footnote: This work was done when Z. Zhang was an intern at Tencent AI Lab, Bellevue, WA, USA. ∗ Equal contribution, [email protected])

To overcome the above-mentioned issues, we recently proposed an all-deep-learning MVDR (ADL-MVDR) method [13] which was superior to the traditional MVDR beamformer [5]. In the ADL-MVDR [13], the matrix inverse and PCA operations are replaced by two recurrent neural networks (RNNs) with the calculated speech and noise covariance matrices as the input. Significant residual noise reduction and better speech recognition accuracy were achieved [13]. Note that Xiao [14] once proposed a learning-based beamforming method which performed worse than the mask-based MVDR approach because it did not explicitly use the speech and noise covariance matrix information.

However, it is not clear whether the advantage of the ADL-MVDR comes from the calculated covariance matrices or from following the MVDR formula. There are three contributions in this work. First, we demonstrate that the calculated speech and noise covariance matrices are the key factors. This finding is important because it paves the road to designing more optimal deep learning based beamformers. Second, three types of generalized RNN beamformers (GRNN-BFs) are proposed. Leveraging the success of RNNs in solving matrix inverse and eigenvalue decomposition problems [15, 16, 17, 18], any kind of traditional beamformer (e.g., MVDR [2], GEV [1], multichannel Wiener filtering [19], etc.) could be solved by RNNs; however, only the RNN-MVDR was investigated previously [15]. The proposed GRNN-BF achieves the best performance among all methods. Finally, we not only generalize the beamforming formula, but also generalize the RNN-BF model from two branches to one.

The rest of this paper is organized as follows. Section 2 describes mask-based traditional beamformers. Section 3 presents the proposed generalized RNN-based beamformers. The experiments and results are provided in Section 4. Section 5 concludes this paper.
2. MASK BASED TRADITIONAL BEAMFORMERS
In mask-based beamforming [1, 2, 4, 5], the masks can be real-valued or complex-valued [5]. The target speech covariance matrix Φ_SS is calculated as

Φ_SS(f) = Σ_{t=1}^{T} M_S(t,f) Y(t,f) Y^H(t,f) / Σ_{t=1}^{T} M_S(t,f)   (1)

where Y(t,f) is the short-time Fourier transform (STFT) of the multi-channel mixture signal at the t-th time frame and f-th frequency bin, M_S denotes the mask of the target speech, T stands for the total number of frames in a chunk, and ^H is the Hermitian transpose. Then the MVDR solution is

w_MVDR(f) = Φ_NN^{-1}(f) v(f) / (v^H(f) Φ_NN^{-1}(f) v(f)),  w_MVDR(f) ∈ C^C   (2)

where v(f) stands for the steering vector at the f-th frequency bin. v(f) can be calculated by applying PCA on Φ_SS(f), namely v(f) = P{Φ_SS(f)}. w_MVDR(f) is a C-channel complex-valued vector at the f-th frequency bin.

Another beamformer is the generalized eigenvalue (GEV) beamformer [1]. Its solution is derived by maximizing the SNR:

w_GEV(f) = argmax_w [w^H(f) Φ_SS(f) w(f)] / [w^H(f) Φ_NN(f) w(f)]   (3)

This optimization problem leads to the generalized eigenvalue decomposition (GEVD) [1]. The optimal solution is the generalized principal component [1, 20],

w_GEV(f) = P{Φ_NN^{-1}(f) Φ_SS(f)},  w_GEV(f) ∈ C^C   (4)

However, the matrix inverse involved in Eq. (2) and Eq. (4), the PCA needed to solve for the steering vector in the MVDR, and the GEVD required in Eq. (4) are not stable when jointly trained with neural networks in the back-propagation process.
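As a concrete illustration of Eqs. (1) and (2), a minimal NumPy sketch for a single frequency bin might look as follows; the array sizes and helper names are illustrative assumptions, not part of the paper's implementation:

```python
import numpy as np

def mask_covariance(Y, M):
    """Eq. (1): mask-weighted covariance for one frequency bin.
    Y: (T, C) complex STFT frames, M: (T,) real-valued mask."""
    num = sum(M[t] * np.outer(Y[t], Y[t].conj()) for t in range(len(M)))
    return num / M.sum()

def mvdr_weights(phi_nn, v):
    """Eq. (2): w = Phi_NN^{-1} v / (v^H Phi_NN^{-1} v).
    Uses a linear solve instead of an explicit matrix inverse."""
    num = np.linalg.solve(phi_nn, v)
    return num / (v.conj() @ num)
```

The resulting weights satisfy the distortionless constraint w^H v = 1, which is a quick way to sanity-check such an implementation.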
3. RNN-BASED BEAMFORMERS

3.1. All deep learning MVDR (ADL-MVDR)
Recently we proposed the ADL-MVDR method [13] to overcome the issues of the traditional methods mentioned above. The ADL-MVDR uses two RNNs to replace the matrix inverse and PCA in the MVDR. As suggested in [13], a complex-valued ratio filter (cRF) [21] rather than a complex-valued ratio mask (cRM) [22] is used to calculate the covariance matrices. The cRF uses a (2K+1) × (2K+1) neighboring context (i.e., the surrounding (2K+1) frames and (2K+1) frequency bins) to stabilize the calculation of the covariance matrices at (t,f). The estimated target speech is

ˆS(t,f) = Σ_{τ1=-K}^{K} Σ_{τ2=-K}^{K} cRF(t+τ1, f+τ2) ∗ Y(t+τ1, f+τ2)   (5)

If K = 0, the cRF [21] is exactly the same as the cRM [22]. The frame-wise speech covariance matrix is calculated as

Φ_SS(t,f) = ˆS(t,f) ˆS^H(t,f) / Σ_{t=1}^{T} cRF^H(t,f) cRF(t,f)   (6)

The temporal sequences of speech and noise covariance matrices are fed into two RNNs to hypothetically learn the steering vector and the matrix inversion, respectively:

ˆv(t,f) = RNN(Φ_SS(t,f))   (7)

ˆΦ_NN^{-1}(t,f) = RNN(Φ_NN(t,f))   (8)

The uni-directional RNNs can automatically accumulate and update the covariance matrices from the history frames. Finally, the frame-wise beamforming weights are

w_ADL-MVDR(t,f) = ˆΦ_NN^{-1}(t,f) ˆv(t,f) / (ˆv^H(t,f) ˆΦ_NN^{-1}(t,f) ˆv(t,f))   (9)

However, it is not clear whether the key factor of the ADL-MVDR is the calculated covariance matrices or following the MVDR formula (as shown in Eq. (9)). Here we propose different types of generalized RNN beamformers (GRNN-BFs). Similar to the ADL-MVDR, the GRNN-BF module (as shown in Fig. 1) takes the calculated target speech covariance matrix Φ_SS(t,f) (Eq. (6)) and the noise covariance matrix Φ_NN(t,f) as the input to predict the beamforming weights in different forms. As defined in Eq. (4), the solution for the GEV beamformer is the generalized principal component.
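A minimal NumPy sketch of the cRF filtering of Eq. (5) and the per-(t,f) outer product at the core of Eq. (6) might look as follows; the tensor layout (taps × time × frequency × channels), zero padding at the edges, and the helper names are illustrative assumptions, and the chunk-level normalization of Eq. (6) is omitted:

```python
import numpy as np

def apply_crf(crf, Y, K=1):
    """Eq. (5) sketch: sum the (2K+1) x (2K+1) complex filter taps over the
    T-F neighbourhood of each bin. crf: (2K+1, 2K+1, T, F, C), Y: (T, F, C)."""
    T, F, C = Y.shape
    Yp = np.pad(Y, ((K, K), (K, K), (0, 0)))      # zero-pad the T-F borders
    S = np.zeros_like(Y)
    for dt in range(-K, K + 1):
        for df in range(-K, K + 1):
            # tap (dt, df) multiplies the mixture shifted by (dt, df)
            S += crf[dt + K, df + K] * Yp[dt + K:dt + K + T, df + K:df + K + F]
    return S

def framewise_cov(S):
    """Eq. (6) numerator: per-(t, f) outer product over the channel axis,
    giving one C x C covariance matrix for every time-frequency bin."""
    return np.einsum('tfc,tfd->tfcd', S, S.conj())
```

Setting K = 0 with an all-ones filter reduces `apply_crf` to a pass-through mask, matching the remark that the cRF degenerates to a cRM.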
However, we propose a generalized RNN-GEV (GRNN-GEV) as shown in Fig. 1:

ˆΦ_NN^{-1}(t,f) = RNN(Φ_NN(t,f))   (10)

ˆΦ_SS(t,f) = RNN(Φ_SS(t,f))   (11)

w_GRNN-GEV(t,f) = DNN(ˆΦ_NN^{-1}(t,f) ˆΦ_SS(t,f))   (12)

ˆS(t,f) = (w_GRNN-GEV(t,f))^H Y(t,f)   (13)

where w_GRNN-GEV(t,f) ∈ C^C. ˆΦ_SS(t,f) is the speech covariance matrix accumulated from the history frames by leveraging the temporal modeling capability of RNNs. Instead of using the actual PCA (as in Eq. (4)), a deep neural network (DNN) is utilized to calculate the beamforming weights for the GRNN-GEV. Hinton et al. [23] showed that a DNN has the ability to conduct non-linear generalized PCA.

Similar to the GRNN-GEV, we also propose another generalized RNN beamformer (GRNN-BF-I). As shown in Fig. 1, ˆΦ_NN^{-1}(t,f) and ˆΦ_SS(t,f) are concatenated rather than multiplied as in the GRNN-GEV:

w_GRNN-BF-I(t,f) = DNN([ˆΦ_NN^{-1}(t,f), ˆΦ_SS(t,f)])   (14)

where w_GRNN-BF-I(t,f) ∈ C^C. We try to predict the beamforming weights directly, without following any traditional beamformer's formula (e.g., MVDR or GEV). Note that the physical meaning of ˆΦ_NN^{-1}(t,f) and ˆΦ_SS(t,f) (after being processed by the RNNs as in Eq. (10) and (11)) might change in GRNN-BF-I due to the more flexible optimization. All of the covariance matrices and beamforming weights are complex-valued, and we concatenate the real and imaginary parts of any complex-valued matrix or vector throughout this work.
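The real/imaginary concatenation used for all network inputs and outputs can be sketched as follows. This is only a shape illustration: an untrained random linear map stands in for the trained RNN-DNN of Eq. (14), and the microphone count and variable names are assumptions:

```python
import numpy as np

C = 4  # number of microphones (illustrative)
rng = np.random.default_rng(1)

def to_real_features(phi):
    """Flatten a complex C x C covariance matrix into one real vector by
    concatenating its real and imaginary parts, as done for all network I/O."""
    return np.concatenate([phi.real.ravel(), phi.imag.ravel()])

def to_complex_weights(out):
    """Reassemble a 2C-dim real network output into C complex weights."""
    return out[:C] + 1j * out[C:]

# Stand-in for the trained DNN of Eq. (14): an untrained linear map from the
# two concatenated covariance features (2 * 2C^2 reals) to 2C reals.
W = rng.standard_normal((2 * C, 4 * C * C))

phi_ss = rng.standard_normal((C, C)) + 1j * rng.standard_normal((C, C))
phi_nn = rng.standard_normal((C, C)) + 1j * rng.standard_normal((C, C))
x = np.concatenate([to_real_features(phi_nn), to_real_features(phi_ss)])
w = to_complex_weights(W @ x)                 # frame-wise weights, shape (C,)
y = rng.standard_normal(C) + 1j * rng.standard_normal(C)
s_hat = w.conj() @ y                          # beamformed output, Eq. (13)
```

The same flattening applies to GRNN-GEV and GRNN-BF-II; only what is fed into the network (a product, a concatenation, or the raw covariances) changes.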
Fig. 1. Joint training of the proposed generalized RNN-based beamformers (GRNN-BFs) with the dilated conv-1d based complex-valued ratio filters (cRFs). α = ŝ^T s / s^T s is the scaling factor in the time-domain scale-invariant SNR (Si-SNR) loss [7].

Finally, we also propose GRNN-BF-II. As shown in Fig. 1, the model structure for predicting the beamforming weights from the calculated covariance matrices Φ_SS(t,f) and Φ_NN(t,f) is further generalized from two branches to one branch:

w_GRNN-BF-II(t,f) = RNN-DNN([Φ_NN(t,f), Φ_SS(t,f)])   (15)

where w_GRNN-BF-II(t,f) ∈ C^C. The input to the RNN-DNN is the concatenated tensor of Real(Φ_SS(t,f)), Imag(Φ_SS(t,f)), Real(Φ_NN(t,f)) and Imag(Φ_NN(t,f)).
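The Si-SNR loss from Fig. 1 can be written out directly. A small sketch, assuming time-domain signals as 1-D arrays and an assumed numerical-stability constant `eps`:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Si-SNR with the scaling factor alpha = est^T ref / (ref^T ref)
    from Fig. 1, making the metric invariant to rescaling of the estimate."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    return 10 * np.log10((np.sum(target ** 2) + eps) /
                         (np.sum((est - target) ** 2) + eps))
```

During joint training the network maximizes this quantity, i.e., the loss is the negative Si-SNR of the predicted target waveform against the reference.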
4. EXPERIMENTAL SETUP AND RESULTS

Dataset: The methods are evaluated on a Mandarin audio-visual corpus [24, 25], which is collected from Tencent Video and YouTube (to be released soon [26]). The dataset has 205,500 clean speech segments (about 200 hours) over 1500 speakers. The audio sampling rate is 16 kHz. A 512-point STFT is used to extract audio features with a 32 ms Hann window and 50% overlap. The multi-talker multi-channel dataset is simulated in a similar way to our previous work [24, 25]. We use a 15-element non-uniform linear array. Based on the image-source simulation method [27], the simulated dataset contains 190,000, 15,000 and 500 multi-channel mixtures for training, validation and testing, respectively. The room size ranges from 4m × 4m × 2.5m to 10m × 8m × 6m. The reverberation time T60 is sampled in the range of 0.05 s to 0.7 s. The signal-to-interference ratio (SIR) ranges from -6 to 6 dB. Also, noise with 18-30 dB SNR is added to all the multi-channel mixtures [24]. A commercial general-purpose Mandarin speech recognition Tencent API [28] is used to test the ASR performance.

cRF estimator: We use the complex-valued ratio filter (cRF) [13, 21] rather than the complex-valued ratio mask (cRM) [5, 22] to calculate the covariance matrices (shown in Eq. (5)). The input to the cRF estimator includes the 15-channel mixture audio and the estimated target DOA (θ). From the multi-channel audio, log-power spectra (LPS) and interaural phase difference (IPD) [24] features are extracted. Our hardware aligns the wide-angle camera with the 15 linear microphones [5]. Hence the target DOA (θ) can be roughly estimated from the camera view by locating the target speaker's face. Then the location-guided directional feature (DF) [29], namely d(θ), is estimated by calculating the cosine similarity between the target steering vector v and the IPDs [29, 25]. d(θ) is a speaker-dependent feature.
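As a rough sketch of how the IPD and the location-guided feature d(θ) can be formed for a single far-field microphone pair (the geometry, the speed of sound and the variable names are illustrative assumptions, not the paper's exact feature extractor):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

def directional_feature(Y_pair, mic_dist, theta, freqs):
    """Cosine similarity between the observed IPD of one mic pair and the
    phase difference predicted by the target DOA theta (far-field model).
    Y_pair: (2, T, F) complex STFTs, freqs: (F,) bin frequencies in Hz."""
    ipd = np.angle(Y_pair[0] * np.conj(Y_pair[1]))      # (T, F) observed IPD
    tau = mic_dist * np.cos(theta) / SPEED_OF_SOUND     # expected pair TDOA
    tpd = 2 * np.pi * freqs * tau                       # (F,) target phase diff
    return np.cos(ipd - tpd)       # near 1 where a bin is dominated by theta
```

In practice the features from several microphone pairs are averaged, and the result is concatenated with the LPS and IPD features before the dilated 1D-CNN blocks.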
The LPS, IPDs and d(θ) are merged and fed into a stack of dilated 1D-CNN blocks similar to the structure of Conv-TasNet [7]. The details can be found in our previous work [24, 25]. K for the cRF is set to one. The RNNs are 2-layer uni-directional gated recurrent units (GRUs) with 500 hidden nodes. The DNNs have 2 layers of 500 hidden units with ReLU activation functions. The model is trained in a chunk-wise mode with a 4-second chunk size, using the Adam optimizer. The initial learning rate is set to 1e-3. PyTorch 1.1.0 was used. We evaluate the systems with different metrics, including PESQ, Si-SNR (dB), SDR (dB) and word error rate (WER).

Table 1. PESQ, Si-SNR [7], SDR and WER results among the purely NN, MVDR and proposed GRNN-BF systems, broken down by the angle between the target speaker and the others.

Three types of generalized RNN beamformers are proposed in this work. The proposed GRNN-GEV still follows the GEV beamformer formula (defined in Eq. (4)) but uses a DNN to replace the generalized PCA. The proposed GRNN-BF-I and GRNN-BF-II directly predict the beamforming weights from the calculated frame-wise covariance matrices, rather than following any traditional beamforming formula. The three proposed generalized RNN beamformers have similar performance. The proposed GRNN-BF-II achieves the best performance in terms of PESQ, Si-SNR, SDR and WER. Compared to the ADL-MVDR [13], which still follows the classic MVDR formula (defined in Eq. (9)), GRNN-BF-II obtains better performance with 0.1 higher PESQ and a relative 6.8% WER reduction. This indicates that there is no need to follow the formula of any traditional beamformer solution in the RNN-based beamforming framework; the calculated target speech and noise covariance matrices play a key role in the success of the GRNN-BFs. Compared to the jointly trained cRF-based traditional MVDR [5] or multi-tap MVDR [5] systems, the proposed GRNN-BF-II significantly improves the objective measures, e.g., increasing PESQ from 3.08 to 3.52 on average. GRNN-BF-II also gets a lower WER, reducing it from 13.52% to 11.86%. This indicates that GRNN-BF-II can obtain optimal beamforming weights for each frame and achieve better noise reduction. For the purely NN system without the beamforming module, we only multiply the estimated speech cRF with the 0-th channel of Y (as in Eq. (5)) to obtain the final enhanced speech; it shows the worst WER (22.07%) among all systems due to the non-linear distortion introduced by the purely NN processing.

Fig. 2 shows example spectrograms and beam patterns of the MVDR and the proposed GRNN-BF-II. More demos (including real-world recording demos) can be found at: https://yongxuustc.github.io/grnnbf. The traditional cRF/mask based MVDR method has more residual noise (shown in the dashed rectangle), and its corresponding beam pattern is not sharp enough.
However, GRNN-BF-II can significantly reduce most of the residual noise and keep the separated speech nearly distortionless (also demonstrated by its lowest WER of 11.86% in Table 1). The sharp beam pattern (selected from one of the frames) of the proposed GRNN-BF-II in Fig. 2 also reflects its noise reduction capability.
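Beam patterns like those in Fig. 2 can be computed from a fixed set of weights at one frequency. A sketch for a far-field linear array; the array geometry, angle grid and function names are assumptions for illustration:

```python
import numpy as np

def beam_pattern(w, mic_pos, freq, angles, c=343.0):
    """Response magnitude |w^H v(theta)| of beamforming weights w over a
    grid of DOAs, at a single frequency, for a far-field linear array.
    mic_pos: (C,) positions in meters, angles: (A,) DOAs in radians."""
    delays = np.outer(np.cos(angles), mic_pos) / c   # (A, C) per-mic delays
    V = np.exp(-2j * np.pi * freq * delays)          # steering vectors
    return np.abs(V @ np.conj(w))
```

As a quick check, delay-and-sum weights steered at a given DOA yield a pattern that peaks (with unit response) at that DOA, mirroring how the sharp GRNN-BF-II pattern at 1343 Hz is read off in Fig. 2.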
Fig. 2. Spectrograms and beam patterns of different systems: the mixture (2-speaker overlapped speech + non-stationary noise), the reverberant clean reference, the MVDR output and the proposed GRNN-BF-II output, with the beam patterns of the MVDR and the proposed GRNN-BF-II at 1343 Hz (target DOA = 110°).
5. CONCLUSIONS AND FUTURE WORK
In summary, we proposed three types of generalized RNN beamformers (GRNN-BFs). The GRNN-BFs go beyond our previous ADL-MVDR work in both the beamforming solution and the model structure design. We conclude that the calculated speech and noise covariance matrices are the most important factors, and that there is no need to follow the formula of any traditional beamformer. The GRNN-BF can automatically accumulate and update the covariance matrices from the history frames and predict the frame-wise beamforming weights directly. The proposed GRNN-BF-II achieves the best objective scores and the lowest WER among all systems. In future work, we will study the mathematical principles behind the GRNN-BFs and investigate the GRNN-BF's capability for joint separation and dereverberation.

REFERENCES

[1] Jahn Heymann, Lukas Drude, et al., "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP, 2016.
[2] Hakan Erdogan, John R. Hershey, et al., "Improved MVDR beamforming using single-channel mask prediction networks," in Interspeech, 2016.
[3] Jahn Heymann, Lukas Drude, et al., "BEAMNET: End-to-end training of a beamformer-supported multi-channel ASR system," in ICASSP, 2017.
[4] Xiong Xiao et al., "On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition," in ICASSP, 2017.
[5] Yong Xu et al., "Neural spatio-temporal beamformer for target speech separation," in Interspeech, 2020.
[6] Yuxuan Wang, Arun Narayanan, and DeLiang Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. on ASLP, vol. 22, no. 12, pp. 1849–1858, 2014.
[7] Yi Luo and Nima Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. on ASLP, vol. 27, no. 8, pp. 1256–1266, 2019.
[8] Dong Yu, Morten Kolbæk, et al., "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in ICASSP, 2017.
[9] Yong Xu, Jun Du, et al., "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. on ASLP, vol. 23, no. 1, pp. 7–19, 2014.
[10] Jun Du, Qing Wang, et al., "Robust speech recognition with speech enhanced deep neural networks," in Interspeech, 2014.
[11] Zhong-Qiu Wang et al., "Sequential multi-frame neural beamforming for speech separation and enhancement," arXiv preprint arXiv:1911.07953, 2019.
[12] Yuki Kubo, Tomohiro Nakatani, et al., "Mask-based MVDR beamformer for noisy multisource environments: introduction of time-varying spatial covariance model," in ICASSP, 2019.
[13] Zhuohuang Zhang, Yong Xu, et al., "ADL-MVDR: All deep learning MVDR beamformer for target speech separation," arXiv preprint arXiv:2008.06994, 2020.
[14] Xiong Xiao, Chenglin Xu, et al., "A study of learning based beamforming methods for speech recognition," in CHiME 2016 workshop, 2016.
[15] Yunong Zhang and Shuzhi Sam Ge, "Design and analysis of a general recurrent neural network model for time-varying matrix inversion," IEEE Transactions on Neural Networks, vol. 16, no. 6, pp. 1477–1490, 2005.
[16] Jun Wang, "A recurrent neural network for real-time matrix inversion," Applied Mathematics and Computation, vol. 55, no. 1, pp. 89–100, 1993.
[17] Lijun Liu, Hongmei Shao, et al., "Recurrent neural network model for computing largest and smallest generalized eigenvalue," Neurocomputing, vol. 71, no. 16-18, pp. 3589–3594, 2008.
[18] Xuezhong Wang, Maolin Che, et al., "Recurrent neural network for computation of generalized eigenvalue problem with real diagonalizable matrix pair and its applications," Neurocomputing, vol. 216, pp. 230–241, 2016.
[19] Tim Van den Bogaert, Simon Doclo, et al., "Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids," The Journal of the Acoustical Society of America, vol. 125, no. 1, pp. 360–371, 2009.
[20] François Grondin, Jean-Samuel Lauzon, et al., "GEV beamforming supported by DOA-based masks generated on pairs of microphones," arXiv preprint arXiv:2005.09587, 2020.
[21] Wolfgang Mack and Emanuël A. P. Habets, "Deep filtering: Signal extraction and reconstruction using complex time-frequency filters," IEEE Signal Processing Letters, vol. 27, pp. 61–65, 2019.
[22] Donald S. Williamson, Yuxuan Wang, et al., "Complex ratio masking for monaural speech separation," IEEE/ACM Trans. on ASLP, vol. 24, no. 3, pp. 483–492, 2015.
[23] Geoffrey E. Hinton and Ruslan R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[24] Ke Tan, Yong Xu, et al., "Audio-visual speech separation and dereverberation with a two-stage multimodal network," IEEE Journal of Selected Topics in Signal Processing, 2020.
[25] Rongzhi Gu, Shi-Xiong Zhang, et al., "Multi-modal multi-channel target speech separation," IEEE Journal of Selected Topics in Signal Processing, 2020.
[26] Shi-Xiong Zhang, Yong Xu, et al., "M: Multi-Modal Multi-channel dataset for cocktail party problems," to be released, 2020.
[27] Emanuël A. P. Habets, "Room impulse response generator," Technische Universiteit Eindhoven, Tech. Rep., vol. 2, no. 2.4, 2006.
[28] "Tencent ASR," https://ai.qq.com/product/aaiasr.shtml.
[29] Zhuo Chen, Xiong Xiao, et al., "Multi-channel overlapped speech recognition with location guided speech extraction network," in