End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend
Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian
MoE Key Lab of Artificial Intelligence, AI Institute, SpeechLab, Shanghai Jiao Tong University, China
Paderborn University, Germany
Johns Hopkins University, USA
NTT Corporation, Japan
ABSTRACT
Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel multi-speaker reverberant condition, and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks, including voice activity detection (VAD)-like masks. The techniques significantly stabilize the end-to-end training process. Experiments on the spatialized wsj1-2mix corpus show that the proposed system achieves about 35% relative WER reduction compared to our conventional multi-channel E2E ASR system, and also obtains decent speech dereverberation and separation performance (SDR = 12.5 dB) in the reverberant multi-speaker condition while trained only with the ASR criterion.
Index Terms — Neural beamformer, overlapped speech recognition, dereverberation, speech separation, cocktail party problem
1. INTRODUCTION
With the development of deep learning, much progress has been achieved in the speech processing field, including both speech enhancement [1–3] in the frontend and automatic speech recognition (ASR) [4–6] in the backend. In recent years, more and more interest has been focused on deep learning based speech processing in the cocktail party scenario [7, 8]. In this scenario, there are usually multiple speakers talking simultaneously, often with background noise and reverberation present. It is much more difficult to cope with than clean and anechoic conditions, and ASR performance in such conditions is still far behind that of humans.

In the cocktail party scenario, while it is straightforward to combine separately trained speech enhancement and speech recognition components into one system, as investigated in many prior studies [9, 10], the end-to-end (E2E) optimization of all involved components is also an important and interesting research topic. The E2E system can naturally reduce the mismatch between different components through joint training. In addition, only the noisy signal and the corresponding transcriptions are required for the E2E training of both frontend and backend, making data collection and model training much easier in real applications. Some prior work has illustrated the potential of E2E optimized systems. Settle et al. [11] proposed a joint training framework, combining the chimera++ network [12] and end-to-end ASR [13] for single-channel multi-speaker speech separation and recognition. In the multichannel condition, neural beamformer based speech enhancement [14, 15] is often applied to better utilize the spatial information. (1) Single-speaker cases:
In [16–18], the neural beamformer is jointly trained with the acoustic / end-to-end ASR model for denoising and speech recognition. Subramanian et al. [19] further included dereverberation in the joint training, based on the weighted prediction error (WPE) [20] algorithm. (2) Multi-speaker cases:
Chang et al. [21] proposed the MIMO-Speech architecture, where the beamformer is jointly trained with ASR to perform speech separation.

In this paper, we aim to build a robust framework for the fully end-to-end optimization of dereverberation, beamforming (denoising and separation), and speech recognition. In our prior work [22], some preliminary attempts were made to explore the end-to-end training of three components: WPE-based dereverberation, neural beamforming, and end-to-end ASR. However, the well-known numerical instability issue [23] in the operations of both WPE and beamforming, usually caused by singularity in the matrix inverse operation, remained unsolved in [22], leading to performance degradation or even misleading the model convergence.

In this work, we tackle this problem by proposing four techniques to improve the stability and performance of the end-to-end system. These methods have proven extremely helpful in our setup, significantly mitigating the numerical instability issue during training. Based on these techniques, we propose a robust architecture that supports the end-to-end training of different beamformer variants and ASR, which are also compared in our experiments. In addition, the voice activity detection (VAD)-like mask [19, 24] for WPE and beamforming is introduced to mitigate the frequency permutation problem in end-to-end training, as described in Section 2.3. Our experiments on the spatialized wsj1-2mix [21] corpus show that the proposed approaches achieve significant performance improvement compared to the previous system.
2. END-TO-END FRAMEWORK FOR DEREVERBERATION, BEAMFORMING, AND ASR
In this section, we first describe the proposed architecture for end-to-end dereverberation, beamforming (denoising and separation), and ASR, together with the formulation of the different beamformer variants supported in the proposed framework. We then introduce the techniques applied to solve the numerical instability issue. Finally, the frequency permutation phenomenon and our solution are discussed.
2.1. Proposed architecture

Our proposed end-to-end architecture is shown in Fig. 1. It is comprised of two main modules: the frontend (speech enhancement) and the backend (ASR). Here, speech enhancement includes dereverberation, denoising, and source separation.

Fig. 1: Proposed new architecture for end-to-end training of the frontend and ASR backend. (Diagram: a learnable MaskNet feeds masks to WPE and to the MVDR / WPD / wMPDR beamformer inside the speech enhancement frontend; the separated streams $\{\hat{\mathbf{X}}^j\}_{j=1}^J$ go to the end-to-end ASR backend, which is trained with a PIT-based loss against the references $\{R^j\}_{j=1}^J$.)

In our previous article [22], we adopted a weighted power minimization distortionless response (WPD) convolutional beamformer [25] as a unified frontend, while the recent study [26] showed that WPD can be factorized into a WPE dereverberation filter and a weighted minimum power distortionless response (wMPDR) [26] beamformer without loss of optimality when they are jointly optimized. Therefore, in this article, we adopt the factorized form as a simpler alternative (the experimental result on WPD is also given in Table 2 for comparison). That is, the frontend is composed of a single mask estimator (MaskNet), a DNN-WPE [27] dereverberation module, and a beamformer module. In addition, we mainly support two alternative beamformer types, based respectively on 1) minimum variance distortionless response (MVDR) [28] and 2) wMPDR. While MVDR is a widely used state-of-the-art beamformer, wMPDR is shown to perform optimal processing jointly with WPE [26]. The ASR backend is a joint connectionist temporal classification (CTC) / attention-based encoder-decoder [13] model for recognizing the separated single-channel speech. Compared to our previous work [22], the proposed architecture can support different beamformer variants in a single framework, by using a single mask estimator for WPE / beamforming and applying single-source WPE for processing the speech of different sources.

Below we give a detailed description of the proposed system. Consider a multichannel input speech signal composed of $J$ speakers, $\mathbf{Y}_{t,f} = \{Y_{t,f,c}\}_{c=1}^{C} \in \mathbb{C}^{C}$. It can be described as follows in the short-time Fourier transform (STFT) domain:

$$\mathbf{Y}_{t,f} = \sum_{j=1}^{J} \mathbf{X}^{j}_{t,f} + \mathbf{N}_{t,f} = \sum_{j=1}^{J} \left(\mathbf{X}^{(d),j}_{t,f} + \mathbf{X}^{(r),j}_{t,f}\right) + \mathbf{N}_{t,f}, \qquad (1)$$
$$\mathbf{X}^{(d),j}_{t,f} = \sum_{\tau=0}^{\Delta-1} \mathbf{a}^{j}_{\tau,f}\, s^{j}_{t-\tau,f} \approx \mathbf{v}^{j}_{f}\, s^{j}_{t,f}, \qquad (2)$$
$$\mathbf{X}^{(r),j}_{t,f} = \sum_{\tau=\Delta}^{L_a} \mathbf{a}^{j}_{\tau,f}\, s^{j}_{t-\tau,f}, \qquad (3)$$

where $C > 1$ denotes the number of microphones, and $t \in \{1, \dots, T\}$ and $f \in \{1, \dots, F\}$ represent the indices of time and frequency bins. $\mathbf{N}$ denotes noise. $\mathbf{X}^{j}$ denotes the reverberant signal, which can be decomposed into an "early" part $\mathbf{X}^{(d),j}$ and a "late" part $\mathbf{X}^{(r),j}$: $\mathbf{X}^{(d),j}$ contains the direct path and early reflections of the $j$-th speaker, while $\mathbf{X}^{(r),j}$ denotes the late reverberation. $\mathbf{a}^{j}_{\tau,f}$ is the acoustic transfer function with length $L_a$, and $\Delta$ denotes the starting frame of the "late" part. $s^{j}$ is the $j$-th source signal, and $\mathbf{v}^{j}_{f} = \{v^{j}_{f,c}\}_{c=1}^{C} \in \mathbb{C}^{C}$ is the steering vector (SV).

The input signal is first processed by the frontend module for dereverberation and separation. First, the WPE submodule performs dereverberation separately for each source $j$, directly on the mixture $\mathbf{Y}$ in Eq. (1):

$$\{M^{j}_{\text{wpe}}\}_{j=1}^{J},\ \{M^{j}_{\text{bf,tgt}}\}_{j=1}^{J},\ \{M^{j}_{\text{bf,noise}}\}_{j=1}^{J} = \text{MaskNet}(\mathbf{Y}), \qquad (4)$$
$$\lambda^{j}_{t,f} = \frac{1}{C} \sum_{c=1}^{C} \frac{M^{j}_{\text{wpe},t,f,c}}{\frac{1}{T}\sum_{\tau=1}^{T} M^{j}_{\text{wpe},\tau,f,c}}\, |Y_{t,f,c}|^{2} \in \mathbb{R}, \qquad (5)$$
$$\hat{\mathbf{Y}}^{j} = \mathbf{Y}^{j}_{\text{wpe}} = \text{WPE}(\mathbf{Y}, \lambda^{j}) \in \mathbb{C}^{T \times F \times C}. \qquad (6)$$

Here, $M^{j}_{\text{wpe}} = \{M^{j}_{\text{wpe},t,f,c}\}_{t,f,c}$ denotes the estimated dereverberation mask, and $M^{j}_{\text{bf,tgt}}$ and $M^{j}_{\text{bf,noise}}$ denote the estimated speech mask and distortion mask for the $j$-th speaker, respectively. $\lambda^{j} = \{\lambda^{j}_{t,f}\}_{t,f}$ is the estimated time-varying power of the speech signal. $\text{WPE}(\cdot)$ represents the dereverberation filter computation based on the WPE algorithm described in [29]; the detailed formulas are omitted here for simplicity. The signal $\hat{\mathbf{Y}}^{j}$ is then denoised and separated by the neural beamformer.
Within the scope of this paper, although different beamformers are designed for different objectives with a linear constraint, their solutions can be uniformly written as:

$$\Phi^{j}_{\alpha,f} = \frac{\sum_{t=1}^{T} M^{j}_{t,f}\, \hat{\mathbf{Y}}^{j}_{t,f} \big(\hat{\mathbf{Y}}^{j}_{t,f}\big)^{\mathsf H}}{\sum_{t=1}^{T} M^{j}_{t,f}} \in \mathbb{C}^{C \times C}, \qquad (7)$$
$$\mathbf{w}^{j}_{f} = \frac{\big(\Phi^{j}_{N,f}\big)^{-1} \Phi^{j}_{S,f}}{\operatorname{Trace}\big[\big(\Phi^{j}_{N,f}\big)^{-1} \Phi^{j}_{S,f}\big]}\, \mathbf{u}, \qquad \text{[w/o SV]} \quad (8)$$
$$\mathbf{w}^{j}_{f} = \frac{\big(\Phi^{j}_{N,f}\big)^{-1} \mathbf{v}^{j}_{f}}{\big(\mathbf{v}^{j}_{f}\big)^{\mathsf H} \big(\Phi^{j}_{N,f}\big)^{-1} \mathbf{v}^{j}_{f}}\, \big(v^{j}_{f,q}\big)^{*}, \qquad \text{[w/ SV]} \quad (9)$$
$$\hat{X}^{j}_{t,f} = \big(\mathbf{w}^{j}_{f}\big)^{\mathsf H} \hat{\mathbf{Y}}^{j}_{t,f} \in \mathbb{C}, \qquad (10)$$

where $M^{j}_{t,f} = \frac{1}{C}\sum_{c=1}^{C} M^{j}_{t,f,c}$ is a channel-averaged mask with $M^{j}_{t,f,c} \in \{M^{j}_{\text{bf,tgt},t,f}, M^{j}_{\text{bf,noise},t,f}\}$, and $\Phi^{j}_{\alpha,f}$ is a covariance matrix with subscript $\alpha \in \{N, S, \text{noise}\}$. We set $M^{j}_{t,f} = M^{j}_{\text{bf,noise},t,f}$ for $\Phi^{j}_{\text{noise},f}$ and $M^{j}_{t,f} = M^{j}_{\text{bf,tgt},t,f}$ for $\Phi^{j}_{S,f}$. Similarly, we set $M^{j}_{t,f} = M^{j}_{\text{bf,noise},t,f}$ and $M^{j}_{t,f} = 1/\lambda^{j}_{t,f}$, respectively, for $\Phi^{j}_{N,f}$ of MVDR and wMPDR. $(\cdot)^{*}$ and $(\cdot)^{\mathsf H}$ denote conjugate and conjugate transpose, respectively. $\mathbf{w}^{j}_{f}$ is the beamforming filter for the $j$-th speaker, which can be calculated with either Eq. (8) [w/o SV] or Eq. (9) [w/ SV]. While Eq. (8) has been widely used for the E2E training of neural beamformers [17, 21], Eq. (9) is a standard equation for distortionless beamformers. $\mathbf{u}$ is a vector denoting the reference channel, which can be estimated by an attention mechanism [30], based on the average estimated a posteriori SNR [15], or manually set as a one-hot vector. The subscript $q$ denotes the reference channel index, and $\hat{X}^{j}_{t,f}$ is the beamformed signal. The steering vector $\mathbf{v}^{j}_{f}$ can be calculated through eigendecomposition [31]:

$$\mathbf{v}^{j}_{f} = \Phi^{j}_{\text{noise},f}\, \operatorname{MaxEigVec}\big[\big(\Phi^{j}_{\text{noise},f}\big)^{-1} \Phi^{j}_{S,f}\big] \in \mathbb{C}^{C}, \qquad (11)$$

where $\operatorname{MaxEigVec}[\cdot]$ calculates the eigenvector corresponding to the maximum eigenvalue. Due to the lack of complex eigendecomposition support in PyTorch at the time of writing, we replace it with the power iteration method [32], which can be easily implemented for back-propagation, with a slight loss of precision.

It is worth noting that, in the sense of end-to-end training, the MVDR and wMPDR beamformers are potentially equivalent. By substituting the $\Phi^{j}_{N,f}$ for wMPDR defined above into Eq. (8) or Eq. (9), we find that the averaging operation in the denominator of Eq. (5) is canceled. Thus the derived wMPDR filter depends only on the (inverted) mask predicted by the neural network, which is very similar to the MVDR formulation. So it is hard to tell which beamformer is actually learned by the network via end-to-end training.
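To illustrate Eqs. (7), (8), and (11), here is a PyTorch sketch of the masked covariance, the w/o-SV filter, and the steering vector obtained by power iteration. The frequency-batched (F, C, T) layout, the initialization of the iteration, and all function names are assumptions on our part, not the paper's code.

```python
import torch

def covariance(Y: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Eq. (7): mask-weighted spatial covariance.
    Y: (F, C, T) complex, mask: (F, T) real. Returns (F, C, C)."""
    num = torch.einsum("ft,fct,fdt->fcd", mask.to(Y.dtype), Y, Y.conj())
    return num / mask.sum(dim=-1).clamp(min=1e-8)[..., None, None]

def mvdr_wo_sv(phi_s, phi_n, u):
    """Eq. (8) [w/o SV]: w = (phi_n^{-1} phi_s / Trace[phi_n^{-1} phi_s]) u.
    A solve is used instead of an explicit inverse (see Section 2.2)."""
    numerator = torch.linalg.solve(phi_n, phi_s)             # (F, C, C)
    trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)     # (F,)
    w = numerator / trace[..., None, None]
    return torch.einsum("fcd,d->fc", w, u.to(w.dtype))       # (F, C)

def steering_vector(phi_noise, phi_s, iters: int = 2):
    """Eq. (11), with MaxEigVec replaced by power iteration [32]."""
    A = torch.linalg.solve(phi_noise, phi_s)                 # (F, C, C)
    v = torch.ones(A.shape[:-1], dtype=A.dtype, device=A.device)
    for _ in range(iters):                                   # 2 iters, as in Sec. 3.1
        v = torch.einsum("fcd,fd->fc", A, v)
        v = v / torch.linalg.vector_norm(v, dim=-1, keepdim=True).clamp(min=1e-8)
    return torch.einsum("fcd,fd->fc", phi_noise, v)          # (F, C)
```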
Finally, the separated stream $\hat{\mathbf{X}}^{j} = \{\hat{X}^{j}_{t,f}\}_{t,f}$ of each speaker $j$ is fed into the ASR backend for recognition. First, log Mel-filterbank coefficients $\mathbf{O}^{j} = \{\mathbf{o}^{j}_{1}, \dots, \mathbf{o}^{j}_{T}\}$ with global mean and variance normalization ($\text{GMVN-LMF}(\cdot)$) are extracted from $\hat{\mathbf{X}}^{j}$ and then transformed by the encoder into a high-level representation $\mathbf{H}^{j} = \{\mathbf{h}^{j}_{1}, \dots, \mathbf{h}^{j}_{L}\}$ ($L \leq T$) with subsampling. In order to solve the label ambiguity problem with multiple speakers ($J > 1$), the permutation invariant training (PIT) technique [33] is applied in the CTC module to determine the order of the label sequences. With the best permutation derived from CTC, the representation $\mathbf{H}^{j}$ is processed by the attention-based decoder to generate the output token sequence $\hat{R}^{j} = \{\hat{R}^{j}_{1}, \dots, \hat{R}^{j}_{N}\}$ with length $N$, while $R^{j}$ in Fig. 1 is the corresponding reference label. The speech recognition process for each speaker $j$ is formulated as follows:

$$\mathbf{O}^{j} = \text{GMVN-LMF}(|\hat{\mathbf{X}}^{j}|), \qquad (12)$$
$$\mathbf{H}^{j} = \text{Encoder}(\mathbf{O}^{j}), \qquad (13)$$
$$\hat{R}^{j}_{n} \sim \text{Attention-Decoder}(\mathbf{H}^{j}, \hat{R}^{j}_{n-1}), \qquad (14)$$

where $\hat{R}^{j}_{n}$ is the output token at the $n$-th decoding step. Note that the entire system is optimized solely with the ASR loss, which is a combination of the attention and CTC losses.
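As a concrete illustration of the PIT selection step, the sketch below picks the label order that minimizes the summed CTC losses, assuming the J x J pairwise losses have already been computed; the (J, J, B) layout and the function name are ours, not the authors' code.

```python
import torch
from itertools import permutations

def pit_ctc_select(ctc_losses: torch.Tensor):
    """Sketch of PIT [33] in the CTC module.

    ctc_losses: (J, J, B) tensor, where ctc_losses[j, k, b] is the CTC loss
    of separated stream j scored against reference k for batch element b.
    Returns the mean minimum loss and the best permutation per utterance.
    """
    J, _, B = ctc_losses.shape
    perms = list(permutations(range(J)))               # J! candidate orders
    # Total loss of each candidate permutation, per utterance.
    perm_losses = torch.stack(
        [sum(ctc_losses[j, p[j]] for j in range(J)) for p in perms])  # (J!, B)
    min_loss, best = perm_losses.min(dim=0)            # best order per utterance
    return min_loss.mean(), [perms[i] for i in best.tolist()]
```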
2.2. Techniques for improving numerical stability

The numerical instability issue has been a well-known problem for beamformers [34], especially when optimized in an end-to-end manner. The numerical problem generally originates from the complex-valued operations in the WPE and beamforming formulas, such as the complex matrix inverse, leading to poor performance in certain sparsely populated frequency bins. Such behaviors are particularly undesirable in joint training with ASR, as they can easily result in not-a-number (NaN) gradients that fail to backpropagate correctly and can even prevent the model from converging properly [22], thus badly impacting the overall model performance. In order to mitigate this problem, we propose four approaches to improve the stability of both the WPE and beamforming submodules.

(1) Diagonal loading. In order to stabilize the matrix inverse operation in WPE and beamforming in Eqs. (6), (8), (9), and (11), particularly in its backward pass, we introduce a diagonal loading [34] term as a perturbation to the complex matrix $\Phi$ before inversion:

$$\Phi' = \Phi + \varepsilon \operatorname{Trace}(\Phi)\, \mathbf{I}, \qquad (15)$$

where $\mathbf{I}$ is the identity matrix and $\varepsilon$ is a tiny constant. For better stabilization, $\operatorname{Trace}(\Phi)$ is used to make the term adaptive to the signal level, and $\varepsilon$ was set to a relatively large value for WPE in our experiments, as described in Section 3.1.

(2) Mask flooring. When optimizing masks with an implicit criterion, i.e. the ASR loss, we observed that the mask estimator learned to predict sparse or spiky masks. That is, the mask estimator sets only the most relevant time-frequency bins to one and the remaining ones to zero. This can result in a singular covariance matrix in some frequency bins, making the WPE / beamforming process unstable. To avoid spiky masks, we propose a mask flooring operation to regularize the masks in Eq. (4):

$$\hat{M}_{t,f} = \max\{M_{t,f}, \xi\}, \qquad (16)$$

where $\hat{M}_{t,f}$ denotes the floored mask value, $M_{t,f} \in \{M_{\text{wpe},t,f,c}, M_{\text{bf,tgt},t,f}, M_{\text{bf,noise},t,f}\}$, and $\xi$ is a constant flooring factor. The idea of the flooring is that enough values have to be nonzero to reduce the effect of the flooring value, so the mask estimator is prevented from predicting sparse or spiky masks.
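Both regularizers are cheap elementwise or diagonal operations. A minimal sketch, assuming batched complex covariance tensors of shape (..., C, C); the function names are ours:

```python
import torch

def diagonal_loading(phi: torch.Tensor, eps: float) -> torch.Tensor:
    """Eq. (15): phi' = phi + eps * Trace(phi) * I, batched over (..., C, C).
    Scaling by the trace makes the perturbation adaptive to the signal level."""
    C = phi.shape[-1]
    trace = phi.diagonal(dim1=-2, dim2=-1).sum(-1).real      # (...,)
    eye = torch.eye(C, dtype=phi.dtype, device=phi.device)
    return phi + eps * trace[..., None, None] * eye

def mask_flooring(mask: torch.Tensor, xi: float) -> torch.Tensor:
    """Eq. (16): floor every T-F bin at xi to prevent spiky masks."""
    return mask.clamp(min=xi)
```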
(3) More stable complex matrix operations. Due to the lack of complex-valued support in PyTorch, the alternative method in Section 4.3 of [35] was used in our previous work [22], which tries to find a factor to construct an invertible real matrix and maps the complex inversion to real matrix operations. However, it sometimes fails due to a poor estimate of the factor, resulting in a singular matrix. In this paper, a more stable matrix inverse formula [36] is implemented, which converts the problem of the complex matrix inverse $\Phi^{-1} = (\mathbf{A} + i\mathbf{B})^{-1} \in \mathbb{C}^{m \times m}$ into the inverse of a $2m \times 2m$ real matrix:

$$\begin{bmatrix} \mathbf{A} & \mathbf{B} \\ -\mathbf{B} & \mathbf{A} \end{bmatrix}^{-1} = \begin{bmatrix} \Re\{\Phi^{-1}\} & \Im\{\Phi^{-1}\} \\ -\Im\{\Phi^{-1}\} & \Re\{\Phi^{-1}\} \end{bmatrix}, \qquad (17)$$

where $\Re\{\cdot\}$ and $\Im\{\cdot\}$ denote the real and imaginary parts of a complex matrix. Furthermore, we replace the inverse and subsequent multiplication operations in Eqs. (6), (8), and (9) with a solve operation, which directly computes the solution $\mathbf{x}$ to a linear matrix equation $\Phi \mathbf{x} = \mathbf{v}$, where $\mathbf{x}$ and $\mathbf{v}$ are $m$-dimensional vectors. This further improves the numerical accuracy and stability.

(4) Double precision. In terms of implementation, while end-to-end systems normally operate with single-precision data / parameters, we find it beneficial to use double precision for the complex-valued operations in the frontend module. It reduces the error caused by complex operations, such as the inversion of close-to-singular matrices, so the stability of the matrix-inverse-related operations is also improved. Similar effects are reported in [37], which proposes to jointly optimize WPE and the acoustic model.

With the above proposed techniques, we are now able to optimize the convolutional beamformer and ASR jointly, without the need for pretraining as in [22].
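The following sketch combines techniques (3) and (4): the real block-matrix inverse of Eq. (17), computed in double precision. It mirrors the workaround described above under our own assumptions about shapes; with a recent PyTorch one would simply call torch.linalg.inv, or better torch.linalg.solve, on the complex tensor directly.

```python
import torch

def complex_inverse_via_real(phi: torch.Tensor) -> torch.Tensor:
    """Eq. (17): invert an (..., m, m) complex matrix phi = A + iB through
    the corresponding (..., 2m, 2m) real block matrix [[A, B], [-B, A]]."""
    A, B = phi.real, phi.imag
    big = torch.cat([torch.cat([A, B], dim=-1),
                     torch.cat([-B, A], dim=-1)], dim=-2)   # (..., 2m, 2m)
    big = big.double()                                      # technique (4)
    inv = torch.linalg.inv(big)
    m = phi.shape[-1]
    # Per Eq. (17): top-left block is Re{phi^-1}, top-right is Im{phi^-1}.
    return torch.complex(inv[..., :m, :m], inv[..., :m, m:]).to(phi.dtype)
```

As noted above, replacing an explicit inverse followed by a multiplication with a single solve of $\Phi \mathbf{x} = \mathbf{v}$ avoids forming the inverse at all, which is where most of the additional accuracy comes from.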
2.3. Frequency permutation problem

During the end-to-end optimization of the frontend and backend, we often observed that beamformer outputs corresponding to different speakers are permuted with each other at certain frequencies. This is known as the frequency permutation problem [38]. It is probably caused by the fact that the beamforming filters are estimated independently at each frequency bin with the predicted time-frequency (T-F) masks, and that the log Mel-filterbank features used for evaluating the ASR loss are obtained by averaging frequency bins with a triangular window, which largely reduces the influence of permutation errors on the loss. This, however, is not optimal for speech enhancement in the frontend. To solve this problem, instead of using T-F masks in Eq. (4), we propose to use voice activity detection (VAD)-like masks [19, 24], which share the same (soft) value over the frequency axis, as sketched below. This mask will be shown to effectively mitigate the frequency permutation problem in our experiments.
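One plausible realization of the VAD-like mask is mean-pooling over frequency followed by broadcasting; the cited works [19, 24] may construct it differently, so this is only a sketch:

```python
import torch

def vad_like_mask(mask_tf: torch.Tensor) -> torch.Tensor:
    """Collapse a (T, F) T-F mask into a VAD-like mask that shares one
    soft value across the frequency axis, then broadcast it back."""
    per_frame = mask_tf.mean(dim=-1, keepdim=True)   # (T, 1): one value per frame
    return per_frame.expand_as(mask_tf)              # (T, F): frequency-constant
```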
3. EXPERIMENTS

3.1. Experimental setup
In this section, we evaluate our proposed framework on the artificially generated spatialized wsj1-2mix dataset [21], which contains anechoic and reverberant versions of multichannel two-speaker speech mixtures. We trained our models on a multi-condition training subset, including both the reverberant and anechoic training samples of the spatialized wsj1-2mix (98.5 hr × 2) and the WSJ train_si284 single-speaker clean data (81.5 hr, only for training ASR). Since the proposed framework jointly optimizes the frontend and backend with the ASR loss, no parallel clean data is required for training. The development and evaluation subsets only contain reverberant samples from the spatialized wsj1-2mix, with the development subset lasting 1.3 hr.

The mask estimator has J × 3 output layers (one per mask type in Eq. (4)), where the number of speakers is J = 2. The number of iterations for performing WPE is 1. During training, the number of channels C and WPE filter taps K are fixed to 2 and 5, respectively. In the ASR backend, we followed the same configurations as in [39]. ε in Eq. (15) is set to a relatively large value for WPE and a much smaller one for beamforming; the mask flooring factor ξ in Eq. (16) is likewise set separately for WPE and beamforming. The number of iterations for estimating the steering vector using the power iteration is set to 2. The reference channel q in Eq. (9) is set to 1. The Noam optimizer with 25000 warmup steps and an initial learning rate of 1.0 was used for training.

(The new implementations inverse2 and solve are now available at https://github.com/kamo-naoyuki/pytorch_complex.)

3.2. Evaluation results

We first evaluate the proposed techniques from Section 2.2 in both the previously used [22] and the proposed architectures, as shown in Table 1. For speech recognition, we use the word error rate (WER) for evaluation. The speech enhancement (SE) performance is evaluated using three common metrics: signal-to-distortion ratio (SDR) [40], short-time objective intelligibility (STOI) [41], and the perceptual evaluation of speech quality score (PESQ) [42], with the clean source signal from WSJ adopted as the reference.

Table 1: Evaluation of the proposed techniques with the WPE + MVDR + ASR model of different architectures on the spatialized reverberant wsj1-2mix evaluation set. The number of filter taps K and channels C are set to 5 and 2 for evaluation (same as training), respectively.

Architecture              WER (%)  PESQ  STOI  SDR (dB)
Original mixture          -        1.20  0.65  -1.45
Arch in [22]              21.88    1.12  0.62   1.23
+ (1) Diagonal loading    15.51    1.32  0.74
+ (2) Mask flooring       20.13    1.24  0.71   1.14
+ (3) Stable complex op.  15.70    1.31  0.74   3.05
+ (4) Double precision    18.06    1.27  0.73   1.99
+ Tech (1)–(4)            15.18    1.31  0.74   2.85
Proposed arch

In Table 1, we can observe that each of the proposed techniques brings significant performance improvement compared to the baseline architecture from [22], and the combination of the four techniques further achieves a better ASR result, with improved speech enhancement performance. This illustrates the effectiveness of the proposed approaches. The last row shows that, with the proposed techniques, the proposed architecture in Fig. 1 can also achieve comparable performance.

We then evaluate the proposed architectures with different beamformer variants under several configurations of the filter taps K, while the number of channels C is fixed to 6, and only present the best performance of each model in Table 2 due to limited space. We also present the best ASR results from [22] in rows 2 and 3 for comparison, and their SE performance is also evaluated.
(More detailed results can be found at https://speechlab.sjtu.edu.cn/members/wangyou-zhang/icassp21-material.pdf.)

Table 2: Evaluation of different beamformer variants and mask types on the spatialized reverberant wsj1-2mix evaluation set. "w/ SV" and "w/o SV" denote the formulas in Eqs. (9) and (8), i.e. with and without explicit use of the steering vector, respectively.
ID  Model (+ASR)    Formula  Mask  WER    PESQ  STOI  SDR
1   WSJ eval92 [4]  -        -      4.4   -     -     -
2   WPE+MVDR [22]   w/o SV   T-F   15.72  1.15  0.62   0.62
3   WPD [22]        w/o SV   T-F   13.97  1.33  0.68   0.38
4   MVDR            w/o SV   T-F   11.66  1.46  0.80   6.48
5   WPE+MVDR        w/o SV   T-F    9.50  1.56  0.83   7.73
6   WPE+wMPDR       w/o SV   T-F    9.44  1.63  0.82   8.49
7   WPE+MVDR        w/ SV    T-F
10  WPE+wMPDR       w/ SV    VAD   10.26  1.97  0.86  12.20

Comparing rows 2 & 5 and rows 3 & 6, we can observe that the proposed methods greatly improve the ASR and SE performances compared to the previous systems, which we attribute to the proposed techniques for mitigating the numerical instability issue. The performance gain from row 4 to row 5 indicates that the DNN-WPE submodule plays an important role in our proposed architecture. Comparing the second and third sections of Table 2, the MVDR and wMPDR beamformers show very similar results with either the formula in Eq. (8) [w/o SV] or Eq. (9) [w/ SV]. This also indicates the potential equivalence of these beamformers in end-to-end training, as mentioned in Section 2.1, and the latter formula tends to yield better ASR results with end-to-end training. When comparing the second and the last sections of Table 2, we find that the proposed VAD-like masks are beneficial for the SE performance, with clear improvements in PESQ, STOI, and SDR. This indicates that the VAD-like mask can effectively mitigate the frequency permutation problem, thus improving the SE performance. Since the evaluation set is generated from the WSJ eval92 subset, the first row in Table 2 can be regarded as the topline for our system, and we can observe that the proposed models with different beamformer variants all achieve very good ASR performance, with only a ∼5% higher WER than the topline on WSJ.
4. CONCLUSIONS
In this paper, we propose a robust framework for the end-to-end training of dereverberation, beamforming (denoising and separation), and speech recognition. Four techniques are proposed to regularize and stabilize the WPE / beamforming process in the frontend module, which are shown to effectively improve the numerical stability. Different beamformer variants and mask types are compared within the proposed framework. Our experiments on the spatialized wsj1-2mix corpus show that the proposed end-to-end system can achieve fairly good ASR results, with decent speech enhancement performance in the reverberant multi-speaker condition, while optimized only with the ASR criterion. In future work, we would like to investigate end-to-end training in realistic and more challenging conditions.
5. ACKNOWLEDGEMENT
Wangyou Zhang and Yanmin Qian were supported by the China NSFC projects (No. 62071288 and U1736202). The work reported here was started at JSALT 2020 at JHU, with support from Microsoft, Amazon, and Google. Experiments were carried out on the PI supercomputers at Shanghai Jiao Tong University.
6. REFERENCES

[1] Y. Liu and D. Wang, "Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation," IEEE/ACM Trans. ASLP, vol. 27, no. 12, pp. 2092–2102, 2019.
[2] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in Proc. IEEE ICASSP, 2020, pp. 46–50.
[3] N. Zeghidour and D. Grangier, "Wavesplit: End-to-end speech separation by speaker clustering," arXiv:2002.08933, 2020.
[4] S. Karita et al., "A comparative study on transformer vs RNN in speech applications," in Proc. IEEE ASRU, 2019, pp. 449–456.
[5] Z. Tüske et al., "Single headed attention based sequence-to-sequence model for state-of-the-art results on switchboard-300," arXiv:2001.07263, 2020.
[6] D. S. Park et al., "Improved noisy student training for automatic speech recognition," arXiv:2005.09629, 2020.
[7] E. C. Cherry, "Some experiments on the recognition of speech, with one and with two ears," The Journal of the Acoustical Society of America, vol. 25, no. 5, pp. 975–979, 1953.
[8] Y. Qian, C. Weng, X. Chang, S. Wang, and D. Yu, "Past review, current progress, and challenges ahead on the cocktail party problem," Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 40–63, 2018.
[9] J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, "BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge," in Proc. IEEE ASRU, 2015, pp. 444–451.
[10] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," in Proc. ISCA Interspeech, 2016, pp. 545–549.
[11] S. Settle, J. Le Roux, T. Hori, S. Watanabe, and J. R. Hershey, "End-to-end multi-speaker speech recognition," in Proc. IEEE ICASSP, 2018, pp. 4819–4823.
[12] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, "Alternative objective functions for deep clustering," in Proc. IEEE ICASSP, 2018, pp. 686–690.
[13] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. IEEE ICASSP, 2017, pp. 4835–4839.
[14] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in Proc. IEEE ICASSP, Mar. 2016, pp. 196–200.
[15] H. Erdogan et al., "Improved MVDR beamforming using single-channel mask prediction networks," in Proc. ISCA Interspeech, 2016, pp. 1981–1985.
[16] X. Xiao et al., "Deep beamforming networks for multi-channel speech recognition," in Proc. IEEE ICASSP, 2016, pp. 5745–5749.
[17] T. Ochiai et al., "Unified architecture for multichannel end-to-end speech recognition with neural beamforming," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1274–1288, 2017.
[18] J. Heymann et al., "Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system," in Proc. IEEE ICASSP, 2017, pp. 5325–5329.
[19] A. S. Subramanian et al., "An investigation of end-to-end multichannel speech recognition for reverberant and mismatch conditions," arXiv:1904.09049, 2019.
[20] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 7, pp. 1717–1731, 2010.
[21] X. Chang, W. Zhang, Y. Qian, J. Le Roux, and S. Watanabe, "MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition," in Proc. IEEE ASRU, 2019, pp. 237–244.
[22] W. Zhang et al., "End-to-end far-field speech recognition with unified dereverberation and beamforming," in Proc. ISCA Interspeech, 2020, pp. 324–328.
[23] C. Y. Lim, C.-H. Chen, and W.-Y. Wu, "Numerical instability of calculating inverse of spatial covariance matrices," Statistics & Probability Letters, vol. 129, pp. 182–188, 2017.
[24] Y.-H. Tu et al., "An iterative mask estimation approach to deep learning based multi-channel speech recognition," Speech Communication, vol. 106, pp. 31–43, 2019.
[25] T. Nakatani and K. Kinoshita, "A unified convolutional beamformer for simultaneous denoising and dereverberation," IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903–907, 2019.
[26] C. Boeddeker et al., "Jointly optimal dereverberation and beamforming," in Proc. IEEE ICASSP, 2020, pp. 216–220.
[27] K. Kinoshita et al., "Neural network-based spectrum estimation for online WPE dereverberation," in Proc. ISCA Interspeech, 2017, pp. 384–388.
[28] B. D. Van Veen and K. M. Buckley, "Beamforming: A versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988.
[29] L. Drude et al., "NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing," in 13. ITG Fachtagung Sprachkommunikation (ITG 2018), Oct. 2018.
[30] T. Ochiai et al., "Multichannel end-to-end speech recognition," in Proc. ICML, 2017, pp. 2632–2641.
[31] N. Ito, S. Araki, M. Delcroix, and T. Nakatani, "Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments," in Proc. IEEE ICASSP, 2017, pp. 681–685.
[32] R. Mises and H. Pollaczek-Geiringer, "Praktische Verfahren der Gleichungsauflösung," ZAMM—Zeitschrift für Angewandte Mathematik und Mechanik, vol. 9, no. 2, pp. 152–164, 1929.
[33] D. Yu et al., "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in Proc. IEEE ICASSP, 2017, pp. 241–245.
[34] S. Chakrabarty and E. A. Habets, "On the numerical instability of an LCMV beamformer for a uniform linear array," IEEE Signal Processing Letters, vol. 23, no. 2, pp. 272–276, 2015.
[35] K. B. Petersen and M. S. Pedersen, "The matrix cookbook," Technical Univ. Denmark, Tech. Rep., vol. 3274, 2012.
[36] W. Smith and S. Erdman, "A note on the inversion of complex matrices," IEEE Transactions on Automatic Control, vol. 19, no. 1, pp. 64–64, 1974.
[37] J. Heymann et al., "Joint optimization of neural network-based WPE dereverberation and acoustic model for robust online ASR," in Proc. IEEE ICASSP, 2019, pp. 6655–6659.
[38] M. Z. Ikram et al., "A beamforming approach to permutation alignment for multichannel frequency-domain blind speech separation," in Proc. IEEE ICASSP, 2002, pp. 881–884.
[39] X. Chang, W. Zhang, Y. Qian, J. Le Roux, and S. Watanabe, "End-to-end multi-speaker speech recognition with transformer," in Proc. IEEE ICASSP, 2020, pp. 6129–6133.
[40] C. Févotte et al., "BSS-EVAL Toolbox User Guide—Revision 2.0," IRISA, Tech. Rep. 1706, Apr. 2005.
[41] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. ASLP, vol. 24, no. 11, pp. 2009–2022, 2016.
[42] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE ICASSP, 2001, pp. 749–752.