Continuous Speech Separation with Conformer

Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou

Microsoft Corporation
Emails: {v-sanych, yuwu1, zhuc, wujian, jinyli, tayoshio, v-chengw, shujliu, mingzhou}@microsoft.com
ABSTRACT
Continuous speech separation was recently proposed to deal with overlapped speech in natural conversations. While it was shown to significantly improve the speech recognition performance for multi-channel conversation transcription, its effectiveness has yet to be proven for a single-channel recording scenario. This paper examines the use of the Conformer architecture in lieu of recurrent neural networks for the separation model. Conformer allows the separation model to efficiently capture both local and global context information, which is helpful for speech separation. Experimental results using the LibriCSS dataset show that the Conformer separation model achieves state-of-the-art results for both single-channel and multi-channel settings. Results for real meeting recordings are also presented, showing significant performance gains in both word error rate (WER) and speaker-attributed WER.
Index Terms — Multi-speaker ASR, Transformer, Conformer, Continuous speech separation
1. INTRODUCTION
The advance in deep learning has drastically improved the accuracy and robustness of modern automatic speech recognition (ASR) systems in the past decade [1, 2, 3, 4], enabling various voice-based applications. However, when applied to acoustically and linguistically complicated scenarios such as conversation transcription [5, 6], ASR systems still suffer from performance limitations due to overlapped speech and quick speaker turn-taking, which break the usually assumed single active speaker condition. Additionally, the overlapped speech causes the so-called permutation problem [7], further increasing the difficulty of conversation transcription.

Speech separation is often applied as a remedy for this problem, where the mixed speech is processed by a specially trained separation network before ASR. Starting from deep clustering (DC) [7] and permutation invariant training (PIT) [8, 9], various separation models have been shown effective in handling overlapped speech [6, 10, 11, 12]. Among the network architectures proposed thus far, the Transformer [12] based approach achieved a promising result. Transformer was first introduced for machine translation [13] and later extended to speech processing [14]. A Transformer based speech separation architecture was proposed in [12], achieving the state-of-the-art separation quality on the WSJ0-2mix dataset. It was also reported in [15] that incorporating Transformer into an end-to-end multi-speaker recognition network yielded higher recognition accuracy. However, both studies were evaluated on artificially simulated data sets that only considered overlapped speech, assuming the utterance boundaries to be provided, which significantly differs from the real conversational transcription scenario [6, 16].

In this work, inspired by the recent advances in transducer-based end-to-end ASR modeling, which has evolved from a recurrent neural network (RNN) transducer [17] to Transformer [18] and Conformer [19] transducers, we examine the use of the Conformer architecture for continuous speech separation (CSS) [20]. Unlike the prior speech separation studies, in CSS, the separation network continuously receives a mixed speech signal, performs separation, and routes each separated utterance to one of its output channels in a way that each output channel contains overlap-free signals. This allows a standard ASR system trained with single speaker utterances to be directly applied to each output channel to generate transcriptions. The proposed system is evaluated by using the LibriCSS dataset [16], which consists of real recordings of long-form multi-talker sessions that were created by concatenating and mixing LibriSpeech utterances with various overlap ratios. Our proposed network significantly outperforms the RNN-based baseline systems, achieving the new state-of-the-art performance on this dataset. Evaluation results on real meetings are also presented along with tricks for further performance improvement.
2. APPROACH

2.1. Problem Formulation
The goal of speech separation is to estimate individual speaker signals from their mixture, where the source signals may be overlapped with each other wholly or partially. The mixed signal is formulated as $y(t) = \sum_{s=1}^{S} x_s(t)$, where $t$ is the time index, $x_s(t)$ denotes the $s$-th source signal, and $y(t)$ is the mixed signal. Following [20], when $C$ microphones are available, the input to the separation model can be obtained as

$\mathbf{Y}(t,f) = Y_1(t,f) \oplus \mathrm{IPD}(2) \oplus \dots \oplus \mathrm{IPD}(C), \quad (1)$

where $\oplus$ means a concatenation operation, $Y_i(t,f)$ refers to the STFT of the $i$-th channel, and $\mathrm{IPD}(i)$ is the inter-channel phase difference between the $i$-th channel and the first channel, i.e. $\mathrm{IPD}(i) = \theta_i(t,f) - \theta_1(t,f)$ with $\theta_i(t,f)$ being the phase of $Y_i(t,f)$. These features are normalized along the time axis. If $C = 1$, the task reduces to single-channel speech separation.

Following [21, 22], a group of masks $\{M_s(t,f)\}_{1 \le s \le S}$ is estimated with a deep learning model $f(\cdot)$ instead of $f$ directly predicting the source STFTs. Each source STFT, $X_s(t,f)$, is obtained as $M_s(t,f) \odot Y(t,f)$, where $\odot$ is an elementwise product. For the multi-channel setting, the source signals are obtained with adaptive minimum variance distortionless response (MVDR) beamforming [23]. In this paper, we employ the Conformer structure [19] as $f(\cdot)$ to estimate the masks for (continuous) speech separation.

Fig. 1. Conformer architecture. There are three mask outputs, two for speakers and one for noise.

2.2. Conformer

Conformer [19] is a state-of-the-art ASR encoder architecture, which inserts a convolution layer into a Transformer block to increase the local information modeling capability of the traditional Transformer model [13]. The architecture of the Conformer is shown in Fig. 1, where each block consists of a self-attention module, a convolution module, and a macaron-feedforward module. A chunk of $\mathbf{Y}(t,f)$ over time frames and frequency bins is the input of the first block. Suppose that the input to the $i$-th block is $z$; the $i$-th block output is calculated as

$\hat{z} = z + \tfrac{1}{2}\mathrm{FFN}(z) \quad (2)$
$z' = \mathrm{selfattention}(\hat{z}) + \hat{z} \quad (3)$
$z'' = \mathrm{conv}(z') + z' \quad (4)$
$\mathrm{output} = \mathrm{layernorm}(z'' + \tfrac{1}{2}\mathrm{FFN}(z'')), \quad (5)$

where $\mathrm{FFN}(\cdot)$, $\mathrm{selfattention}(\cdot)$, $\mathrm{conv}(\cdot)$, and $\mathrm{layernorm}(\cdot)$ denote the feed forward network, self-attention module, convolution module, and layer normalization, respectively. In the self-attention module, $\hat{z}$ is linearly converted to $Q$, $K$, $V$ with three different parameter matrices. Then, we apply a multi-head self-attention mechanism

$\mathrm{Multihead}(Q, K, V) = [H_1 \dots H_{d_{head}}] W_{head} \quad (6)$
$H_i = \mathrm{softmax}\left(\frac{Q_i (K_i + pos)^\top}{\sqrt{d_k}}\right) V_i, \quad (7)$

where $d_k$ is the dimensionality of the feature vector and $d_{head}$ is the number of attention heads. $pos = \{rel_{m,n}\} \in \mathbb{R}^{M \times M \times d_k}$ is the relative position embedding [24], where $M$ is the maximum chunk length and $rel_{m,n} \in \mathbb{R}^{d_k}$ is a vector representing the offset of $m$ and $n$, with $m$ and $n$ denoting the $m$-th vector of $Q_i$ and the $n$-th vector of $K_i$, respectively. The convolution module starts with a pointwise convolution and a gated linear unit (GLU), followed by a 1-D depthwise convolution layer with a Batchnorm [25] and a Swish activation.
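To make Eqs. (2)–(5) concrete, the following PyTorch sketch implements one such block. It is an illustration rather than the authors' implementation: it uses the standard `torch.nn.MultiheadAttention`, so the relative position term $pos$ of Eq. (7) is omitted, the per-module pre-normalizations are dropped for brevity, and the default hyperparameters only mirror the Conformer-base configuration described in Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlock(nn.Module):
    """Minimal sketch of Eqs. (2)-(5); names and defaults are illustrative."""
    def __init__(self, d_model=256, n_heads=4, d_ffn=1024, kernel_size=33):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(d_model, d_ffn), nn.SiLU(),
                                  nn.Linear(d_ffn, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Convolution module: pointwise conv + GLU, 1-D depthwise conv,
        # BatchNorm, and a Swish (SiLU) activation, as described in the text.
        self.pointwise_in = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pointwise_out = nn.Conv1d(d_model, d_model, 1)
        self.ffn2 = nn.Sequential(nn.Linear(d_model, d_ffn), nn.SiLU(),
                                  nn.Linear(d_ffn, d_model))
        self.norm = nn.LayerNorm(d_model)

    def conv_module(self, x):                    # x: (batch, time, d_model)
        y = x.transpose(1, 2)                    # (batch, d_model, time)
        y = F.glu(self.pointwise_in(y), dim=1)   # pointwise conv + GLU
        y = F.silu(self.bn(self.depthwise(y)))   # depthwise conv, BN, Swish
        return self.pointwise_out(y).transpose(1, 2)

    def forward(self, z):
        z_hat = z + 0.5 * self.ffn1(z)               # Eq. (2)
        a, _ = self.attn(z_hat, z_hat, z_hat)        # Eq. (3): self-attention
        z1 = a + z_hat
        z2 = self.conv_module(z1) + z1               # Eq. (4)
        return self.norm(z2 + 0.5 * self.ffn2(z2))   # Eq. (5)
```

For example, `ConformerBlock()(torch.randn(4, 150, 256))` returns a tensor of the same shape; the full separation model stacks 16 or 18 such blocks and, as described next, converts the final output into masks with sigmoid FFN heads.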
After obtaining the Conformer output, we further convert it to a mask matrix as $M_s(t,f) = \mathrm{sigmoid}(\mathrm{FFN}_s(\mathrm{output}))$.

2.3. Chunk-wise processing

Speech overlap usually takes place in natural conversations, which may last for tens of minutes or longer. To deal with such long input signals, CSS generates a predefined number of signals where overlapped utterances are separated and then routed to different output channels.

Fig. 2. Chunk-wise processing is employed to enable streaming processing for continuous speech separation.

To enable this, we employ the chunk-wise processing proposed in [26] at test time. A sliding window is applied as illustrated in Fig. 2, which contains three sub-windows representing the history ($N_h$ frames), the current segment ($N_c$ frames), and the future context ($N_f$ frames). We move the window position forward by $N_c$ frames each time and compute the masks for the current $N_c$ frames using the whole $N$-frame-long chunk.

To further take account of history information beyond the current chunk, we also consider including the previous chunks in the self-attention module. Following Transformer-XL [27], Equation (7) is rewritten as

$\mathrm{softmax}\left(\frac{Q_i (K_i \oplus K_{cache,i} + pos)^\top}{\sqrt{d_k}}\right) (V_i \oplus V_{cache,i}), \quad (8)$

where $Q$ is obtained from the current chunk, while $K$ and $V$ are the concatenations of the previous and current chunks in the key and value spaces, respectively. The dimensionality of $K_{cache,i}$ depends on the number of history chunks considered.
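A minimal NumPy sketch of this sliding-window inference is given below. Here `separate` stands for a trained separation model that returns masks for one chunk; the zero-padding at the recording boundaries and the function names are assumptions of this illustration, and the key/value cache of Eq. (8) is omitted.

```python
import numpy as np

def css_stream(Y, separate, n_h, n_c, n_f, n_out=3):
    """Chunk-wise CSS inference over a long recording (cf. Fig. 2).

    Y: (T, F) input feature frames; separate(chunk) -> (n_out, N, F) masks
    for one N = n_h + n_c + n_f frame chunk (two speaker masks + one noise
    mask by default). Returns full-length masks of shape (n_out, T, F)."""
    T, F = Y.shape
    n = n_h + n_c + n_f
    masks = np.zeros((n_out, T, F))
    for start in range(0, T, n_c):            # advance the window by N_c
        lo, hi = start - n_h, start + n_c + n_f
        chunk = np.zeros((n, F))               # zero-pad at the boundaries
        src_lo, src_hi = max(lo, 0), min(hi, T)
        chunk[src_lo - lo : src_hi - lo] = Y[src_lo:src_hi]
        m = separate(chunk)                    # (n_out, n, F) chunk masks
        keep = min(n_c, T - start)             # keep only the current frames
        masks[:, start : start + keep] = m[:, n_h : n_h + keep]
    # Output-channel permutation is assumed consistent across chunks here;
    # in practice adjacent chunks are typically aligned using their
    # overlapping frames [26]. Apply as X_s = M_s * Y (Section 2.1).
    return masks
```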
3. EXPERIMENT

3.1. Datasets
Our training dataset consists of 219 hours of artificially reverberated and mixed utterances sampled randomly from WSJ1 [29]. Four different mixture types described in [20] are included in the training set. To generate each training mixture, we randomly pick one or two speakers from WSJ1 and convolve each with a 7-channel room impulse response (RIR) simulated with the image method [30]. The reverberated signals are then rescaled and mixed with a source energy ratio between -5 and 5 dB. In addition, we add simulated isotropic noise [31] with a 0–10 dB signal-to-noise ratio. The average overlap ratio of the training set is around 50%.

LibriCSS is used for evaluation [16]. The dataset has 10 hours of seven-channel recordings of mixed and concatenated LibriSpeech test utterances. The recordings were made by playing back the mixed audio in a meeting room. Two evaluation schemes are used: utterance-wise evaluation and continuous input evaluation. In the former, the long-form recordings are segmented into individual utterances by using ground-truth time marks to evaluate the pure separation performance. In the continuous input evaluation, systems have to deal with the unsegmented recordings, and thus CSS is needed.
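As a concrete illustration of the mixing recipe above, the NumPy sketch below rescales an interfering source to a random energy ratio in [-5, 5] dB and adds isotropic noise at a 0–10 dB SNR. The function and variable names are illustrative; RIR convolution is assumed to have been applied already, and the partial-overlap mixture types of [20] (which additionally shift the sources in time) are not shown.

```python
import numpy as np

def mix_training_example(sources, noise, rng=np.random):
    """Sketch of one fully overlapped training mixture.

    sources: list of equal-length, already reverberated 1-D signals
    (one or two speakers); noise: simulated isotropic noise signal."""
    ref = sources[0]
    mix = ref.copy()
    for s in sources[1:]:
        # Rescale so the ref-to-source energy ratio is a random [-5, 5] dB.
        ratio_db = rng.uniform(-5.0, 5.0)
        gain = np.sqrt(np.sum(ref**2) / (np.sum(s**2) * 10 ** (ratio_db / 10)))
        mix = mix + gain * s
    # Add isotropic noise at a 0-10 dB signal-to-noise ratio.
    snr_db = rng.uniform(0.0, 10.0)
    g_n = np.sqrt(np.sum(mix**2) / (np.sum(noise**2) * 10 ** (snr_db / 10)))
    return mix + g_n * noise
```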
Table 1. Utterance-wise evaluation for seven-channel and single-channel settings. Two numbers in a cell denote %WER of the hybrid ASR model used in LibriCSS [16] and the E2E Transformer based ASR model [28]. Columns give the overlap ratio in %; 0S and 0L are utterances with short/long inter-utterance silence.

| System | 0S | 0L | 10 | 20 | 30 | 40 |
|---|---|---|---|---|---|---|
| No separation [16] | 11.8/5.5 | 11.7/5.2 | 18.8/11.4 | 27.2/18.8 | 35.6/27.7 | 43.3/36.6 |
| Seven-channel evaluation |
| BLSTM | –/– | –/– | –/3.9 | 11.9/– | –/– | –/– |
| Single-channel evaluation |
| BLSTM | 15.8/6.4 | 14.2/5.8 | 18.9/9.6 | 25.4/15.3 | 31.6/20.5 | 35.5/25.2 |
| Transformer-base | 13.2/5.5 | 12.3/5.2 | 16.5/8.3 | 21.8/12.1 | 26.2/15.6 | 30.6/19.3 |
| Transformer-large | 13.0/– | –/– | –/– | –/– | –/– | –/– |
| Conformer-base | 13.8/5.6 | 12.5/5.4 | 16.7/8.2 | 21.6/11.8 | 26.1/15.5 | 30.1/18.9 |
| Conformer-large | –/5.4 | 12.2/– | –/7.5 | –/– | –/13.8 | –/17.1 |

We use BLSTM and Transformers as our baseline speech separation models. The BLSTM model has three BLSTM layers with 1024 input dimensions and 512 hidden dimensions, resulting in 21.80M parameters. There are three masks, two for speakers and one for noise. The noise mask is used to enhance the beamforming [26]. We use three sigmoid projection layers to estimate each mask. Transformer-base and Transformer-large models, with 21.90M and 58.33M parameters, are our two Transformer-based baselines. The Transformer-base model consists of 16 Transformer encoder layers with 4 attention heads, 256 attention dimensions, and 2048 FFN dimensions. The Transformer-large model consists of 18 Transformer encoder layers with 8 attention heads, 512 attention dimensions, and 2048 FFN dimensions.

As with the Transformer baseline models, we experiment with two Conformer-based models, Conformer-base and Conformer-large. They have 22.07M and 58.72M parameters, respectively. The Conformer-base model consists of 16 Conformer encoder layers with 4 attention heads, 256 attention dimensions, and 1024 FFN dimensions. The Conformer-large model consists of 18 Conformer encoder layers with 8 attention heads, 512 attention dimensions, and 1024 FFN dimensions. Both Conformer and Transformer are trained with the AdamW optimizer [32], where the weight decay is set to 1e-2. We set the learning rate to 1e-4 and use a warm-up learning schedule with a linear decay, in which the warm-up step is 10,000 and the training step is 260,000.

We use two ASR models to evaluate the speech separation accuracy. One is the ASR model used in the original LibriCSS publication [16], which is a hybrid system using a BLSTM acoustic model and a 4-gram language model. The other is one of the best open-source end-to-end Transformer based ASR models [28], which achieves 2.08% and 4.95% word error rates (WERs) for LibriSpeech test-clean and test-other, respectively. Following [16], we generate the separated speech signals with spectral masking and mask-based adaptive minimum variance distortionless response (MVDR) beamforming for the single-channel and seven-channel cases, respectively (a sketch of the beamformer is given after the discussion below). For a fair comparison, we follow the LibriCSS setting for chunk-wise CSS processing, where $N_h$, $N_c$, and $N_f$ are set to 1.2 s, 0.8 s, and 0.4 s, respectively.

Table 1 shows the WERs of the utterance-wise evaluation for the seven-channel and single-channel settings. Our Conformer models achieved state-of-the-art results. Compared with BLSTM, Conformer-base yielded substantial WER gains for the seven-channel setting. The fact that the Conformer-base model outperformed Transformer-base for almost all the settings indicates Conformer's superior local modeling capability. Also, the larger models achieved better performance in the highly overlapped settings. As regards the single-channel case, while the overall WERs were higher, the trend was consistent between the single- and multi-channel cases, except for the non-overlap scenario. With the seven-channel input, all models showed similar performance for 0S and 0L. On the other hand, when only one channel was used, the self-attention models were markedly better. This could indicate that the seven-channel features contain sufficiently rich information for simpler networks to do the beamforming well.
Meanwhile, the information in the single-channel signal is quite limited, requiring a more advanced structure.
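For reference, a common way to realize the mask-based MVDR beamforming used in the seven-channel evaluation is the formulation of [23], sketched below in NumPy. This is a generic illustration rather than the exact system used in the paper: the names are made up, channel 0 is assumed to be the reference microphone, and a small diagonal loading term is added for numerical stability.

```python
import numpy as np

def mask_mvdr(Y, speech_mask, noise_mask, ref=0, eps=1e-6):
    """Mask-based MVDR beamformer sketch in the style of [23].

    Y: (C, T, F) multi-channel STFT; speech_mask, noise_mask: (T, F).
    Returns the beamformed single-channel STFT of shape (T, F)."""
    C, T, F = Y.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        Yf = Y[:, :, f]                                      # (C, T)
        # Mask-weighted spatial covariance matrices of speech and noise.
        Rs = (speech_mask[:, f] * Yf) @ Yf.conj().T / (speech_mask[:, f].sum() + eps)
        Rn = (noise_mask[:, f] * Yf) @ Yf.conj().T / (noise_mask[:, f].sum() + eps)
        Rn = Rn + eps * np.eye(C)                            # diagonal loading
        num = np.linalg.solve(Rn, Rs)                        # Rn^{-1} Rs
        w = num[:, ref] / (np.trace(num) + eps)              # MVDR filter (C,)
        out[:, f] = w.conj() @ Yf                            # filter and sum
    return out
```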
Table 2 shows the continuous input evaluation results. The Conformer and Transformer models performed consistently better than BLSTM, but their performance gap became smaller on the large-overlap test sets. The relative WER gains obtained with Conformer-base over BLSTM were 4% and 15% for the hybrid and transducer ASR systems, respectively, which were smaller than those obtained in the utterance-wise evaluation. A possible explanation is that the self-attention based methods are good at using global information, while the chunk-wise processing limits the use of the context information.

It is noteworthy that the 0S results were much worse than those of 0L only in the continuous evaluation, which is consistent with the previous report [16]. The 0S dataset contains many more quick speaker turn changes, imposing a challenge for both speech separation and ASR. The self-attention-based models showed clear improvement over BLSTM, indicating that they are also helpful for dealing with turn-taking in natural conversations.

Table 2. Continuous speech separation evaluation for seven-channel and single-channel settings. Two numbers in a cell denote %WER of the hybrid [16] and E2E [28] ASR models, as in Table 1.

| System | 0S | 0L | 10 | 20 | 30 | 40 |
|---|---|---|---|---|---|---|
| No separation [16] | 15.4/12.7 | 11.5/5.7 | 21.7/17.6 | 27.0/24.4 | 34.3/30.9 | 40.5/37.5 |
| Seven-channel evaluation |
| BLSTM | 11.4/6.0 | –/4.1 | 13.1/7.0 | 14.9/7.9 | 18.7/11.5 | 20.5/12.3 |
| Transformer-base | 12.0/5.6 | 9.1/4.4 | 13.4/6.2 | 14.4/6.8 | 18.5/9.7 | 19.9/10.3 |
| Transformer-large | –/5.4 | 8.8/– | –/6.0 | 13.6/– | –/9.3 | –/10.2 |
| Conformer-base | 11.1/5.6 | 8.7/– | –/– | –/6.8 | 17.6/– | –/– |
| Conformer-xl-base | 11.4/5.4 | 8.7/4.1 | 13.2/6.2 | 13.6/6.7 | 17.8/9.5 | 20.0/10.8 |
| Conformer-xl-large | 11.0/5.2 | 8.8/4.1 | 12.9/5.8 | 13.7/6.7 | 17.5/9.4 | 19.8/10.6 |
| Single-channel evaluation |
| BLSTM | 19.1/11.7 | 16.1/9.7 | 22.1/14.5 | 27.4/19.1 | 33.0/25.9 | 37.6/30.1 |
| Transformer-base | 13.8/7.1 | –/6.6 | 16.7/9.6 | 20.8/13.3 | 26.7/18.6 | 31.0/21.6 |
| Transformer-large | –/7.2 | 12.3/6.9 | –/9.5 | –/– | –/16.9 | –/– |
| Conformer-base | 14.1/7.7 | 13.0/7.1 | 17.4/10.6 | 21.9/13.7 | 27.4/18.7 | 32.0/22.4 |
| Conformer-large | 13.3/– | –/– | –/– | –/– | –/– | –/– |
Table 2 also shows that the Conformer-xl models, which use the longer context information, did not achieve lower WERs, especially in the large overlap ratio settings. Two factors may have contributed to the performance degradation: 1) unexpected noise may have been introduced by the use of the longer history, which may contain more speakers' voices; 2) we did not consider the overlap regions of the adjacent windows during training, possibly widening the training/testing gap and resulting in sub-optimal performance. We leave training with overlap regions for future work.

To further verify the effectiveness of our method, we conducted an experiment on an internal real conversation corpus, which consists of 15.8 hours of single-channel recordings of daily group discussions, denoted as the Real Conversation dataset. In this dataset, the per-meeting speaker number ranges from 3 to 22. We applied a modified version of the conversation transcription system of [6], which includes a large-scale trained speech recognizer and a speaker embedding extractor, to obtain speaker-attributed transcriptions.

Compared with LibriCSS, those real meetings are significantly more complex with respect to the acoustics, linguistics, and inter-speaker dynamics. To deal with the real data challenges, three improvements were made. Firstly, we increased the amount of training data to 1500 hours. Additional clean speech samples were taken from a Microsoft internal corpus and mixed with the simulation setup of Section 3.1. Secondly, the separation network sometimes generated a low-volume residual signal on the redundant output channel for single-speaker regions, which increased the word insertion errors. To mitigate this, we introduced a merging scheme, where the two channel outputs were merged when a single active speaker was judged to be present. The merger was triggered when only one masked channel had a significantly large energy, as sketched below. Lastly, to reduce the distortion introduced by the masking operation, we used single-speaker signals corrupted by background noise as the training target. This allowed the separation network to focus only on the separation task and leave the noise to the ASR model. The WER and speaker-attributed WER (SA-WER) were used for evaluation, where the latter assesses the combined quality of speech transcription and speaker diarization [6].
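The energy-based merging trigger can be illustrated as follows; the threshold value and the per-window granularity are assumptions of this sketch, not values reported in the paper.

```python
import numpy as np

def maybe_merge(ch0, ch1, ratio_db=20.0):
    """Merge the two separated channels of a processing window when only
    one of them carries significant energy, i.e. when a single active
    speaker is likely. ratio_db is an assumed threshold."""
    e0 = np.sum(ch0**2) + 1e-12
    e1 = np.sum(ch1**2) + 1e-12
    if abs(10 * np.log10(e0 / e1)) > ratio_db:
        # One channel dominates: sum the two channels so the low-volume
        # residual does not surface as a separate (inserted) utterance.
        return [ch0 + ch1]
    return [ch0, ch1]
```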
Table 3. Continuous evaluation on a real meeting dataset. WERR and SA-WERR denote the relative WER and SA-WER reduction rates with respect to the system without a separation front-end ("Original").

| System | Data | WERR | SA-WERR |
|---|---|---|---|
| Original | N/A | 0 | 0 |
| BLSTM | 219 hr | -6.4% | -18.8% |
| Conformer-base | 219 hr | -7.2% | -6.3% |
| Conformer-large | 219 hr | -2.5% | 1.9% |
| Conformer-base | 1500 hr | 9.5% | 8.8% |
| Conformer-base-merge | 1500 hr | 8.4% | 10.13% |
| Conformer-base-merge-nlabel | 1500 hr | 11.8% | 13.7% |
| Conformer-large-merge-nlabel | 1500 hr | 8.08% | 18.4% |

Table 3 shows the WER and SA-WER reduction rates. With the three improvements described above, the proposed models reduced the WER and SA-WER by up to 11.8% and 18.4% relative, respectively, compared with the system without the separation front-end. Although the BLSTM based network improved the recognition results on the LibriCSS dataset, especially in the high overlap ratio settings, it largely degraded the speech recognition and speaker diarization performance on the Real Conversation dataset. Because speech overlap happens only sporadically in real conversations, it is important for the separation model not to hurt the performance in less overlapped cases. Thanks to their better modeling capacity, the Conformer based models significantly mitigated this performance degradation. In addition, it can be seen that each introduced step brought consistent improvements in both performance metrics.
4. CONCLUSION
In this work, we investigated the use of Conformer for continuous speech separation. The experimental results showed that it outperformed RNN-based models for both utterance-wise evaluation and continuous input evaluation. The superiority of Conformer over Transformer was also observed. This work is also the first to report substantial WER and SA-WER gains from speech separation in a single-channel real meeting transcription task. The results indicate the usefulness of appropriately utilizing context information in speech separation.

5. REFERENCES

[1] Jinyu Li, Rui Zhao, Zhuo Chen, Changliang Liu, Xiong Xiao, Guoli Ye, and Yifan Gong, "Developing far-field speaker system via teacher-student learning," in Proc. ICASSP. IEEE, 2018, pp. 5699–5703.
[2] Ladislav Mošner, Minhua Wu, et al., "Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning," in Proc. ICASSP. IEEE, 2019, pp. 6475–6479.
[3] Lei Sun, Jun Du, et al., "A speaker-dependent approach to separation of far-field multi-talker microphone array speech for front-end processing in the CHiME-5 challenge," IEEE JSTSP, vol. 13, no. 4, pp. 827–840, 2019.
[4] Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, and Shujie Liu, "On the comparison of popular end-to-end models for large scale speech recognition," in Interspeech, 2020.
[5] Shinji Watanabe, Michael Mandel, Jon Barker, and Emmanuel Vincent, "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," arXiv preprint arXiv:2004.09249, 2020.
[6] Takuya Yoshioka, Igor Abramovski, et al., "Advances in online audio-visual meeting transcription," in Proc. ASRU. IEEE, 2019, pp. 276–283.
[7] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. ICASSP. IEEE, 2016, pp. 31–35.
[8] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in Proc. ICASSP. IEEE, 2017, pp. 241–245.
[9] Morten Kolbæk, Dong Yu, et al., "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM TASLP, vol. 25, no. 10, pp. 1901–1913, 2017.
[10] Yi Luo and Nima Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM TASLP, vol. 27, no. 8, pp. 1256–1266, 2019.
[11] Yi Luo, Zhuo Chen, and Takuya Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in Proc. ICASSP. IEEE, 2020, pp. 46–50.
[12] Jingjing Chen, Qirong Mao, and Dong Liu, "Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation," arXiv preprint arXiv:2007.13975, 2020.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NIPS, 2017, pp. 5998–6008.
[14] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in Proc. ICASSP. IEEE, 2018, pp. 5884–5888.
[15] Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, and Shinji Watanabe, "End-to-end multi-speaker speech recognition with transformer," in Proc. ICASSP. IEEE, 2020, pp. 6134–6138.
[16] Zhuo Chen, Takuya Yoshioka, et al., "Continuous speech separation: Dataset and analysis," in Proc. ICASSP. IEEE, 2020, pp. 7284–7288.
[17] Yanzhang He, Tara N Sainath, et al., "Streaming end-to-end speech recognition for mobile devices," in Proc. ICASSP. IEEE, 2019, pp. 6381–6385.
[18] Qian Zhang, Han Lu, Hasim Sak, et al., "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss," in Proc. ICASSP, 2020.
[19] Anmol Gulati, James Qin, et al., "Conformer: Convolution-augmented transformer for speech recognition," arXiv preprint arXiv:2005.08100, 2020.
[20] Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, and Fil Alleva, "Multi-microphone neural speech separation for far-field multi-talker speech recognition," in Proc. ICASSP. IEEE, 2018, pp. 5739–5743.
[21] Yuxuan Wang, Arun Narayanan, and DeLiang Wang, "On training targets for supervised speech separation," IEEE/ACM TASLP, vol. 22, no. 12, pp. 1849–1858, 2014.
[22] Hakan Erdogan, John R Hershey, Shinji Watanabe, and Jonathan Le Roux, "Deep recurrent networks for separation and recognition of single-channel speech in nonstationary background audio," in New Era for Robust Speech Recognition, pp. 165–186. Springer, 2017.
[23] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, and H. Sawada, "A multichannel MMSE-based framework for speech source separation and noise reduction," IEEE/ACM TASLP, vol. 21, no. 9, pp. 1913–1928, 2013.
[24] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani, "Self-attention with relative position representations," in NAACL, 2018, pp. 464–468.
[25] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015, pp. 448–456.
[26] Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao, and Fil Alleva, "Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks," in Interspeech, 2018, pp. 3038–3042.
[27] Zihang Dai, Zhilin Yang, et al., "Transformer-XL: Attentive language models beyond a fixed-length context," arXiv preprint arXiv:1901.02860, 2019.
[28] Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, and Ming Zhou, "Semantic mask for transformer based end-to-end speech recognition," in Interspeech, 2020.
[29] Linguistic Data Consortium, Philadelphia, "CSR-II (WSJ1) Complete," 1994, http://catalog.ldc.upenn.edu/LDC94S13A.
[30] J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," JASA, vol. 65, pp. 943–950, 1979.
[31] Emanuël AP Habets and Sharon Gannot, "Generating sensor signals in isotropic noise fields," JASA, vol. 122, no. 6, pp. 3464–3470, 2007.
[32] Ilya Loshchilov and Frank Hutter, "Decoupled weight decay regularization," in ICLR, 2019.