Dual-Path Modeling for Long Recording Speech Separation in Meetings
Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian
MoE Key Lab of Artificial Intelligence, AI Institute, SpeechLab, Shanghai Jiao Tong University; Microsoft Corporation; Columbia University; NTT Corporation; Johns Hopkins University
ABSTRACT
Continuous speech separation (CSS) is the task of separating the speech sources from a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model long dependencies in speech and thus leads to sub-optimal performance. The recently proposed dual-path modeling could be a remedy to this problem, thanks to its capability of jointly modeling cross-window dependencies and local-window processing. In this work, we further extend the dual-path modeling framework to the CSS task. A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling. The proposed models are applied to LibriCSS, a real-recorded multi-talker dataset, and consistent WER reduction is observed in the ASR evaluation of the separated speech. In addition, a dual-path transformer equipped with convolutional layers is proposed, which significantly reduces the computation cost while achieving better WER. Furthermore, online-processing dual-path models are investigated, which also reduce the WER relative to the baseline.
Index Terms— continuous speech separation, long recording speech separation, online processing, dual-path modeling
1. INTRODUCTION
In recent years, the performance of speech separation has been significantly advanced [1–14]. However, when applied to real-world processing, most existing multi-talker automatic speech recognition (ASR) [15–20] and speech separation systems suffer from two kinds of mismatch. First, those systems are usually trained with well-segmented short recordings (e.g., WSJ0-2mix [1]), but in the real world, the duration of a conversation varies and can be very long in scenarios such as meetings. Second, these systems often assume that the speech is fully overlapped during training, which rarely happens in real-world conversations; e.g., as [21] suggests, the overlap ratio is usually low in meeting scenarios.

Continuous speech separation (CSS) [22, 23] was recently proposed to address long recording separation for real-world applications, where the long recording is split into smaller fixed-length windows and window-level speech separation is performed independently. The outputs from adjacent windows are concatenated, or stitched, into long output streams. Ideally, each output stream should contain only overlap-free speech, so that speaker diarization and ASR can then be performed on it without changing their single-active-speaker assumption. When the window size becomes small, given the overlapping characteristics of real speech, it is reasonable to assume that each window does not contain more than two or three speakers. Thus, a speech separation system trained with short speech and a small number of overlapping speakers can be applied to long recording speech separation. One limitation of CSS lies in its inability to capture information from a long span of the recording: as each window is processed independently, the receptive field of the separation system is the window length. Since the context in a long sequence usually contains information such as speaker identity, which has been shown to be beneficial for separation [24, 25], cross-window modeling could potentially further improve the separation performance.

The recently proposed dual-path recurrent neural network (DPRNN) [26] has shown promise for speech separation tasks. DPRNN splits the long input sequence into smaller fixed-length windows and applies two types of RNN layers, namely an intra-window RNN and an inter-window RNN, iteratively on the segmented windows. The alternating modeling architecture allows the network to access information across windows that are far apart in time while maintaining the separation performance for each local window, making DPRNN a promising choice for long sequence modeling. In a recent work [27], the authors applied dual-path (or multi-path) modeling to long recording separation and achieved promising results. However, the initial experiments only considered a fixed maximal number of speakers in the entire meeting, which consisted only of close-talk utterances, and the recording-level permutation was aligned across all the windows during training. In [28], DPRNN for long recording separation was initially investigated under a simulated setup.

In this paper, we further investigate dual-path modeling in the CSS framework under a realistic setup. Similar to DPRNN, we iteratively stack local and global processing models for long sequence modeling. We compare the two most popular model types for dual-path modeling, the RNN and the transformer [29, 30]. In the RNN-based DP models, we compare the dual-path bidirectional long short-term memory (DP-BLSTM) with the baseline BLSTM on different window sizes. A unidirectional LSTM is also explored for global modeling, which allows the system to be deployed for online meeting processing. In the transformer-based DP models, an additional sampling method is proposed to reduce the computation cost as well as improve the separation performance. The experiments show that the dual-path modeling method not only improves the speech separation performance on a simulated test set, but also effectively reduces the word error rate in the automatic speech recognition evaluation on real meeting recordings.
2. CSS: TASK DEFINITION AND BASELINE
The pipeline of conventional continuous speech separation (CSS) is illustrated in Figure 1. It consists of three stages: segmentation, separation, and stitching.
Fig. 1: A)–C): The continuous speech separation pipeline. A): The segmentation stage splits the long recording into short windows with window size $K$ and hop length $P$. B): The separation stage performs speech separation for each window. C): The stitching stage concatenates the separated windows into continuous outputs that contain only non-overlapped speech. D): An illustration of the DP block.

Denote $\mathbf{W} \in \mathbb{R}^{L \times F}$ as the magnitude spectrum of the single-channel continuous mixture input, where $F$ is the number of frequency bins and $L$ is the number of frames. The segmentation stage splits $\mathbf{W}$ into $B$ windows $\mathbf{D}_b \in \mathbb{R}^{K \times F}$, $b = 1, \cdots, B$, with window size $K$ and hop size $P$. The segmented meeting can then be represented as a 3-D tensor $\mathbf{T} = [\mathbf{D}_1, \cdots, \mathbf{D}_B] \in \mathbb{R}^{B \times K \times F}$, on top of which a feature extraction module is applied to form the features for the separation step, with shape $\hat{\mathbf{T}} \in \mathbb{R}^{B \times K \times N}$, where $N$ is the feature dimension.

Then, for each window, $C$ streams of output $\mathbf{O}_b \in \mathbb{R}^{K \times F \times C}$ are estimated by the separation module, where $C$ is the number of output channels. We set $C = 2$ in this work, assuming that no more than two speakers overlap most of the time [21]. A mask-based BLSTM separation network is used as the baseline in this work, with the phase-sensitive mask [31] as the network output.

After obtaining the separation result for each window, the stitching step aligns the permutations between adjacent window outputs by finding the permutation that maximizes the similarity of the separation results on the shared region between adjacent windows. The final result is estimated with a simple overlap-and-add step that connects the local separation results into the outputs $\mathbf{SO}_c \in \mathbb{R}^{L \times F}$, $c = 1, \cdots, C$, with the same length as the mixed signal.
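To make the segmentation (A) and stitching (C) stages concrete, below is a minimal NumPy sketch of both, assuming $C = 2$ output channels. The window size K, hop P, and the similarity-driven permutation alignment on the shared region follow the description above; the function names, the mean-squared-error similarity measure, and the uniform overlap-and-add weighting are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def segment(W, K, P):
    """Split a magnitude spectrogram W (L x F) into B overlapping
    windows of K frames with hop P (segmentation stage)."""
    L, F = W.shape
    B = max(1, int(np.ceil((L - K) / P)) + 1)
    W = np.pad(W, ((0, (B - 1) * P + K - L), (0, 0)))  # zero-pad the tail
    return np.stack([W[b * P : b * P + K] for b in range(B)])  # (B, K, F)

def stitch(outputs, K, P, L):
    """Overlap-and-add per-window separation results (B, 2, K, F) into
    two continuous streams, aligning each window's channel permutation
    with the previous window on their shared (K - P)-frame region."""
    B, C, _, F = outputs.shape
    streams = np.zeros((C, (B - 1) * P + K, F))
    weight = np.zeros(((B - 1) * P + K, 1))
    prev = outputs[0]
    for b in range(B):
        cur = outputs[b]
        if b > 0:
            tail, head = prev[:, P:], cur[:, : K - P]  # shared region
            # keep or swap the channels, whichever is more similar (lower MSE)
            if ((tail - head[::-1]) ** 2).sum() < ((tail - head) ** 2).sum():
                cur = cur[::-1]
        streams[:, b * P : b * P + K] += cur
        weight[b * P : b * P + K] += 1.0
        prev = cur
    return (streams / np.maximum(weight, 1e-8))[:, :L]  # (2, L, F)
```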
3. DUAL-PATH MODELING FOR CSS

3.1. Dual-Path Modeling
As Figure 1 B) shows, the DP model stacks $R$ repeats of the basic DP block; the details of one DP block are illustrated in Figure 1 D). Each DP block consists of two sequence modeling layers, namely the local and the global processing layer, where the former focuses on short-term signal modeling and the latter captures long-span information across windows. With the 3-D tensor as the input feature, the global and local layers perform sequence modeling on different axes. By alternating them in a deep DP network, information from the long sequence can pass across windows, i.e., the network can optimize for the entire long sequence rather than for each local window as in the baseline system. Meanwhile, as each sequence layer only models part of the entire sequence, the learning efficiency is significantly improved compared with a single sequence layer modeling the whole long sequence.

Denote the bottleneck input feature as $\hat{\mathbf{T}} = [\hat{\mathbf{D}}_1, \cdots, \hat{\mathbf{D}}_B] \in \mathbb{R}^{B \times K \times N}$. The local layer first performs intra-window processing for each individual window $\hat{\mathbf{D}}_b \in \mathbb{R}^{K \times N}$:

$\mathbf{E}_b = f_{\mathrm{local}}(\hat{\mathbf{D}}_b)$  (1)

where $f_{\mathrm{local}}(\cdot)$ is the local layer transformation function, $\mathbf{E}_b \in \mathbb{R}^{K \times H}$ is the processed feature, and $H$ is the hidden dimension of the sequential model. $\mathbf{E}_b$ is then processed by a bottleneck fully connected (FC) layer and layer normalization (LN) [32] to build a residual connection [33]:

$\mathbf{L}_b = \hat{\mathbf{D}}_b + \mathrm{LN}(\mathrm{FC}(\mathbf{E}_b))$  (2)

where $\mathbf{L}_b \in \mathbb{R}^{K \times N}$ is the final output of the local processing. The outputs from all windows form another 3-D tensor $\mathbf{L} = [\mathbf{L}_1, \cdots, \mathbf{L}_B] \in \mathbb{R}^{B \times K \times N}$. Then, before the global processing, the 3-D tensor is reshaped and indexed as $\mathbf{L}_k = \mathbf{L}[:, k, :] \in \mathbb{R}^{B \times N}$, $k = 1, \cdots, K$. The global modeling is applied to $\mathbf{L}_k$ along the dimension $B$:

$\mathbf{Q}_k = f_{\mathrm{global}}(\mathbf{L}_k)$  (3)

where $f_{\mathrm{global}}(\cdot)$ is the global sequential modeling function and $\mathbf{Q}_k \in \mathbb{R}^{B \times H}$ is the globally processed feature. Similar to the local processing, the bottleneck FC layer, layer normalization, and residual connection are applied:

$\mathbf{G}_k = \mathbf{L}_k + \mathrm{LN}(\mathrm{FC}(\mathbf{Q}_k))$  (4)

where $\mathbf{G}_k \in \mathbb{R}^{B \times N}$ is the output of the global processing. The rearranged output $\mathbf{G} = [\mathbf{G}_1, \cdots, \mathbf{G}_K] \in \mathbb{R}^{B \times K \times N}$ serves as the input of the next DP block. The output of the last DP block $\hat{\mathbf{G}} \in \mathbb{R}^{B \times K \times N}$ is passed to an FC layer with a ReLU activation function to generate two T-F masks $\mathbf{M}_b^1, \mathbf{M}_b^2 \in \mathbb{R}^{K \times F}$ for each window's magnitude spectrum $\mathbf{D}_b$. The masks are applied to the magnitude spectrum by element-wise multiplication to obtain the predicted spectra $\mathbf{S}_b^1, \mathbf{S}_b^2 \in \mathbb{R}^{K \times F}$ for each window.

Window-level permutation invariant training (PIT) is applied during training; note that the permutations of different windows can differ. The training objective is the signal-to-noise ratio (SNR) in the time domain:

$\mathrm{SNR}(\hat{\mathbf{s}}, \mathbf{s}) = 10 \log_{10} \frac{\|\mathbf{s}\|^2}{\|\hat{\mathbf{s}} - \mathbf{s}\|^2}$  (5)

where $\hat{\mathbf{s}}$ and $\mathbf{s}$ are the estimated and the reference signals of a single window, respectively. The stitching is performed during the inference phase: we calculate the similarity between the predicted masks of adjacent windows to determine the permutation for stitching.
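As an illustration of Eqs. (1)–(4), here is a PyTorch sketch of a single DP block with BLSTM local and global layers. The overall structure (a sequence layer followed by an FC + LayerNorm residual, applied first along the K axis and then along the B axis) follows the text; the class name, the single-layer LSTM configuration, and the flag for a unidirectional global RNN (used later for online processing) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DPBlock(nn.Module):
    """One dual-path block: a local (intra-window) BLSTM over the K axis,
    then a global (inter-window) RNN over the B axis, each followed by an
    FC + LayerNorm residual connection, as in Eqs. (1)-(4)."""

    def __init__(self, n_feat, n_hidden, global_bidirectional=True):
        super().__init__()
        self.local_rnn = nn.LSTM(n_feat, n_hidden,
                                 batch_first=True, bidirectional=True)
        self.local_fc = nn.Linear(2 * n_hidden, n_feat)
        self.local_norm = nn.LayerNorm(n_feat)
        g_dir = 2 if global_bidirectional else 1   # 1 => causal / online
        self.global_rnn = nn.LSTM(n_feat, n_hidden, batch_first=True,
                                  bidirectional=global_bidirectional)
        self.global_fc = nn.Linear(g_dir * n_hidden, n_feat)
        self.global_norm = nn.LayerNorm(n_feat)

    def forward(self, x):                              # x: (B, K, N)
        # local processing: each window is a length-K sequence, Eqs. (1)-(2)
        e, _ = self.local_rnn(x)                       # (B, K, 2H)
        x = x + self.local_norm(self.local_fc(e))
        # global processing: frame k across all B windows, Eqs. (3)-(4)
        xt = x.transpose(0, 1)                         # (K, B, N)
        q, _ = self.global_rnn(xt)                     # (K, B, g_dir*H)
        xt = xt + self.global_norm(self.global_fc(q))
        return xt.transpose(0, 1).contiguous()         # back to (B, K, N)
```

Stacking R such blocks and feeding the last output to the mask-estimation FC layer reproduces the overall separation network described above.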
Fig. 2: The boosted dual-path modeling approach. The 1-D convolution layer downsamples the features along the dimension $K$. The size-reduced features are then processed by the following DP blocks. Before the last DP block, a transposed 1-D convolution layer upsamples the features back to the original length.

3.2. Improving the DP Models

In this paper, we introduce two updates to the plain DP models to obtain better separation performance as well as computational efficiency.

First, the transformer encoder layer [29] is used to replace the RNN in the DP models, since the transformer has been shown to be more effective than the RNN in many speech-related tasks [34]. Note that a very recent work [30] makes a similar update to DPRNN, but its initial experiments are limited to conventional close-talk utterance-level separation.

Second, we propose a simple method to improve the DP transformer. As Figure 2 shows, a 1-D convolution layer is inserted between the first and the second DP blocks in the separation network. The 1-D convolution is performed along the dimension $K$ and downsamples the intermediate feature $\hat{\mathbf{T}}_1 \in \mathbb{R}^{B \times K \times N}$ into the smaller $\tilde{\mathbf{T}}_1 \in \mathbb{R}^{B \times K' \times N}$, where $K = \lambda K'$ and $\lambda$ is the sampling factor. Before the last DP block, the intermediate feature $\tilde{\mathbf{T}}_{R-1} \in \mathbb{R}^{B \times K' \times N}$ is processed by a transposed 1-D convolution and upsampled back to $\hat{\mathbf{T}}_{R-1} \in \mathbb{R}^{B \times K \times N}$, which has the same shape as the input, where $R$ is the number of repeated DP blocks. There are two motivations for this convolution-based resampling in the DP model. First, it can effectively reduce the computation cost, especially when $R$ becomes large and a proper $\lambda$ is chosen. Second, the convolution kernel makes the local information better represented within a single frame of a local window, which may benefit the global information interaction.
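The convolutional sampling of Fig. 2 can be sketched as follows, reusing the DPBlock sketch above. Downsampling after the first block and upsampling before the last one follow the text; choosing the kernel size equal to the stride λ, so that K maps exactly to K' = K/λ, is an assumption.

```python
import torch.nn as nn

class SampledDPStack(nn.Module):
    """R DP blocks with 1-D convolutional resampling of the K axis: the
    middle R-2 blocks run at the reduced length K' = K / lam (Fig. 2)."""

    def __init__(self, n_feat, n_hidden, n_blocks=6, lam=4):
        super().__init__()
        assert n_blocks >= 3
        self.first = DPBlock(n_feat, n_hidden)
        self.down = nn.Conv1d(n_feat, n_feat, kernel_size=lam, stride=lam)
        self.middle = nn.ModuleList(
            [DPBlock(n_feat, n_hidden) for _ in range(n_blocks - 2)])
        self.up = nn.ConvTranspose1d(n_feat, n_feat,
                                     kernel_size=lam, stride=lam)
        self.last = DPBlock(n_feat, n_hidden)

    def forward(self, x):                  # x: (B, K, N), K divisible by lam
        x = self.first(x)
        # Conv1d expects (batch, channels, length), so swap N and K
        x = self.down(x.transpose(1, 2)).transpose(1, 2)   # (B, K/lam, N)
        for blk in self.middle:
            x = blk(x)
        x = self.up(x.transpose(1, 2)).transpose(1, 2)     # (B, K, N)
        return self.last(x)
```

Since the middle blocks operate on sequences λ times shorter along the K axis, their per-block cost shrinks accordingly, which is where the computational saving comes from.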
4. EXPERIMENTS

4.1. Dataset
We aim to compare the separation performance in a real application, so LibriCSS [23] is used as the test set. It contains 10 hours of audio recordings made in regular meeting rooms. Each mini-session in LibriCSS includes 8 speakers, and the overlap ratio ranges from 0% to 40%; readers can refer to [23] for more details. The recordings are first processed by the separation models, and then the continuous-input ASR evaluation is conducted.

Given that LibriCSS only contains evaluation data, we create a training set for the separation models that consists of artificially simulated noisy and reverberant long-duration audio based on 16 kHz LibriSpeech [35]. The reverberant speech is created by convolving the clean utterances with simulated room impulse responses (RIRs) generated with the image method [36]. To simulate long conversations, we create virtual rooms for training, validation, and testing, each containing multiple RIRs corresponding to different speaker positions. The room dimensions, the microphone position near the center of the room, the speaker positions away from the walls, and the reverberation time are all randomly sampled. In each simulated room, we randomly choose candidate speakers from LibriSpeech [35] and simulate several meetings. When generating each meeting, we randomly pick a few of the room's speakers, and several utterances of these speakers are randomly picked to create the speech mixture. The overlap ratio of each meeting is uniformly sampled, and the overlap region contains up to 2 simultaneous speakers, given that more than 2 speakers talking at the same time is very rare in real meetings [21]. Additive Gaussian noise with a random SNR is then added to the mixture, and separate sets of meetings are generated for training, validation, and testing. A condensed sketch of this simulation procedure is given below.
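As referenced above, here is a condensed sketch of the meeting simulation, assuming SciPy's fftconvolve for the RIR convolution. RIR generation itself (the image method) is elided, and the pairwise-overlap placement and SNR-based noise scaling are simplified stand-ins for the exact recipe; all parameter names are placeholders.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_meeting(utts, rirs, overlap_ratio, noise_snr_db, rng):
    """Overlap-add reverberant utterances into one meeting-style mixture.
    `utts` is a list of mono waveforms, `rirs` holds one RIR per utterance
    (the virtual position of its speaker), and `overlap_ratio` controls
    how far each utterance overlaps the previous one, so no more than two
    speakers are active at a time."""
    reverbs = [fftconvolve(u, h)[: len(u)] for u, h in zip(utts, rirs)]
    buf = np.zeros(sum(len(r) for r in reverbs))    # generous buffer
    t = end = 0
    for r in reverbs:
        buf[t : t + len(r)] += r
        end = max(end, t + len(r))
        t += int(len(r) * (1.0 - overlap_ratio))    # next start inside tail
    mix = buf[:end]
    # additive Gaussian noise at the requested SNR (dB)
    noise = rng.standard_normal(len(mix))
    scale = np.sqrt((mix ** 2).mean()
                    / ((noise ** 2).mean() * 10 ** (noise_snr_db / 10.0)))
    return mix + scale * noise

# e.g. mix = simulate_meeting(utts, rirs, 0.2, 15.0, np.random.default_rng(0))
```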
4.2. Experimental Setup

In the feature extraction, a short-time Fourier transform (STFT) is applied to the input. The window size $K$ in the segmentation stage is chosen from four settings corresponding to {0.8, 1.6, 2.4, 3.2} seconds. Following the assumption in Section 1 that each small window contains up to 2 speakers, the separation model generates two outputs for each window. All models share the same bottleneck feature dimension $N$. The RNN-based baseline model is a multi-layer BLSTM. The RNN-based DP model stacks several repeats of the DP block, each containing single-layer BLSTMs for local and global processing with the same hidden size as the RNN baseline, so the parameter size of the entire model matches the baseline. For the RNN-based DP models, an online implementation is also compared, in which the global processing RNN is a unidirectional LSTM. The transformer baseline stacks transformer encoder layers with multi-head attention, and the DP transformer uses fewer blocks to keep the number of parameters comparable with the baseline (8.2 M for both; see Table 2). The Adam optimizer [37] is used for both kinds of models, with a lower initial learning rate for the transformer-based models, where the warm-up scheduler [29] is also applied. In the RNN-based model training, the learning rate is decayed whenever the validation loss does not decrease. All models are trained for a fixed number of epochs, and the best model on the validation set is chosen; for the transformer-based models, the parameters of the best checkpoints are averaged to obtain the model for evaluation. The experiments are conducted with the ESPnet-SE [38] toolkit.
Table 1: Pre-stitching window-level SNR (dB) with a 2.4 s window, for different overlap ratios and models.

Models        Model Size (M)    0       0-25    25-50   50-75   75-100
BLSTM         13.9              16.25
Trans.        8.2               16.15   8.15    9.79    9.49    8.79
DP-Trans.     8.2
DP-Trans. +   10.1              16.14
4.3. Results

We first evaluated the window-level SNR before the stitching stage on the simulated test set. The results, broken down by overlap ratio, are listed in Table 1. They show that the DP models consistently beat their baselines with comparable parameter sizes, except under one of the overlap-ratio conditions.
Table 2: WER (%) evaluation on LibriCSS for continuous speech separation with different models. All our models in the table use a window size of 2.4 s. 0S/0L [23]: 0% overlap ratio with short/long silence.

Systems        Model Size (M)   MACs (Giga)   0S      0L      10      20      30      40
Mixture [23]   -                -             15.4    11.5    21.7    27.0    34.3    40.5
BLSTM [23]
BLSTM          13.9
DP-BLSTM       13.9
Trans.         8.2              31.5          16.0    14.4    19.0    22.6    29.5    33.5
DP-Trans.      8.2              31.5          15.6    14.7    18.8    22.8
DP-Trans. +    10.1
The continuous ASR evaluation follows the same protocol as [23], with the default ASR back-end from the LibriCSS dataset. After the stitching stage, the separated overlap-free speech is fed into the pretrained ASR evaluation pipeline, and the word error rates (WERs) of the different models are reported in Table 2.

In Table 2, both our BLSTM and transformer baselines are stronger than the BLSTM reported in [23] (the 2nd row). DP-BLSTM obtains better WERs than the BLSTM baseline, except for the 0S results. The improvement from the DP transformer is relatively smaller, but it still shows effectiveness on the overlapped meetings. The DP transformer equipped with convolution layers (last row in the table) substantially reduces the number of multiply-accumulate (MAC) operations. At the same time, it also shows better WER in most conditions, especially in meetings with low overlap ratios.
Table 3: WER (%) evaluation on LibriCSS for continuous speech separation with different local processing window sizes. The comparison is conducted on the dual-path and the baseline BLSTMs.

Window Size   Dual-Path   Online   0S      0L      10      20      30      40
0.8 s         No          Yes      16.1    12.7    19.9    25.0    31.8    36.4
0.8 s         Yes         No       15.0    12.8    18.1    22.9    28.3    31.7
0.8 s         Yes         Yes      14.7    13.2    18.6    24.3    29.3    32.7
1.6 s         No          Yes      16.2    14.5    20.1    25.1    31.3    34.6
1.6 s         Yes         No       15.0    12.0    18.4    23.0    28.6    31.6
1.6 s         Yes         Yes      15.8    12.9    18.5    23.6    29.9    32.9
2.4 s         No          Yes      15.3    13.6    18.6    24.9    30.4    33.9
2.4 s         Yes         No       16.0    12.1    18.6    24.1    29.1    32.7
2.4 s         Yes         Yes      15.6    12.4    18.4    23.6    29.9    32.8
3.2 s         No          Yes      15.5    13.4    19.4    24.7    30.7    33.7
3.2 s         Yes         No       15.2    12.3    18.7    24.3    29.9    33.7
3.2 s         Yes         Yes      15.9    12.7    18.8    23.8    29.9    33.4

Bidirectional modeling (BLSTM or self-attention) is used for the cross-window information interaction, so the DP models above cannot be directly applied to online processing. One straightforward way to enable online processing for the DP models is to replace the BLSTM with a unidirectional LSTM (uni-LSTM) for the cross-window processing. It is also possible to build a DP transformer for online processing, but we leave that for future work. Note that under the uni-LSTM global modeling setup, the maximum system latency equals the local window size. Table 3 compares the online DPRNNs with the offline DPRNNs and the baseline BLSTM across different window sizes. The results show that, for the local baseline model, the WERs get worse as the window size becomes smaller, because a smaller window provides less local information. In contrast, the DP models always outperform their corresponding local baseline models. One interesting finding is that smaller window sizes achieve better WERs for the DP models. A possible explanation is that a smaller window size produces more windows, leading to a finer resolution for global modeling and thus enhancing the information passed across windows. The last row in each section of Table 3 lists the WERs of the online dual-path models, which also show their efficacy compared to the baseline; a minimal sketch of the causal global layer is given below.
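As referenced above, here is a minimal sketch of the causal global layer behind the online DP models: the inter-window BLSTM is replaced by a unidirectional LSTM whose hidden state is carried from one window to the next, so each window can be emitted as soon as its local processing finishes. The class and method names are assumptions; the local layer and mask estimation are unchanged from Section 3.

```python
import torch
import torch.nn as nn

class OnlineGlobalLayer(nn.Module):
    """Causal inter-window processing: a uni-LSTM advanced one window at
    a time, so the maximum latency equals the local window size."""

    def __init__(self, n_feat, n_hidden):
        super().__init__()
        self.rnn = nn.LSTM(n_feat, n_hidden, batch_first=True)
        self.fc = nn.Linear(n_hidden, n_feat)
        self.norm = nn.LayerNorm(n_feat)

    def step(self, window, state=None):
        """window: (K, N) locally processed features of the newest window.
        The K frame positions form the batch; the inter-window sequence is
        advanced by one step, with `state` summarizing all past windows."""
        q, state = self.rnn(window.unsqueeze(1), state)    # (K, 1, H)
        out = window + self.norm(self.fc(q.squeeze(1)))    # residual, (K, N)
        return out, state                                  # reuse state next

# streaming loop (hypothetical): for each arriving window w:
#   w = local_layer(w); w, state = layer.step(w, state); masks = head(w)
```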
5. CONCLUSION
In this paper, we investigated dual-path modeling for long recording speech separation in real meeting scenarios. We explored both RNN- and transformer-based dual-path models, and the experimental results showed that the dual-path models consistently outperformed the baselines in the CSS task. We proposed a dual-path transformer with convolutional sampling, which reduces the computation cost and achieves a relative WER reduction on LibriCSS meeting recordings compared to the baseline. The online dual-path model also achieved a relative WER reduction, which makes it a strong candidate for online continuous speech separation.
6. ACKNOWLEDGMENTS
Chenda Li and Yanmin Qian were supported by the China NSFC projects (No. 62071288 and U1736202). The work reported here was started at JSALT 2020 at JHU, with support from Microsoft, Amazon, and Google.

7. REFERENCES

[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. IEEE ICASSP, 2016, pp. 31–35.
[2] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," Proc. ISCA Interspeech, pp. 545–549, 2016.
[3] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in Proc. IEEE ICASSP, 2017, pp. 241–245.
[4] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Trans. ASLP., vol. 25, no. 10, pp. 1901–1913, 2017.
[5] Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in Proc. IEEE ICASSP, 2017, pp. 246–250.
[6] Y. Luo, Z. Chen, and N. Mesgarani, "Speaker-independent speech separation with deep attractor network," IEEE/ACM Trans. ASLP., vol. 26, no. 4, pp. 787–796, 2018.
[7] Y. Luo, Z. Chen, J. R. Hershey et al., "Deep clustering and conventional networks for music separation: Stronger together," in Proc. IEEE ICASSP, 2017, pp. 61–65.
[8] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, "Alternative objective functions for deep clustering," in Proc. IEEE ICASSP, 2018.
[9] Y. Luo and N. Mesgarani, "TasNet: Time-domain audio separation network for real-time, single-channel speech separation," in Proc. IEEE ICASSP, 2018, pp. 696–700.
[10] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Trans. ASLP., vol. 27, no. 8, pp. 1256–1266, 2019.
[11] C. Xu, W. Rao, E. S. Chng, and H. Li, "Time-domain speaker extraction network," in Proc. IEEE ASRU, 2019, pp. 327–334.
[12] P. Wang, Z. Chen, X. Xiao et al., "Speech separation using speaker inventory," in Proc. IEEE ASRU, 2019.
[13] N. Zeghidour and D. Grangier, "Wavesplit: End-to-end speech separation by speaker clustering," arXiv preprint arXiv:2002.08933, 2020.
[14] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, "End-to-end microphone permutation and number invariant multi-channel speech separation," in Proc. IEEE ICASSP, 2020, pp. 6394–6398.
[15] D. Yu, X. Chang, and Y. Qian, "Recognizing multi-talker speech with permutation invariant training," in Proc. ISCA Interspeech, 2017, pp. 2456–2460.
[16] S. Settle, J. Le Roux, T. Hori et al., "End-to-end multi-speaker speech recognition," in Proc. IEEE ICASSP, 2018, pp. 4819–4823.
[17] X. Chang, W. Zhang, Y. Qian et al., "MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition," in Proc. IEEE ASRU, 2019, pp. 237–244.
[18] W. Zhang, X. Chang, Y. Qian, and S. Watanabe, "Improving end-to-end single-channel multi-talker speech recognition," IEEE/ACM Trans. ASLP., vol. 28, pp. 1385–1394, 2020.
[19] T. von Neumann, C. Boeddeker, L. Drude et al., "Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR," in Proc. ISCA Interspeech, 2020.
[20] N. Kanda, X. Chang, Y. Gaur et al., "Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings," arXiv preprint arXiv:2008.04546, 2020.
[21] Ö. Çetin and E. Shriberg, "Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: Insights for automatic speech recognition," in Ninth International Conference on Spoken Language Processing, 2006.
[22] T. Yoshioka, I. Abramovski, C. Aksoylar et al., "Advances in online audio-visual meeting transcription," in Proc. IEEE ASRU, 2019, pp. 276–283.
[23] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, and J. Li, "Continuous speech separation: Dataset and analysis," in Proc. IEEE ICASSP, 2020, pp. 7284–7288.
[24] Q. Wang, H. Muckenhirn, K. Wilson et al., "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Proc. ISCA Interspeech, 2019, pp. 2728–2732.
[25] M. Delcroix, K. Zmolikova, K. Kinoshita et al., "Single channel target speaker extraction and recognition with SpeakerBeam," in Proc. IEEE ICASSP, 2018, pp. 5554–5558.
[26] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in Proc. IEEE ICASSP, 2020, pp. 46–50.
[27] K. Kinoshita, T. von Neumann, M. Delcroix et al., "Multi-path RNN for hierarchical modeling of long sequential data and its application to speaker stream separation," in Proc. ISCA Interspeech, 2020.
[28] C. Li, Y. Luo, C. Han et al., "Dual-path RNN for long recording speech separation," in Proc. IEEE SLT, 2021, pp. 865–872.
[29] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[30] J. Chen, Q. Mao, and D. Liu, "Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation," Proc. ISCA Interspeech, 2020.
[31] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. IEEE ICASSP, 2015, pp. 708–712.
[32] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[33] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[34] S. Karita, N. Chen, T. Hayashi et al., "A comparative study on transformer vs RNN in speech applications," in Proc. IEEE ASRU, 2019, pp. 449–456.
[35] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, 2015, pp. 5206–5210.
[36] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
[37] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[38] C. Li, J. Shi, W. Zhang et al., "ESPnet-SE: End-to-end speech enhancement and separation toolkit designed for ASR integration," in Proc. IEEE SLT, 2021.