Speaker and Direction Inferred Dual-channel Speech Separation
Chenxing Li, Jiaming Xu†, Nima Mesgarani, Bo Xu†
Institute of Automation, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
Columbia University, New York, NY, USA
† Corresponding author
ABSTRACT
Most speech separation methods, trying to separate all channel sources simultaneously, are still far from having enough generalization capability for real scenarios, where the number of input sounds is usually uncertain and even dynamic. In this work, we employ ideas from auditory attention with two ears and propose a speaker and direction inferred speech separation network (dubbed SDNet) to solve the cocktail party problem. Specifically, our SDNet first parses out the respective perceptual representations, with their speaker and direction characteristics, from the mixture of the scene in a sequential manner. Then, the perceptual representations are utilized to attend to each corresponding speech. Our model generates more precise perceptual representations with the help of spatial features and successfully deals with the problem of an unknown number of sources and the selection of outputs. Experiments on the standard fully-overlapped speech separation benchmarks WSJ0-2mix, WSJ0-3mix, and WSJ0-2&3mix show the effectiveness of the approach, and our method achieves SDR improvements of 25.31 dB, 17.26 dB, and 21.56 dB under anechoic settings. Our code will be released at https://github.com/aispeech-lab/SDNet.

Index Terms — dual-channel speech separation, speaker and direction-inferred separation, cocktail party problem.
1. INTRODUCTION
In many environments, the auditory scene is composed of several concurrent speech streams whose spectral features overlap both in space and time. The human auditory system exhibits a remarkable ability to parse these complex scenes. However, background noise, overlapping speech, and reverberation damage the speech quality and degrade the performance of speech recognition.

Recently, researchers have attempted to alleviate this problem and have paid extensive attention to neural-network-based speech separation. For the single-channel separation task, many methods have achieved state-of-the-art (SOTA) performance, such as the frequency-domain-based DPCL [1], DANet [2], PIT [3], Chimera++ [4], CBLDNN-GAT [5], SPNet [6], and Deep CASA [7], and the time-domain-based TasNet [8], FurcaPa [9], and DPRNN [10]. These methods design the model structure from different perspectives and follow different training strategies, and the factors affecting performance have been investigated in depth. However, they still face several challenges: an unknown number of sources in the mixture, the permutation problem, and the selection among multiple outputs.

To deal with the situation where the number of sources in the mixed speech is unknown, [11] incorporates DPCL into masking-based beamforming and performs separation. OR-PIT [12] separates only one speaker from the mixture at a time, and the residual signal is sent back to the separation model recursively to separate the next speaker; an iteration termination criterion is proposed to identify the number of speakers accurately. A speaker-inferred model [13] uses a Seq2Seq-based method [14, 15] to infer the speakers, and the speaker information is also appended to the output. Auxiliary autoencoding PIT [16] is proposed to further improve the performance across varying numbers of speakers.

Speaker-aware networks [17-21] try to deal with the problems of permutation and output selection. These methods aim to recover a single target speaker while reducing noise and the effect of interfering speakers; reference speech from the target speaker must be given in advance.

In addition to single-channel methods, multi-channel methods can extract additional direction features to further improve performance, and some of them address the permutation and output-selection problems. Similar to the speaker-aware networks, Li et al. [22] use fixed beamformers to transform the multi-channel mixture into single-channel signals; an attention network is designed to identify the direction of the target speaker and combine the beamformed signals, and SpeakerBeam [18] is then applied to separate the enhanced signal. The direction-aware method [23] focuses on the target source in a specific direction by using a time-domain network.

PIT-based methods [3, 5, 6, 8-10] need prior knowledge of the number of speakers and suffer from the permutation problem, which limits them in real environments. Among the existing methods that account for identifying the number of outputs, [12] requires iterative operations, which increases system complexity, and [11, 12] still cannot solve the problem of output selection. Speaker-aware methods need to know the target speaker in advance, so the speech of the other speakers cannot be separated. Besides, in single-channel methods, speakers with similar pitch are difficult to separate.
By extending to multi-channel methods, an additional direction feature can be acquired by the network. In real environments, a source signal carries unique speaker and direction information. We propose a speaker and direction-inferred dual-channel speech separation network (SDNet), which first infers the speaker and direction information and then uses it as cues to separate the speech. Our contributions are as follows: (1) We extend the single-channel time-domain separation of [13] to multiple channels, so that spectral and spatial features are fully utilized. (2) Instead of manually extracting channel differences as in [25, 26], the channel differences are extracted by the network and can be optimized end-to-end. (3) The network simultaneously infers speaker and direction information, which is fused into a source mask for separation; by dynamically estimating the number of source masks, the network copes with an unknown number of outputs. (4) After separation, the speaker and direction information is appended to the separated speech; it can be used in subsequent tasks, and the network thereby deals with the problem of output selection.

Scale-invariant signal-to-noise ratio (SISNR) [8] and signal-to-distortion ratio (SDR) [27] improvements are used to evaluate the performance. Experimental results show that SDNet performs separation effectively in both anechoic and reverberant settings.
[Fig. 1: The model architecture of SDNet, comprising the feature extraction module (two 1-D conv encoders and the IAC), the inference module (a BLSTM with attentive speaker and direction decoders), and the separation module, applied to a three-talker dual-channel mixture scene. In the separation module, SConv and DConv represent the depth-wise separable convolution [24].]
2. SYSTEM OVERVIEW
The overall architecture of our model is shown in Fig. 1. The network is composed of three components: (1) the feature extraction module processes the features from each channel and extracts the channel differences; (2) the inference module parses out the speakers and directions from the mixture and generates the source masks; (3) the separation module processes the features and integrates the source masks to generate the separated outputs.
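Read as code, the three components compose as in the following minimal PyTorch skeleton (a sketch under our assumptions about the module interfaces, not the authors' released implementation):

```python
import torch.nn as nn

class SDNet(nn.Module):
    """Top-level skeleton of Fig. 1; the three sub-modules are passed in,
    mirroring the decomposition described above (a sketch, not the paper's code)."""
    def __init__(self, feature_extraction, inference_module, separation_module):
        super().__init__()
        self.feature_extraction = feature_extraction
        self.inference = inference_module
        self.separation = separation_module

    def forward(self, ch1, ch2):
        # (1) per-channel encoding plus channel-difference features
        f, f_o = self.feature_extraction(ch1, ch2)
        # (2) one fused speaker-and-direction mask per inferred source
        source_masks = self.inference(f_o)
        # (3) mask-conditioned separation, one waveform per source
        return [self.separation(f, sm) for sm in source_masks]
```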
2.1. Feature extraction module

The encoder transforms the mixture waveforms into an intermediate feature space. In detail, each input segment is transformed into a representation by a one-dimensional (1D) convolutional layer:

$$E_i = \mathrm{Conv1d}(CH_i), \quad i = 1, 2, \tag{1}$$

where $E_i$ indicates the output of the $i$-th encoder and $CH_i$ represents the waveform of the $i$-th channel.

Compared with single-channel models, dual-channel models can use both spatial and spectral information, which is conducive to better performance. The time difference between the channels can be obtained by end-to-end training or by manually designed features [25, 26]. We directly calculate the correlation between the channels and integrate it into the network as an additional feature. In detail, similar to self-attention [28], we calculate the attention correlation between the channels. Setting channel 1 as the reference, the inter-channel attention correlation (IAC) is

$$\mathrm{IAC} = \mathrm{softmax}(E_1 E_2^T), \tag{2}$$

where $E_1$ and $E_2$ represent the outputs of encoder 1 and encoder 2, respectively. These channel differences may contribute to the inference module. The feature extraction module has two different outputs:

$$F = [E_1, E_2], \quad F_o = [\mathrm{IAC}, E_1, E_2], \tag{3}$$

where $F_o$ is sent to the inference module and $F$ is fed into the separation module.
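To make Eqs. (1)-(3) concrete, a minimal PyTorch sketch of the feature extraction module follows. The tensor layout and the projection of the attention map back onto the channel-2 features (so that the IAC has a shape that can be concatenated with the encoder outputs) are our assumptions, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtraction(nn.Module):
    """Sketch of Eqs. (1)-(3): per-channel 1D conv encoders plus the
    inter-channel attention correlation (IAC)."""
    def __init__(self, n_filters=256, kernel=40, stride=20):
        super().__init__()
        self.encoder1 = nn.Conv1d(1, n_filters, kernel, stride=stride)  # Eq. (1), channel 1
        self.encoder2 = nn.Conv1d(1, n_filters, kernel, stride=stride)  # Eq. (1), channel 2

    def forward(self, ch1, ch2):
        # ch1, ch2: (batch, 1, samples) raw waveforms of the two microphones.
        e1 = self.encoder1(ch1).transpose(1, 2)   # (batch, frames, filters)
        e2 = self.encoder2(ch2).transpose(1, 2)
        # Eq. (2): attention correlation with channel 1 as the reference.
        attn = F.softmax(torch.bmm(e1, e2.transpose(1, 2)), dim=-1)  # (batch, frames, frames)
        # Project the attention map onto the channel-2 features so the
        # IAC is concatenable along the feature axis (our assumption).
        iac = torch.bmm(attn, e2)                  # (batch, frames, filters)
        f = torch.cat([e1, e2], dim=-1)            # Eq. (3): to the separation module
        f_o = torch.cat([iac, e1, e2], dim=-1)     # Eq. (3): to the inference module
        return f, f_o
```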
2.2. Inference module

In the inference module, a Seq2Seq-based mechanism [14, 15] is applied to infer the speakers and directions in a sequential manner. First, the features are mapped into high-level vectors by stacked bi-directional long short-term memory (BLSTM) layers:

$$h = \mathrm{BLSTM}(F_o), \tag{4}$$

where $h$ is the hidden state and $F_o$ is the input feature of the inference module.

We use two independent decoding networks to infer the speakers and the directions, respectively. Considering that not all speech features contribute equally to inferring the speakers and directions at each step, attention mechanisms are utilized to produce context vectors by focusing on different portions of the sequence and aggregating the hidden representations. The two attentive decoding networks follow similar procedures; here we formulate the speaker-inferred decoder:

$$\alpha_{ti} = \mathrm{softmax}\big(\tanh(W s_{t-1} + U h_i)\big), \tag{5}$$

$$c_t = \sum_{i=1}^{T} \alpha_{ti} h_i, \tag{6}$$

where $W$ and $U$ are weights, $s_{t-1}$ is the hidden state of the decoder at time step $t-1$, and $c_t$ is the context vector at time step $t$.

For the decoding networks, a global embedding strategy is introduced to alleviate the problem of exposure bias [15], and the embedding feature at time step $t$ is calculated as follows:

$$e^a_t = \sum_{j=1}^{N} y^j_{t-1} e_j, \tag{7}$$

$$g = \mathrm{sigmoid}(W e_t + U e^a_t), \tag{8}$$

$$e^s_t = g \odot e_t + (1 - g) \odot e^a_t, \tag{9}$$

where $N$ is the number of speakers, $y^j_{t-1}$ is the $j$-th element of $y_{t-1}$, and $e_j$ is the embedding vector of the $j$-th speaker. $e^a_t$ denotes the weighted average embedding at time $t$, and $e_t$ is the embedding of the label with the highest probability under $y_{t-1}$, following [15]. $W$ and $U$ are weight matrices, $e^s_t$ represents the speaker embedding at time $t$, and $\odot$ denotes element-wise multiplication. The hidden state $s_t$ of the decoder at time step $t$ is computed as

$$s_t = \mathrm{LSTM}(s_{t-1}, [e^s_t; c_t]). \tag{10}$$

The final output is calculated as

$$y_t = \mathrm{softmax}\big(W_f (W_1 s_t + W_2 c_t)\big), \tag{11}$$

where $W_f$, $W_1$, and $W_2$ are weights, and $y_t$ represents the inferred probability distribution over speakers at time step $t$. At each time step, rather than the final output $y_t$, the speaker embedding $e^s_t$ is selected as the speaker mask. When the inferred $y_t$ corresponds to <EOS> (End-of-Sequence), the decoding process stops.

The direction mask $e^d_t$ is inferred by the same procedure. The source mask is then obtained as

$$sm_t = e^s_t + e^d_t, \tag{12}$$

where $sm_t$ is the $t$-th source mask inferred by the inference module. The two attentive decoders run simultaneously; if either decoder infers an <EOS>, both decoders are stopped. At test time, the beam search algorithm [29] is applied to find the top-ranked inference.
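One decoding step of the attentive speaker decoder (Eqs. (5)-(11)) could be sketched as below. The scoring vector `v` (reducing the tanh term to a scalar per frame), the fused weight matrices, and the label-set size are our assumptions about details the paper leaves implicit:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveDecoderStep(nn.Module):
    """One step of the attentive speaker decoder, Eqs. (5)-(11) (a sketch)."""
    def __init__(self, hid=512, enc=512, emb=256, n_labels=103):
        super().__init__()
        # Attention scoring, Eq. (5); `v` reduces the tanh term to a scalar.
        self.W = nn.Linear(hid, enc, bias=False)
        self.U = nn.Linear(enc, enc, bias=False)
        self.v = nn.Linear(enc, 1, bias=False)
        # Global-embedding gate, Eqs. (8)-(9), with [e_t; e_t^a] fused.
        self.gate = nn.Linear(emb * 2, emb)
        self.cell = nn.LSTMCell(emb + enc, hid)          # Eq. (10)
        self.out = nn.Linear(hid + enc, n_labels)        # Eq. (11), W_f(W_1 s_t + W_2 c_t) fused
        self.embed = nn.Embedding(n_labels, emb)         # 101 speakers + <BOS>/<EOS> (assumed)

    def forward(self, h, s_prev, y_prev):
        # h: (batch, T, enc) BLSTM states; s_prev: decoder LSTM state tuple;
        # y_prev: (batch, n_labels) previous softmax output.
        score = self.v(torch.tanh(self.W(s_prev[0]).unsqueeze(1) + self.U(h)))
        alpha = F.softmax(score, dim=1)                  # Eq. (5): weights over the T frames
        c_t = (alpha * h).sum(dim=1)                     # Eq. (6): context vector
        e_a = y_prev @ self.embed.weight                 # Eq. (7): weighted average embedding
        e_t = self.embed(y_prev.argmax(dim=-1))          # embedding of the most likely label
        g = torch.sigmoid(self.gate(torch.cat([e_t, e_a], dim=-1)))   # Eq. (8)
        e_s = g * e_t + (1 - g) * e_a                    # Eq. (9): this step's speaker mask
        s_t = self.cell(torch.cat([e_s, c_t], dim=-1), s_prev)        # Eq. (10)
        y_t = F.softmax(self.out(torch.cat([s_t[0], c_t], dim=-1)), dim=-1)  # Eq. (11)
        return y_t, e_s, s_t
```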
2.3. Separation module

Temporal convolutional networks (TCNs) [30] effectively memorize long-term dependencies, and dilation [30] is used to continuously expand the receptive field. The separation module is the same as that of TasNet [8]. In detail, it consists of four convolutional blocks; in each block, to expand the receptive field, the dilated convolutions are repeated $R$ times with dilation rates $1, 2, 4, \ldots, 2^{R-1}$. A sigmoid activation then scales the output.

To generate the separated outputs, the decoding process is the inverse of the encoding layer: it decodes the feature representations back to speech samples. Specifically, we use a 1D transposed convolution:

$$Z_i = F \odot \mathrm{TCN}_o \odot sm_i, \quad D(Z)_i = \mathrm{TransposedConv}(Z_i), \quad i = 1, \ldots, n, \tag{13}$$

where $\mathrm{TCN}_o$ denotes the output of the TCN layers, $Z_i$ represents the high-level feature representation of the $i$-th inferred source, $n$ is the number of source masks inferred for the mixture, and $D(\cdot)_i$ is the $i$-th separated output.

End-to-end training is performed with three losses: a raw-waveform SiSNR separation loss, a cross-entropy speaker-inference loss, and a cross-entropy direction-inference loss:

$$\mathcal{L} = -\mathcal{L}_{\mathrm{SiSNR\text{-}SS}} + \lambda \times (\mathcal{L}_{\mathrm{CE\text{-}Spk}} + \mathcal{L}_{\mathrm{CE\text{-}Dir}}), \tag{14}$$

where $\lambda$ is a hyper-parameter. For the inference module, speaker indexes act as the speaker labels (101 speakers in this experiment), and 37 directions, distributed from 0 to 180 degrees at a 5-degree interval, are chosen as the direction labels; the direction labels are generated during data simulation. Meanwhile, <BOS> (Begin-of-Sequence) and <EOS> tokens are added to the speaker and direction label sets. For each sample, <BOS> is placed at the top of the label sequence and <EOS> at the end: <BOS> signals the network to start inferring, and <EOS> lets the network determine the end of decoding.
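A minimal sketch of the training objective in Eq. (14), with SI-SNR defined as in [8]; the reduction across sources and batch is our assumption:

```python
import torch
import torch.nn.functional as F

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between estimated and reference waveforms [8]."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

def sdnet_loss(est, ref, spk_logits, spk_labels, dir_logits, dir_labels, lam=5.0):
    """Eq. (14): negative SI-SNR separation loss plus lambda-weighted
    speaker- and direction-inference cross-entropy losses (lambda = 5)."""
    l_sep = -si_snr(est, ref).mean()                  # maximize SI-SNR of the separated speech
    l_spk = F.cross_entropy(spk_logits, spk_labels)   # speaker-inference loss
    l_dir = F.cross_entropy(dir_logits, dir_labels)   # direction-inference loss
    return l_sep + lam * (l_spk + l_dir)
```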
Table 1: The effect of SNet-time on single-channel anechoic datasets and comparison of different methods on SDR improvement (dB). ("?": value not recoverable; "−": not reported.)

System            WSJ0-2mix   WSJ0-3mix   WSJ0-2&3mix
SNet-time         ?.35        9.87        10.?
SNet [13]         ?.52        5.14        7.?
DPCL++ [31]       ?           ?           ?
uPIT-BLSTM [3]    ?           ?           ?
TasNet [8]        ?           ?           −
OR-PIT [12]       ?           ?           −
3. EXPERIMENTS

3.1. Experimental setup
The proposed methods are evaluated on the 8 kHz single- and dual-channel WSJ0-2mix, WSJ0-3mix, and WSJ0-2&3mix datasets [1]. For both the single-channel and stereo versions, WSJ0-2mix and WSJ0-3mix contain 30 hours of training data, 10 hours of development data, and 5 hours of test data. The mixing signal-to-noise ratios, speaker pairs, and dataset partition exactly follow [1]. WSJ0-2&3mix is the union of WSJ0-2mix and WSJ0-3mix. Anechoic and reverberant stereo datasets are generated by convolving the clean speech with room impulse responses [32, 33]. For the reverberant datasets, the reverberation time is uniformly sampled from 40 ms to 200 ms. We place 2 microphones at the center of the room, 10 cm apart, and the sound sources are randomly placed in the room. The training set and the test set contain 101 and 18 speakers, respectively; the speakers in the test set are different from those in the training and development sets. During training, the label order in the inference module is sorted in descending order of speech energy.

In (inChannel, outChannel, kernel, stride) format, the 1D convolution of the encoder in the feature extraction module has a (1, 256, 40, 20) kernel with no pooling, corresponding to a frame length of 5 ms and a shift of 2.5 ms. In the inference module, the BLSTM has 3 layers with 256 nodes in each direction, and the two LSTM-based decoders both have 3 layers with 512 nodes. The dimension of the speaker and direction embeddings is 256. In the separation module, the TCN runs with four convolutional blocks and R = 8 in each block, and the transposed convolution has a (256, 1, 40, 20) kernel. For the loss, λ = 5. SDNet takes raw waveforms as input and outputs raw waveforms.

In our experiments, we build several baselines. SNet [13], which operates in the frequency domain, acts as the baseline, and SNet-2ch denotes the dual-channel version of SNet. We also build a dual-channel TasNet, named TasNet-2ch, whose channel differences are learned in an end-to-end manner. These models are trained with the same datasets as our models.
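The stereo data simulation described above can be sketched as follows, assuming the two-channel room impulse responses have already been generated with an image-source simulator [32, 33]; the function name and array shapes are ours:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(sources, rirs):
    """Convolve each clean source with its 2-channel room impulse response
    and sum across sources to form the dual-channel mixture.
    sources: list of 1-D numpy arrays; rirs: list of (2, rir_len) arrays."""
    n = max(len(s) for s in sources)
    mixture = np.zeros((2, n))
    images = []
    for src, rir in zip(sources, rirs):
        ch = np.stack([fftconvolve(src, rir[c])[:n] for c in range(2)])
        images.append(ch)                 # per-source spatialized reference signal
        mixture[:, :ch.shape[1]] += ch    # accumulate into the stereo mixture
    return mixture, images
```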
Table 2: The effect of different configurations on dual-channel datasets and comparisons on SISNRi and SDRi improvements (dB). ("?": value not recoverable; "−": not reported.)

System              Domain   Data Type     WSJ0-2mix        WSJ0-3mix        WSJ0-2&3mix
                                           SISNRi   SDRi    SISNRi   SDRi    SISNRi   SDRi
SNet-2ch            Freq.    Anechoic      ?.62     14.25   10.31    10.03   11.12    11.?
SNet-time-2ch       Time     Anechoic      ?.88     20.61   14.32    14.11   17.43    17.?
SNet-time-2ch+IAC   Time     Anechoic      ?.13     20.89   15.41    15.02   18.65    18.?
SDNet               Time     Anechoic      ?.71     25.31   17.46    17.26   21.92    21.56
TasNet-2ch          Time     Anechoic      ?.21     25.08   17.31    17.?    −        −
SNet-2ch            Freq.    Reverberant   ?.32     7.28    5.53     5.15    6.53     6.?
SNet-time-2ch       Time     Reverberant   ?.43     8.35    6.62     6.41    7.32     7.?
SNet-time-2ch+IAC   Time     Reverberant   ?.76     8.59    6.93     6.86    7.88     7.?
SDNet               Time     Reverberant   ?.57     10.64   8.49     8.55    9.91     9.08
TasNet-2ch          Time     Reverberant   ?.78     10.83   9.08     9.?     −        −

3.2. Results

As the experimental results in Tables 1 and 2 show, the proposed methods can effectively separate the mixed speech. In Table 1, SNet is first transferred into the time domain as SNet-time. Compared with SNet, SNet-time achieves a performance improvement, which we attribute to the time-domain end-to-end training. SNet-time-2ch denotes the dual-channel SNet-time; compared with SNet-time, it achieves a significant performance improvement, which shows that the spatial information can be utilized by our network.

IAC is used to extract the differences between the channels, and the extracted features are only used in the inference module. The time-domain dual-channel model with IAC is named SNet-time-2ch+IAC. As shown in Table 2, the models with IAC achieve performance improvements on both the anechoic and reverberant datasets.

When reverberation is added, performance degrades, but SDNet achieves performance improvements on both the anechoic and reverberant datasets. When separating the mixture, the speaker and direction can be inferred by SDNet, and this inferred information is conducive to output selection. Our final model, SDNet, achieves SDR improvements of 25.31 dB, 17.26 dB, and 21.56 dB on the anechoic WSJ0-2mix, WSJ0-3mix, and WSJ0-2&3mix datasets, and of 10.64 dB, 8.55 dB, and 9.08 dB on their reverberant counterparts, respectively.

The performance of SNet-time and SNet-time-2ch is worse than that of the corresponding TasNets. This is due to the speaker mismatch between the training set and the test set, which results in inaccurate speaker masks at test time; the direction-inference mechanism in SDNet can effectively alleviate this problem, although reverberation has a negative impact on direction inference. SDNet performs similarly to TasNet-2ch but does not need prior knowledge of the number of outputs.
The advantage of our proposed model is that it can dynamically estimate the number of sound sources. In WSJ0-2mix and WSJ0-3mix, the number of speakers in each mixture is fixed, and we find that the models learn this pattern: in these experiments, the inference accuracies are close to 100%. Therefore, we construct the WSJ0-2&3mix dataset and perform experiments on it. The results are shown in Table 3.

Table 3: Inference accuracy of the number of sources on the reverberant WSJ0-2&3mix dataset. ("?": value not recoverable.)

Model                 Accuracy (%)
SNet-time             ?
SNet-time-2ch         ?
SNet-time-2ch+IAC     ?
SDNet                 89.73

For comparison, SNet-time in Table 3 is evaluated on the same reverberant dataset but only on the reference channel. Compared with SNet-time, the inference accuracy of SNet-time-2ch is greatly improved, which indicates that the spatial information is learned by the model and used to increase the discrimination of the sound sources. Compared with SNet-time-2ch, SNet-time-2ch+IAC infers the number of sound sources more accurately, which shows that the extracted channel differences are beneficial to our system. SDNet achieves 89.73%, which indicates that the proposed method makes better use of the spatial information.
4. CONCLUSIONS
We propose a time-domain speaker and direction-inferred dual-channel speech separation network, which first infers the speakers and directions and then integrates them as source masks to separate the mixed speech. Experimental results show that SDNet effectively separates mixtures under both anechoic and reverberant conditions and deals with the problems of an unknown number of sources in the mixture and of output selection.
5. ACKNOWLEDGMENTS
This work was done while Chenxing Li was working as a visiting student at Columbia University. We thank Yi Luo and Cong Han for their helpful suggestions. Chenxing Li, Jiaming Xu, and Bo Xu were funded by a grant from the Major Project for New Generation of AI (2018AAA0100400) and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB32070000).

6. REFERENCES

[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in IEEE ICASSP, 2016, pp. 31-35.
[2] Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in IEEE ICASSP, 2017, pp. 246-250.
[3] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901-1913, 2017.
[4] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, "Alternative objective functions for deep clustering," in IEEE ICASSP, 2018, pp. 686-690.
[5] C. Li, L. Zhu, S. Xu, P. Gao, and B. Xu, "CBLDNN-based speaker-independent speech separation via generative adversarial training," in IEEE ICASSP, 2018, pp. 711-715.
[6] Z.-Q. Wang, K. Tan, and D. Wang, "Deep learning based phase reconstruction for speaker separation: A trigonometric perspective," in IEEE ICASSP, 2019, pp. 71-75.
[7] Y. Liu and D. Wang, "Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2092-2102, 2019.
[8] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.
[9] Z. Shi, H. Lin, L. Liu, R. Liu, J. Han, and A. Shi, "Deep attention gated dilated temporal convolutional networks with intra-parallel convolutional modules for end-to-end monaural speech separation," in Interspeech, 2019, pp. 3183-3187.
[10] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in IEEE ICASSP, 2020, pp. 46-50.
[11] T. Higuchi, K. Kinoshita, M. Delcroix, K. Žmolíková, and T. Nakatani, "Deep clustering-based beamforming for separation with unknown number of sources," in Interspeech, 2017, pp. 1183-1187.
[12] N. Takahashi, S. Parthasaarathy, N. Goswami, and Y. Mitsufuji, "Recursive speech separation for unknown number of speakers," in Interspeech, 2019, pp. 1348-1352.
[13] J. Shi, J. Xu, and B. Xu, "Which ones are speaking? Speaker-inferred model for multi-talker speech separation," in Interspeech, 2019, pp. 4609-4613.
[14] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
[15] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, "SGM: Sequence generation model for multi-label classification," in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 3915-3926.
[16] Y. Luo and N. Mesgarani, "Separating varying numbers of sources with auxiliary autoencoding loss," in Interspeech, 2020.
[17] J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, "Deep extractor network for target speaker recovery from single channel speech mixtures," in Interspeech, 2018, pp. 307-311.
[18] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, "Single channel target speaker extraction and recognition with SpeakerBeam," in IEEE ICASSP, 2018, pp. 5554-5558.
[19] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Interspeech, 2019, pp. 2728-2732.
[20] C. Xu, W. Rao, E. S. Chng, and H. Li, "SpEx: Multi-scale time domain speaker extraction network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1370-1384, 2020.
[21] M. Ge, C. Xu, L. Wang, E. S. Chng, J. Dang, and H. Li, "SpEx+: A complete time domain speaker extraction network," in Interspeech, 2020.
[22] G. Li, S. Liang, S. Nie, W. Liu, M. Yu, L. Chen, S. Peng, and C. Li, "Direction-aware speaker beam for multi-channel speaker extraction," in Interspeech, 2019, pp. 2713-2717.
[23] R. Gu and Y. Zou, "Temporal-spatial neural filter: Direction informed end-to-end multi-channel target speech separation," arXiv preprint arXiv:2001.00391, 2020.
[24] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251-1258.
[25] Z.-Q. Wang and D. Wang, "Integrating spectral and spatial features for multi-channel speaker separation," in Interspeech, 2018, pp. 2718-2722.
[26] R. Gu, J. Wu, S.-X. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, "End-to-end multi-channel speech separation," arXiv preprint arXiv:1905.06286, 2019.
[27] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[29] S. Wiseman and A. M. Rush, "Sequence-to-sequence learning as beam-search optimization," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1296-1306.
[30] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[31] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," in Interspeech, 2016, pp. 545-549.
[32] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950, 1979.
[33] E. A. Lehmann and A. M. Johansson, "Prediction of energy decay in room impulse responses simulated with an image-source model,"