On permutation invariant training for speech source separation
Xiaoyu Liu Jordi Pons
Dolby Laboratories
ABSTRACT
We study permutation invariant training (PIT), which targets the permutation ambiguity problem for speaker-independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss scalable to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features", to reduce the local permutation errors made by utterance-level PIT (uPIT). Our results show that the proposed extensions help reduce permutation ambiguity. However, we also note that the studied STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.
Index Terms — Speech source separation, permutation invariant training, waveform-based models, spectrogram-based models.
1. INTRODUCTION
The permutation ambiguity problem occurs when training speaker-independent deep learning models in a supervised fashion. The goal of such systems is to separate C speech sources x_c(t) from their mixture waveform y(t) = Σ_{c=1}^{C} x_c(t). Accordingly, C! permutations of speaker outputs x̂_c(t) to ground truth x_c(t) pairs exist for computing the loss. However, these pairs cannot be arbitrarily assigned due to the speaker-independent nature of the task. Utterance-level permutation invariant training (uPIT) [1] addresses this problem by minimizing the smallest separation loss of all permutations computed over the entire utterance, thus enforcing permutation consistency across frames. However, the assumption that all frames share the same permutation may lead to sub-optimal frame-level separation, causing local speaker swaps and leakage [2]. On the other hand, frame-level PIT (tPIT) [3] performs PIT for each frame independently, and thus achieves excellent frame-level separation quality. However, the permutation frequently changes over frames at inference time.

Improvements to PIT roughly fall into two categories: (i) designing a permutation (or speaker) tracking algorithm for tPIT [2, 4, 5]; and (ii) designing better uPIT objectives to further strengthen permutation consistency [6–9]. Along these two lines, our work takes a close look at tPIT+clustering, a recent idea introduced by Deep CASA [2], which targets accurate frame-level separation (tPIT) and speaker tracking (clustering) in two stages. We also explore another promising method based on a uPIT+speaker-ID loss [9], which introduces an additional deep feature loss term (speaker-ID) to help uPIT reduce local speaker swaps. In this paper, we extend these two training strategies for Conv-TasNet [10], a fully convolutional version of TasNet [10–13] that models speaker separation in the waveform domain.

In section 2, we extend the tPIT+clustering training algorithm of Deep CASA, a spectrogram-based model, to Conv-TasNet, which uses very short waveform frames (such as 2 ms). We find that tPIT based on such short waveform frames can be challenging. Therefore, we propose performing tPIT in a pre-trained latent space, which provides a more meaningful feature space for tPIT than the short waveform frames. Further, when training the clustering model, Deep CASA employs a memory and computationally expensive pairwise similarity loss that does not scale to waveform inputs. We propose a loss that reduces the complexity from quadratic to linear, making the training of the clustering model feasible for waveform models.

In section 3, we extend the uPIT+speaker-ID loss with PASE, a problem-agnostic speech encoder [14–16]. PASE is pre-trained in a self-supervised fashion with a collection of objectives much broader than speaker-ID, to extract general-purpose speech embeddings from waveforms. In addition, we also look at conditioning Conv-TasNet with PASE embeddings in a cascaded system consisting of two steps: (i) uPIT+PASE speaker separation, and (ii) conditioning Conv-TasNet with PASE embeddings computed from the speakers separated in step (i).

Conv-TasNet and Deep CASA can both be interpreted as architectures with an encoder/decoder and a separator, where the encoder/decoder in Conv-TasNet is learnable whereas in Deep CASA it is the STFT. Previous works have already compared learnable and signal processing-based encoders/decoders [17–21].
However, from the permutation ambiguity perspective, it remains an open question whether using waveform- or STFT-based models has any advantages. Section 4 shows that our tPIT+clustering and uPIT+PASE extensions for Conv-TasNet help reduce permutation errors and improve generalization. However, Deep CASA outperforms the Conv-TasNet variants we study, highlighting the advantages of STFT-based models and the remaining challenges for waveform-based models from the permutation ambiguity perspective.
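To make the uPIT objective discussed above concrete, the following is a minimal, hypothetical PyTorch sketch of utterance-level PIT with a negative SI-SNR loss. It illustrates the idea of keeping a single permutation per utterance; it is not the exact implementation used in this paper or in [1], and the helper names are ours.

```python
import itertools
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) between estimate and reference waveforms (..., time)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (scale-invariant target).
    proj = (torch.sum(est * ref, dim=-1, keepdim=True) /
            (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    return 10 * torch.log10(torch.sum(proj ** 2, dim=-1) /
                            (torch.sum(noise ** 2, dim=-1) + eps) + eps)

def upit_loss(estimates, references):
    """Utterance-level PIT: one permutation is kept for the whole utterance.

    estimates, references: (batch, C, time) with C sources.
    Returns the negative SI-SNR of the best permutation, averaged over the batch.
    """
    C = estimates.shape[1]
    losses = []
    for perm in itertools.permutations(range(C)):  # C! candidate assignments
        loss = -si_snr(estimates, references[:, list(perm), :]).mean(dim=1)  # (batch,)
        losses.append(loss)
    best, _ = torch.stack(losses, dim=0).min(dim=0)  # best permutation per utterance
    return best.mean()
```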
2. tPIT + CLUSTERING FOR CONV-TASNET
In this section, we introduce the tPIT+clustering algorithm in Deep CASA, and explain how we adapt it for Conv-TasNet. Deep CASA [2] is formulated for two speakers (C = 2), but can be generalized to more speakers [4]. We also set C = 2 for our work.

We investigate several tPIT variants: (i) tPIT-STFT, the tPIT step used in Deep CASA for spectrogram-based models; (ii) tPIT-time, our extension for training with tPIT directly in the waveform domain; and (iii) tPIT-latent, another extension we propose for training Conv-TasNet with tPIT in a learned latent space.

Fig. 1. tPIT training for spectrograms (top) and waveforms (bottom).

tPIT-STFT (Fig. 1, top) — In Deep CASA, the mixture signal y(t) is converted to a complex-valued STFT Y(k, f), where k and f denote the discrete frame and frequency indices. A separator, based on Dense-UNet, computes the separated complex STFT X̂_c(k, f) by predicting a multiplicative mask for each speaker c. Next, for each individual frame, given the ground truth STFT, the best permutation with the smallest spectral distance is found and used to reorganize frames into speaker-consistent separations. After the inverse STFT, the signal-to-noise ratio (SNR) loss is used for training.

tPIT-time (Fig. 1, bottom) — To adapt tPIT to the waveform domain, we first simply compute the tPIT loss directly in the time domain. In this paper, we investigate Conv-TasNet [10], which employs a learned encoder/decoder instead of a fixed STFT. The encoder projects the mixture waveform into a latent space: E = UY, where Y ∈ R^{L×K} stores K overlapping frames of length L and stride S, U ∈ R^{N×L} contains N learnable basis filters, and E ∈ R^{N×K} is the latent space representation of the mixture waveform. The decoder performs the inverse mapping as X̂_c = Ŝ_c^T V, where V ∈ R^{N×L} contains N decoder basis filters, Ŝ_c ∈ R^{N×K} is the c-th output of the separator, and X̂_c ∈ R^{K×L} contains K frames. In this work, we use L = 16 and S = 8 samples for our 8 kHz experiments (and L = 32 and S = 16 for 16 kHz experiments), with N = 512. Consecutive frames in X̂_c may not belong to the same speaker due to permutation ambiguity. Accordingly, during training, the best tPIT permutation π*_k for each frame k can be computed independently via the frame-wise distance in the waveform domain:

π*_k = argmin_{π_k ∈ P} Σ_{c=1}^{C} ‖x̂_{c,k} − x_{π_k(c),k}‖,

where P is the set of all C! permutations, x̂_{c,k} is the k-th frame of the c-th separator output, and x_{π_k(c),k} denotes the same frame of the ground truth source tied to the c-th estimate under permutation π_k. After reordering frames according to π*_k, overlap-and-add (OLA in Fig. 1) is used to reconstruct speaker-consistent predictions. Finally, the scale-invariant SNR (SI-SNR) [22] is used to train the model.
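A minimal sketch of the frame-level permutation search above, for two speakers (hypothetical PyTorch): each frame independently picks the assignment with the smallest distance, and the estimated frames are swapped accordingly before overlap-add. An L1 frame distance is used here purely for illustration; the exact norm is left to the equations in the text.

```python
import torch

def tpit_reorder_frames(est_frames, ref_frames):
    """Frame-level PIT for C = 2 sources.

    est_frames, ref_frames: (C=2, K, L) tensors of K frames of length L.
    Returns the estimated frames reordered to follow the best per-frame
    permutation pi*_k, plus the per-frame swap labels.
    """
    # Distance of each frame under the identity and the swapped permutation.
    d_keep = (est_frames - ref_frames).abs().sum(dim=(0, 2))                  # (K,)
    d_swap = (est_frames - ref_frames.flip(dims=(0,))).abs().sum(dim=(0, 2))  # (K,)
    swap = d_swap < d_keep                                                    # (K,) bool: permute or not
    reordered = est_frames.clone()
    reordered[:, swap, :] = est_frames.flip(dims=(0,))[:, swap, :]            # swap sources on those frames
    return reordered, swap  # `swap` later serves as the permutation label for clustering
```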
tPIT-latent (Fig. 2) — TasNet-like architectures typically use a much shorter frame length and stride than spectrogram-based models. Such short frames may not be adequate to accurately perform tPIT directly in the time domain. Thus, we propose to perform tPIT in a learned latent space. However, due to its latent nature, it is hard to define a pre-existing target for tPIT training. This is a major difference with respect to Deep CASA, which naturally uses the STFT as the ground truth. To overcome this challenge, we train the encoder/decoder and the separator separately.

Fig. 2. tPIT training in the latent space. First, train the encoder/decoder to generate the optimal latent representation (top). Next, train the separator only with the tPIT loss (bottom).

In Fig. 2 (top), the encoder converts the mixture signal y(t) and each speech source x_c(t) to a latent space (with non-negative values after the ReLU). The softmax computes the ideal masks from the ground truth signals, which are multiplied with the mixture representation to obtain the ideal latent features S_c ∈ R^{N×K} of the separated signals, where N and K are the number of basis filters and frames, respectively. The decoder performs the inverse transform. The encoder and decoder are learned by minimizing the SI-SNR loss. Note that, in this step, the ground truth and the separated signals have the same order. Hence, no permutation ambiguity is introduced, which is important for S_c to become targets when later training the separator. The idea of training the separator separately from the encoder/decoder was first introduced in [23]. Our contribution is to use it for setting a visible target for tPIT-latent. In Fig. 2 (bottom), the encoder/decoder is frozen (dashed line), and the separator is trained by minimizing the tPIT loss:

loss_tPIT = (1 / KNC) Σ_{k=1}^{K} min_{π_k ∈ P} Σ_{c=1}^{C} ‖ŝ_{c,k} − s_{π_k(c),k}‖,

where ŝ_{c,k} is the k-th frame (column) of the c-th predicted source in the latent space Ŝ_c, which is compared, via the frame-wise distance, with the k-th frame of the pre-trained representation of source π_k(c), where π_k is a frame-level permutation from the set P.
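The ideal latent targets of Fig. 2 (top) can be sketched as follows (hypothetical PyTorch). Here `encoder` is assumed to be a Conv-TasNet-style 1-D convolutional encoder with a ReLU returning (batch, N, K) latents; applying the softmax directly to the encoded ground-truth sources is one plausible realization of the masking described above, not necessarily the exact one used in our model.

```python
import torch

def ideal_latent_targets(encoder, mixture, sources):
    """Compute ideal latent features S_c for tPIT-latent training.

    mixture: (batch, time) waveform; sources: (batch, C, time) ground-truth waveforms.
    encoder(x) is assumed to return a non-negative latent map of shape (batch, N, K).
    """
    e_mix = encoder(mixture)                                    # (batch, N, K)
    e_src = torch.stack([encoder(sources[:, c]) for c in range(sources.shape[1])],
                        dim=1)                                  # (batch, C, N, K)
    masks = torch.softmax(e_src, dim=1)                         # ideal masks over the C sources
    return masks * e_mix.unsqueeze(1)                           # (batch, C, N, K): targets S_c
```

Because the masks are derived from the ground-truth sources, the targets S_c keep the ground-truth ordering and carry no permutation ambiguity.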
Since tPIT separates speakers for each frame independently, the permutation frequently changes across frames in the separated signals. The goal of the clustering stage is to predict the permutation of each frame. The original clustering step in Deep CASA is based on a pairwise similarity loss in which the cosine distance between every pair of frames, projected in an embedding space, is computed to match either 1 or 0 (permute or not). Due to the small frame stride used by Conv-TasNet, this loss can be memory and computationally intensive: O(K²), where K is the number of frames. We propose a much more efficient solution based on the generalized end-to-end (GE2E) loss [24]. The latent features of the mixture E and of the separated signals Ŝ_1 and Ŝ_2 in Figs. 1 and 2 are jointly fed through a similarity model, which yields an embedding vector representing the permutation of each frame. The similarity model is trained such that frames having the same permutation are projected into clusters, enforced by a similarity loss. For each frame, we compute its squared Euclidean distance to the mean vectors of all clusters. Then, a permutation softmax (classification) loss is minimized:

loss_GE2E = Σ_{k=1}^{K} −log [ exp(−d(h_{k,p}, m_p)) / Σ_{i=1}^{C!} exp(−d(h_{k,p}, m_i)) ],

where h_{k,p} is the embedding of the k-th frame with permutation label p (with p obtained from the tPIT stage), m_i is the mean of the i-th cluster, and d(x, y) = w‖x − y‖² + b is the squared Euclidean distance with a learnable scale w > 0 (so as not to change the sign of d(·)) and bias b. The GE2E loss was originally developed to train speaker-ID models [24], but we adopt it to solve a new problem: training a permutation similarity model with approximately linear complexity (since C! ≪ K). Another change is the Euclidean distance, which works better than the cosine distance used in the original GE2E. At test time, K-means (also based on the Euclidean distance) is used to cluster the embeddings. Finally, the separated latent frames are reordered based on the predicted permutations.
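A hypothetical PyTorch sketch of the permutation-classification GE2E loss above. Cluster means are computed from the frame embeddings themselves, and the squared Euclidean distance is scaled by a positive learnable weight (kept positive here via a softplus, one possible choice). The details (batching, excluding the current frame from its own cluster mean) are omitted; this is an illustration of the loss shape, with complexity O(K·C!) rather than O(K²).

```python
import torch
import torch.nn.functional as F

def ge2e_permutation_loss(embeddings, labels, w, b):
    """GE2E-style permutation clustering loss.

    embeddings: (K, D) frame embeddings h_k; labels: (K,) permutation label p in {0, ..., C!-1};
    w, b: learnable scalars (e.g. nn.Parameter), with w mapped to a positive scale.
    """
    n_clusters = int(labels.max().item()) + 1
    means = torch.stack([embeddings[labels == i].mean(dim=0)
                         for i in range(n_clusters)])            # (C!, D) cluster means
    d = torch.cdist(embeddings, means) ** 2                      # (K, C!) squared Euclidean distances
    logits = -(F.softplus(w) * d + b)                            # scaled distances as negative logits
    return F.cross_entropy(logits, labels)                       # softmax classification over permutations
```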
3. uPIT + PASE FOR CONV-TASNET
Fig. 3. uPIT+PASE (solid) and its cascaded extension (dashed).

Nachmani et al. [9] augmented uPIT with a deep feature loss on speaker embeddings extracted from a pre-trained speaker-ID network and achieved fewer permutation errors. We also explore this direction, but replace the speaker-ID deep feature loss with PASE (see Fig. 3). The PASE [14, 15] encoder was pre-trained in a self-supervised fashion with objectives including speaker-ID, pitch, and MFCCs, among others. Hence, PASE embeddings may contain additional relevant information beyond speaker-ID. Also, the PASE encoder is fully differentiable. Thus, we can construct the uPIT+PASE loss as:

loss = loss_uPIT + Σ_{c=1}^{C} ‖PASE(x̂_c(t)) − PASE(x_{π*_u(c)}(t))‖,

where PASE(·) denotes a sequence of PASE embeddings, and x_{π*_u(c)}(t) is the reference signal tied to the c-th estimated source under the best uPIT permutation. The PASE loss term enforces the frame permutations to align with the best utterance permutation π*_u.

Previous works also show that conditioning source separation models with additional features improves performance [25–27], but whether feature conditioning helps reduce permutation errors has yet to be confirmed. Towards this end, we extend the single-stage uPIT+PASE paradigm to a two-stage cascaded system. The idea is that stage (i) separates the sources, and stage (ii) uses those estimates to generate a conditioning for Conv-TasNet (with PASE features). As depicted in Fig. 3, where dashed lines denote stage (ii), stage (i) separates the speech sources using a Conv-TasNet trained with uPIT+PASE, and stage (ii) uses PASE (which can now be fine-tuned) to extract features that condition a Conv-TasNet trained from scratch. Note that the Conv-TasNet of stage (ii) is guided by the separations from the Conv-TasNet of stage (i). Given that uPIT already finds the best permutation in stage (i), no PIT is used for training in stage (ii).
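A sketch of the combined objective above (hypothetical PyTorch). Here `pase` stands for any differentiable embedding network mapping waveforms to (batch, D, frames) features; the exact PASE interface is not assumed. `upit_loss_with_perm` is an assumed variant of uPIT that also returns the best utterance-level permutation, and the weighting `alpha` is added for generality (with alpha = 1 matching the unweighted sum in the equation).

```python
import torch

def upit_pase_loss(estimates, references, pase, upit_loss_with_perm, alpha=1.0):
    """uPIT loss plus a deep feature penalty on the best utterance-level permutation.

    estimates, references: (batch, C, time). `upit_loss_with_perm` is assumed to
    return (loss, best_perm) with best_perm of shape (batch, C).
    """
    loss_upit, best_perm = upit_loss_with_perm(estimates, references)
    refs_aligned = torch.gather(
        references, 1, best_perm.unsqueeze(-1).expand_as(references))  # reorder refs per pi*_u
    feat_loss = 0.0
    for c in range(estimates.shape[1]):
        # Distance between embedding sequences of estimate and aligned reference.
        feat_loss = feat_loss + torch.norm(
            pase(estimates[:, c]) - pase(refs_aligned[:, c]), dim=1).mean()
    return loss_upit + alpha * feat_loss
```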
4. EXPERIMENTS
We trained various models on the commonly used WSJ0 2-speaker (WSJ0-2mix) database [28]. The training set was created by randomly mixing utterances from 100 speakers at randomly selected SNRs between 0 and 5 dB. Previous works found that models trained on WSJ0-2mix might not generalize well to other datasets [13, 29]. To also evaluate model generalization, we tested our models not only on the WSJ0-2mix test set (16 unseen speakers), but also on the recently released Libri-2mix (40 speakers) and VCTK-2mix (108 speakers) test sets [29]. We used the clean version of those datasets, down-sampled to 8 kHz. For our uPIT+PASE experiments, we used the 16 kHz version since PASE was designed to work at this sampling rate. We trained all models with 4 sec speech segments. For the models trained with SI-SNR, we pre-processed the target signals by variance normalization using the standard deviation of the mixture, as in [23]. As the separator for Conv-TasNet models, we used the TCN version by Tzinis et al. [23]. We used the ADAM optimizer with a learning rate of 1e-3, and divided the learning rate by 2 after 5 consecutive epochs with no reduction in validation loss.
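The optimizer and learning-rate schedule just described can be set up roughly as follows (hypothetical PyTorch; `model` is whatever separation network is being trained, and the plateau scheduler approximates the "halve after 5 epochs without improvement" rule).

```python
import torch

def build_optimizer(model):
    """ADAM with lr 1e-3; halve the learning rate after 5 epochs without validation improvement."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=5)
    return optimizer, scheduler  # call scheduler.step(val_loss) once per epoch
```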
Our metrics measure the quality of the separation (SI-SNRi), the percentage of frame permutation errors (FER), and the percentage of "hard" examples (HSR) with an SI-SNRi of less than 5 dB. The FER and HSR metrics provide insights for studying permutation errors. SI-SNRi: the higher the better. FER and HSR: the lower the better.

– SI-SNRi (dB): the scale-invariant signal-to-noise ratio improvement measures the degree of separation of different models [22] (a computation sketch follows this list).
– FER (%): frame error rate [2]. For each utterance, we count the minimum percentage of inconsistent frame permutations with respect to all possible utterance-level permutations (permute or not).
– HSR (%): hard-sample rate [5]. HSR measures the percentage of "hard" samples (with an SI-SNRi of less than 5 dB). Informal listening reveals that these samples contain many speaker swaps. We adjusted the 5 dB threshold so that it reflects permutation errors.
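SI-SNRi and (for two speakers) FER can be computed as sketched below, reusing the `si_snr` helper from the earlier uPIT sketch; these are hypothetical helpers illustrating the metric definitions, not the exact evaluation scripts used in this paper.

```python
import torch

def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi (dB): SI-SNR of the separated estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)

def frame_error_rate(frame_perm_is_swapped):
    """FER (%) for C = 2: fraction of frames whose permutation disagrees with the
    utterance-level assignment, minimized over the two possible assignments."""
    swapped = frame_perm_is_swapped.float()
    return 100.0 * torch.minimum(swapped.mean(), 1.0 - swapped.mean())
```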
Table 1. SI-SNRi results of the tPIT+clustering algorithms. The "optimal clusters" use the target signals to reorder frames.

                                       WSJ0   Libri   VCTK
uPIT-waveform                          15.9   10.4     9.4
uPIT-STFT                              15.5   11.4    12.7
tPIT-STFT + optimal clusters           18.5   16.0    15.5
tPIT-STFT + clustering                 17.5   13.9    13.6
tPIT-time + optimal clusters           16.7   12.1    13.0
tPIT-time + clustering                 15.5    9.8     9.9
tPIT-latent: enc/dec (Fig. 2, top)     55.5   54.9    53.9
tPIT-latent + optimal clusters         17.6   12.9    13.7
tPIT-latent + clustering               16.5   11.0    11.0
tPIT-latent + clustering: clustering loss variants
  pairwise similarity loss             16.2   10.7    10.8
  GE2E loss                            16.5   11.0    11.0
In our clustering model, the latent features E and Ŝ_1, Ŝ_2 first go through a 1-D batch normalization, and are then concatenated and linearly projected to 512 channels, from which two 1-D convolutional layers (kernel size 3 with PReLU) further process them. The rest of the model follows the TCN architecture of the clustering model in Deep CASA [2]. Finally, a linear layer generates a 40-dim embedding (whose norm is normalized to 1).

• Spectrogram vs. waveform-based models for tPIT+clustering. In Table 1, we compare Conv-TasNet models with two STFT-based models: uPIT-STFT (the uPIT-trained variant of Deep CASA) and tPIT-STFT+clustering (Deep CASA). We trained Deep CASA using the official code [30]. Note that spectrogram-based models generalize better than waveform-based ones. It is known that models using signal processing transforms are less prone to overfitting than those using learnable encoders/decoders [17], but the optimal encoder/decoder results for tPIT-latent (obtained following the training procedure depicted in Fig. 2, top) suggest that the learned encoder/decoder does not overfit. Instead, this reveals that the separator can be a key factor for model generalization. The best frame permutation is the one with the smallest spectral distance, to be consistent with Deep CASA. We did not exclude silent frames.

Table 2. SI-SNRi and FER of uPIT+PASE and its cascaded version for Conv-TasNet. Results with speech at 16 kHz.

        uPIT-waveform     uPIT+PASE        uPIT+PASE cascaded   tPIT-latent+clustering
        SI-SNRi   FER     SI-SNRi   FER    SI-SNRi   FER        SI-SNRi   FER
WSJ0    15.5      5.2     15.9      4.5    17.5      4.6        16.0      4.3
Libri   10.7      9.0     10.8      8.0    11.9      7.6        11.1      7.8
VCTK     9.5     12.4      9.9     11.2    10.9     11.3        10.9      9.5
Table 3. Error analysis results: tPIT+clustering.

        uPIT (waveform)   tPIT-latent + clustering   tPIT-STFT + clustering
        FER      HSR      FER      HSR               FER      HSR
WSJ0    6.1      6.0      5.4      1.8               4.9      2.2
Libri   9.4     14.8      8.5      9.1               6.6      7.4
VCTK   12.3     22.8      9.4     10.7               7.8      7.2

Further, when looking at the error analysis in Table 3, note that the tPIT-STFT+clustering (Deep CASA) FER scores are the lowest. Deep CASA uses much longer frames (32 ms) than Conv-TasNet (2 ms), which might enable more accurate frame-based speaker separation (indicated by the tPIT-STFT+optimal clusters result in Table 1). Although short frames seem to be critical for achieving state-of-the-art results with waveform models [12], they also pose challenges related to permutation errors. These results denote that Deep CASA (an STFT-based model) is effective at reducing permutation ambiguity and improving generalization.

• tPIT+clustering for Conv-TasNet. In Table 1, we study the tPIT+clustering extensions we propose for Conv-TasNet: tPIT-time and tPIT-latent. The uPIT-waveform baseline does not generalize; note the large performance drop on the other two datasets. Further, tPIT-latent+clustering achieves improvements that generalize. However, tPIT-time does not yield as good performance as tPIT-latent, showing the advantage of performing tPIT in the latent space. Table 3 depicts our error analysis, where tPIT-latent+clustering outperforms uPIT-waveform. Fig. 4 shows the histogram of the VCTK-2mix SI-SNRi results. The tPIT-latent+clustering method pushes the distribution towards the right, significantly improving on "hard" examples (with SI-SNRi < 5 dB).

• GE2E loss. To study the GE2E loss we propose, we trained the same tPIT-latent+clustering model with the pairwise similarity loss of Deep CASA. Since their loss is memory intensive, to fit into GPU memory, for each forward pass we compute and backpropagate the loss 4 times (each over a 1 sec segment). Besides being memory and computationally efficient, the GE2E loss also provides better results (Table 1).
Fig. 4. Histogram of SI-SNRi (dB) results of tPIT-latent+clustering on VCTK-2mix. Red line: 5 dB threshold defining "hard" samples.
We used the PASE+ model [31], pre-trained with 50 hours of LibriSpeech (2338 speakers). It generates a 256-dim embedding every 10 ms. In the second stage of our cascaded system, each depth-wise convolution in Conv-TasNet's separator is conditioned on the PASE features of the separated speakers through a FiLM [32] layer (a sketch is given below). The weights in FiLM are shared across all layers, since this yielded better results than using dedicated FiLM weights per layer.

The results of our experiments are listed in Table 2. Adding the PASE term to uPIT (single stage) improves SI-SNRi and reduces permutation errors over all test sets. However, its improvements are smaller than those obtained by the tPIT-latent+clustering approach, implying the advantage of explicitly learning to permute with the clustering step. Also, note that the cascaded system we propose improves SI-SNRi across all test sets, but this is not the case for FER. This result suggests that the cascaded approach does not help amend the permutation errors introduced by uPIT in the first stage.
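A minimal sketch of the FiLM conditioning referred to above (hypothetical PyTorch): the PASE features of the separated speakers are projected to per-channel scale and shift parameters that modulate the separator activations. The layer names and dimensions below are illustrative, not the exact ones used in our model, and a single such module can be reused for all depth-wise convolutions, mirroring the shared-weights choice described in the text.

```python
import torch
import torch.nn as nn

class FiLMCondition(nn.Module):
    """Feature-wise linear modulation [32] of separator activations by PASE-like features."""

    def __init__(self, pase_dim=256, channels=512):
        super().__init__()
        # One shared projection produces per-channel scale (gamma) and shift (beta).
        self.proj = nn.Conv1d(pase_dim, 2 * channels, kernel_size=1)

    def forward(self, activations, pase_features):
        # activations: (batch, channels, frames); pase_features: (batch, pase_dim, frames_p).
        cond = nn.functional.interpolate(pase_features, size=activations.shape[-1],
                                         mode='nearest')  # align the two time resolutions
        gamma, beta = self.proj(cond).chunk(2, dim=1)
        return gamma * activations + beta
```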
Wavesplit [5], a recent WaveNet-like model, improved SI-SNRi to 21 dB by combining tPIT+clustering and speaker conditioning, denoting the potential of these two directions that we also find promising. It also reduced HSR (based on a 10 dB threshold) to 5.6%. We also measured the "10 dB HSR" for our tPIT-latent+clustering Conv-TasNet and obtained 5.0% (the 5 dB threshold was used in our previous results to better reflect permutation errors).

Prob-PIT [7] considers the probabilities of all utterance-level permutations, rather than just the best one, improving the initial training stage when wrong alignments are likely to happen. A similar idea is employed by Yang et al. [8], who trained a Conv-TasNet with uPIT and fixed alignments in turns, reporting 17.5 dB SI-SNRi. They also implemented Prob-PIT for Conv-TasNet and obtained 15.9 dB. Our SI-SNRi results (16.5 and 17.5 dB) are promising. Further, we also confirmed the reduction of permutation errors and the generalization of improvements to other test sets, which was not tested in [7, 8].
5. CONCLUSION
We explored two PIT directions to tackle the permutation ambiguity problem: tPIT+clustering and uPIT+PASE. tPIT+clustering was originally proposed in the STFT domain by Deep CASA. We adapted it to work with Conv-TasNet, a waveform-based model. Our extensions include tPIT-time, optimizing tPIT in the waveform domain, and tPIT-latent, optimizing tPIT over a learned latent space. Further, the original clustering model of Deep CASA does not scale to waveforms, and we propose using the GE2E loss, which is efficient and obtains better results. Our results on training Conv-TasNet with tPIT+clustering show that tPIT-latent outperforms tPIT-time and uPIT: it generalizes to other test sets and obtains fewer permutation errors. We also found that, although uPIT+PASE reduced permutation errors, it was not as effective as tPIT+clustering. Also, tPIT-STFT+clustering (Deep CASA) is more effective at improving generalization and reducing permutation ambiguity than the studied waveform-based models. This result provides a new perspective when comparing spectrogram- and waveform-based models: spectrogram-based models can help reduce permutation errors.

6. REFERENCES

[1] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,"
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
[2] Yuzhou Liu and DeLiang Wang, "Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2092–2102, 2019.
[3] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in ICASSP, 2017, pp. 241–245.
[4] Yuzhou Liu and DeLiang Wang, "A CASA approach to deep learning based speaker-independent co-channel speech separation," in ICASSP, 2018, pp. 5399–5403.
[5] Neil Zeghidour and David Grangier, "Wavesplit: End-to-end speech separation by speaker clustering," arXiv preprint arXiv:2002.08933, 2020.
[6] Chenglin Xu, Wei Rao, Xiong Xiao, Eng Siong Chng, and Haizhou Li, "Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM," in ICASSP, 2018, pp. 6–10.
[7] Midia Yousefi, Soheil Khorram, and John H. L. Hansen, "Probabilistic permutation invariant training for speech separation," Interspeech, pp. 4604–4608, 2019.
[8] Gene-Ping Yang, Szu-Lin Wu, Yao-Wen Mao, Hung-yi Lee, and Lin-shan Lee, "Interrupted and cascaded permutation invariant training for speech separation," in ICASSP, 2020, pp. 6369–6373.
[9] Eliya Nachmani, Yossi Adi, and Lior Wolf, "Voice separation with an unknown number of multiple speakers," arXiv preprint arXiv:2003.01531, 2020.
[10] Yi Luo and Nima Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[11] Yi Luo and Nima Mesgarani, "TasNet: Time-domain audio separation network for real-time, single-channel speech separation," in ICASSP, 2018, pp. 696–700.
[12] Yi Luo, Zhuo Chen, and Takuya Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in ICASSP, 2020, pp. 46–50.
[13] Berkan Kadıoğlu, Michael Horgan, Xiaoyu Liu, Jordi Pons, Dan Darcy, and Vivek Kumar, "An empirical study of Conv-TasNet," in ICASSP, 2020, pp. 7264–7268.
[14] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," Interspeech, pp. 161–165, 2019.
[15] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio, "Multi-task self-supervised learning for robust speech recognition," in ICASSP, 2020, pp. 6989–6993.
[16] David Álvarez, Santiago Pascual, and Antonio Bonafonte, "Problem-agnostic speech embeddings for multi-speaker text-to-speech with SampleRNN," in ISCA Speech Synthesis Workshop, 2019, pp. 35–39.
[17] David Ditter and Timo Gerkmann, "A multi-phase gammatone filterbank for speech separation via TasNet," in ICASSP, 2020, pp. 36–40.
[18] Jens Heitkaemper, Darius Jakobeit, Christoph Boeddeker, Lukas Drude, and Reinhold Haeb-Umbach, "Demystifying TasNet: A dissecting approach," in ICASSP, 2020, pp. 6359–6363.
[19] Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey, "Universal sound separation," in WASPAA, 2019, pp. 175–179.
[20] Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent, "Filterbank design for end-to-end speech separation," in ICASSP, 2020, pp. 6364–6368.
[21] Fahimeh Bahmaninezhad, Jian Wu, Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Meng Yu, and Dong Yu, "A comprehensive study of speech separation: Spectrogram vs waveform separation," Interspeech, pp. 4574–4578, 2019.
[22] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey, "SDR – half-baked or well done?," in ICASSP, 2019, pp. 626–630.
[23] Efthymios Tzinis, Shrikant Venkataramani, Zhepei Wang, Cem Subakan, and Paris Smaragdis, "Two-step sound source separation: Training on learned latent targets," in ICASSP, 2020, pp. 31–35.
[24] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP, 2018, pp. 4879–4883.
[25] Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, and Yuki Mitsufuji, "Improving voice separation by incorporating end-to-end speech recognition," in ICASSP, 2020, pp. 41–45.
[26] Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, and Dong Yu, "End-to-end multi-channel speech separation," arXiv preprint arXiv:1905.06286, 2019.
[27] Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, and Daniel P. W. Ellis, "Improving universal sound separation using sound classification," in ICASSP, 2020, pp. 96–100.
[28] Script to generate the multi-speaker dataset using WSJ0.
[29] Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent, "LibriMix: An open-source dataset for generalizable speech separation," arXiv preprint arXiv:2005.11262, 2020.
[30] The Deep CASA results reported in this paper are based on this code: https://github.com/yuzhou-git/deep-casa.
[31] The PASE+ model used in this paper can be retrieved here: https://github.com/santi-pdp/pase.
[32] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville, "FiLM: Visual reasoning with a general conditioning layer," in AAAI, 2018.