Exploring the time-domain deep attractor network with two-stream architectures in a reverberant environment
Hangting Chen a,b, Pengyuan Zhang a,b,∗

a Key Laboratory of Speech Acoustics & Content Understanding, Institute of Acoustics, CAS, China
b University of Chinese Academy of Sciences, Beijing, China

∗ Corresponding author: [email protected] (H. Chen); [email protected] (P. Zhang)

Keywords: Speech separation, Dereverberation, Deep attractor network, Time-domain network

Abstract
Despite the success of deep learning in speech signal processing, speaker-independent speech separation in a reverberant environment remains challenging. The deep attractor network (DAN) performs speech separation with speaker attractors in the time-frequency domain. The recently proposed convolutional time-domain audio separation network (Conv-TasNet) surpasses ideal masks on anechoic mixture signals, but its architecture leaves open the problem of separating signals with arbitrary numbers of speakers. Moreover, both models suffer performance degradation in a reverberant environment. In this study, we propose a time-domain deep attractor network (TD-DAN) with two-stream convolutional networks that efficiently performs both dereverberation and separation under the condition of variable numbers of speakers. The speaker encoding stream (SES) of the TD-DAN models speaker information and is explored with various waveform encoders. The speech decoding stream (SDS) accepts speaker attractors from the SES and learns to predict early reflections. Experiment results demonstrated that the TD-DAN achieved scale-invariant source-to-distortion ratio (SI-SDR) gains of …/… dB and …/… dB on the reverberant two- and three-speaker development/evaluation sets, exceeding Conv-TasNet by …/… dB and …/… dB, respectively.
1. Introduction
The speech signal captured by distant microphones often contains reverberation, noise, and multiple speakers, resulting in low speech intelligibility for human listeners. In such situations, obtaining the single-speaker close-talk signal requires the ability to perform dereverberation and source separation, with noise being viewed as a special source.

Despite the great success of speech separation on clean close-talk utterances, blind separation in a reverberant environment remains challenging. Most studies perform dereverberation and separation with tandem systems, each part of which is designed for a single task. The framework in Nakatani et al. (2020) integrates deep learning-based speech separation, statistical model-based dereverberation, and beamforming. Another study (Fan et al. (2020)) cascades networks to learn different targets, which obtained better performance compared with directly mapping from the reverberated mixture to the clean signal. Few studies have been conducted to extend separation algorithms to reverberant multi-speaker signals within a unified deep learning-based framework.

More recently, the time-domain audio separation network (TasNet) has provided a novel separation scheme that works on time-domain representations with a time-domain convolutional encoder and decoder (Luo & Mesgarani (2018)). The subsequent Conv-TasNet (Luo & Mesgarani (2019)) and other works (Shi et al. (2019); Bahmaninezhad et al. (2019)) have demonstrated significant separation performance that even exceeds that of the ideal T-F masks. However, its architecture and permutation invariant training limit its flexibility, i.e., it can only be deployed to separate a fixed number of speakers. On the other hand, the deep attractor network (DAN) (Luo, Chen & Mesgarani (2018)) predicts frequency-domain masks by first generating high-level speaker embeddings and then forming an attractor to extract the target speaker's sound. To the best of the authors' knowledge, the time-domain deep attractor network has not been studied.

In this study, we propose a novel time-domain deep attractor network (TD-DAN) for simultaneously performing both dereverberation and separation. The designed architecture consists of 2 parallel streams: a speaker encoding stream (SES) for speaker embedding modeling, and a speech decoding stream (SDS) for target speaker extraction and dereverberation. The SES is trained with the reconstruction loss and the concentration loss, resulting in speaker embeddings suitable for clustering. Meanwhile, the SDS serves as an inference module which first models the speaker information using an approach similar to that of the SES, and then interacts with the SES to output waveforms. The proposed scheme makes the following contributions:

• The TD-DAN is introduced with a two-stream architecture, which performs dereverberation and separation simultaneously.
• A concentration loss is employed to bridge the gap between the oracle attractor and K-Means.
• The TD-DAN is explored with different waveform encoders on the 2-speaker reverberant mixtures as well as the more challenging 3-speaker reverberant signals.
We have found that the TD-DAN can perform dereverberation and separation simultaneously, exceeding Conv-TasNet by …/… dB and …/… dB on the 2- and 3-speaker development/evaluation (Dev./Eval.) sets, respectively.

The rest of the paper is organized as follows. In Section 2, we review techniques related to the proposed method. In Section 3, we describe the proposed time-domain deep attractor network. Section 4 presents the experimental configuration, and Section 5 the results and discussion. In Section 6, we present our conclusions and future plans.
2. Related work
Previous work on far-field speech separation has focused on the following issues: dereverberation, speech separation, and unified frameworks.
Dereverberation: To address the dereverberation problem, many algorithms have been proposed, such as beamforming (Schwartz et al. (2016); Kodrasi & Doclo (2017); Nakatani & Kinoshita (2019a)) and blind inverse filtering (Schmid et al. (2012); Yoshioka & Nakatani (2012)). The weighted prediction error (WPE) method was developed under the paradigm of blind inverse filtering, and rose to prominence in the REVERB challenge (Kinoshita et al. (2016)). It minimizes the weighted prediction error by optimizing delayed linear filters to eliminate the detrimental late reverberation (Nakatani et al. (2010)). Deep neural networks (DNNs) have been used to learn a spectral mapping from reverberant signals to anechoic ones (Geetha & Gayathri (2017)). In practice, mask estimation is preferred for its better performance over direct mapping (Wang et al. (2014)). Moreover, complex ideal ratio masks have been proposed to overcome the drawback that real-valued masks cannot reconstruct the phase of the target signal (Williamson & Wang (2017)). Some researchers combine DNNs with WPE by deep learning-based variance estimation, leading to a non-iterative WPE algorithm (Heymann et al. (2019)).
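Because WPE recurs throughout the cited systems, a minimal single-channel numpy sketch of its delayed-linear-prediction core may help; the tap count, delay, iteration count, and diagonal loading below are illustrative choices, not the settings of any cited implementation:

```python
import numpy as np

def wpe_single_channel(Y, taps=10, delay=3, iterations=3, eps=1e-10):
    """Delayed linear prediction in the STFT domain (single channel).
    Y: complex STFT of shape (F, T); returns the dereverberated X."""
    F, T = Y.shape
    X = Y.copy()
    for _ in range(iterations):
        lam = np.maximum(np.abs(X) ** 2, eps)        # variance estimate of the target
        for f in range(F):
            # Stack delayed past observations: row k holds Y[f, t - delay - k].
            Ytil = np.zeros((taps, T), dtype=Y.dtype)
            for k in range(taps):
                d = delay + k
                Ytil[k, d:] = Y[f, :T - d]
            # Variance-weighted normal equations for the prediction filter g.
            Yw = Ytil / lam[f]
            R = Yw @ Ytil.conj().T + eps * np.eye(taps)
            r = Yw @ Y[f].conj()
            g = np.linalg.solve(R, r)
            # Subtract the predicted late reverberation from the observation.
            X[f] = Y[f] - g.conj() @ Ytil
    return X
```

Iterating between the variance estimate and the filter solution is what the DNN-supported variant replaces with a learned variance, yielding the non-iterative algorithm mentioned above.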
Paradigms of speech separation: Most architectures adopt one of two paradigms: permutation invariant training (PIT) and speaker clustering-based methods. PIT (Kolbaek et al. (2017)) directly optimizes the reconstruction loss over all possible permutations. The drawback of PIT is that its architecture cannot deal with a variable number of speakers. Speaker clustering methods like deep clustering (DC) (Hershey et al. (2016); Wang et al. (2018a)) are trained to generate discriminative speaker embeddings for each T-F bin, and use clustering algorithms to obtain the speaker assignment during the test phase. The DAN was developed following DC, but it directly optimizes the reconstruction of the spectrogram (Luo et al. (2018)). DC and DAN can deal with a variable number of speakers by setting the cluster number. Speech extraction differs from speech separation in its prior knowledge of the target speaker. For instance, SpeakerBeam extracts the voice of the target speaker with a pre-collected enrollment (Zmolíková et al. (2019)).
Learning objectives of speech separation: Most previous approaches have been formulated as predicting T-F masks of the mixture signal. The commonly used masks are IBMs, ideal ratio masks (IRMs), and Wiener filter-like masks (WFMs) (Wang et al. (2014)). Some approaches directly approximate the spectrogram of each source (Du et al. (2016)). Both mask estimation and spectrum prediction use the inverse short-time Fourier transform (iSTFT) of the estimated magnitude spectrogram of each source together with the original or a modified phase. Recently, TasNet introduced a novel method for separating signals from the raw waveform. It utilizes 1-D convolutional filters to perform encoding and decoding on the generated spectro-temporal representations, and a speech separation module accepts the representation and predicts each source's masks. Unlike the fixed weights of the short-time Fourier transform (STFT), TasNet learns the transform weights by optimizing scale-invariant source-to-distortion ratios (SI-SDRs) between the estimated and target source signals. Yet as TasNet can only separate a fixed number of speakers, it is less flexible than DC and DAN. Thus, how time-domain speaker clustering-based separation can best be achieved is a problem that has yet to be resolved.
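Since SI-SDR is both the training objective and the main evaluation metric in this paper, a short sketch of its computation is given below; the zero-mean normalization follows common practice and is an assumption here, not something the text specifies:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between an estimated and a reference waveform."""
    est, ref = est - est.mean(), ref - ref.mean()    # zero-mean, by convention
    # Orthogonal projection of the estimate onto the reference signal.
    s_target = (est @ ref / (ref @ ref + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))
```

The scale invariance comes from the projection step: rescaling the estimate changes both the target and noise terms identically, so the ratio is unchanged.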
Unified frameworks: Speech separation in a reverberant environment is a difficult task that addresses the dereverberation and separation problems simultaneously. Most systems adopt algorithms in tandem; for example, the framework in Nakatani et al. (2020) combines the weighted power minimization distortionless response (WPD) beamformer (Nakatani & Kinoshita (2019b)), the noisy complex Gaussian mixture model (noisyCGMM) (Ito et al. (2018)), and CNN-based PIT. A purely deep learning-based network was introduced for denoising and dereverberation by first learning noise-free deep embeddings and then performing mask-based dereverberation (Fan et al. (2020)). TasNet achieved an SI-SDR of … dB on WHAMR!, a reverberant version of WHAM!, whereas it obtained … dB on the clean WSJ0-2MIX dataset (Maciejewski et al. (2019)). A cascaded framework of separation and dereverberation improves the performance to … dB, still much lower than in the clean situation. The WHAMR! results indicate that simultaneous dereverberation and separation is a difficult problem. Moreover, the much more challenging task of separating more than 2 speakers in a reverberant environment has not been previously explored.
3. Methods
In this section, we first formulate the problem and introduce the baseline DAN and Conv-TasNet. Following the design of the speaker attractor and the time convolutional network, 2 types of two-stream TD-DANs are put forward, one with hybrid encoders and the other with fully time-domain waveform encoders. Additionally, a clustering loss is proposed to improve the performance of K-Means.
Assume that speech signals from $K$ speakers are captured by a distant microphone in a noisy reverberant environment. The captured signal is
$$y = \sum_{k=1}^{K} y^{(k)} + n = \sum_{k=1}^{K} d^{(k)} + \sum_{k=1}^{K} r^{(k)} + n, \tag{1}$$
where $n$ is the noise and $y^{(k)}$ is the reverberant source signal, decomposed into $d^{(k)}$, representing the direct sound and early reflections, and $r^{(k)}$, representing the late reverberation. For simplicity, $d^{(k)}$ is referred to as the early reflection in the following. The STFT transforms the signal from the time domain to T-F representations, reformulating Eq.(1) as
$$y_{t,f} = \sum_{k=1}^{K} y_{k,t,f} + n_{t,f} = \sum_{k=1}^{K} d_{k,t,f} + \sum_{k=1}^{K} r_{k,t,f} + n_{t,f}, \tag{2}$$
with frame number $T$, maximum frequency index $F$, frame index $t = 1, \dots, T$, and frequency index $f = 0, \dots, F$. The early reflection $d_{k,t,f}$ and the late part $r_{k,t,f}$ are generated by convolution,
$$d_{k,t,f} = \sum_{\tau=0}^{D-1} a_{k,\tau,f}\, s_{k,t-\tau,f}, \tag{3}$$
$$r_{k,t,f} = \sum_{\tau=D}^{L_a-1} a_{k,\tau,f}\, s_{k,t-\tau,f}, \tag{4}$$
where $a_{k,f} = [a_{k,0,f}, a_{k,1,f}, \dots, a_{k,L_a-1,f}]$ is the transfer function, with the late reverberation starting from frame $D$ and ending at frame $L_a$ for frequency $f$, and $s_{k,t,f}$ is the source signal of speaker $k$ in bin $(t,f)$. As indicated in Bradley et al. (2003), early reflections increase speech intelligibility scores for both impaired and non-impaired listeners. Thus, in this study, dereverberation aims to eliminate the late part $r_{k,t,f}$.

The ideal masks are defined in the frequency domain over the sources. The IRM for speech separation only is expressed as
$$m^{\mathrm{IRM(sepr)}}_{k,t,f} = \frac{|y_{k,t,f}|}{\sum_k |y_{k,t,f}| + |n_{t,f}|}, \tag{5}$$
where $|\cdot|$ is the modulus operation. In the reverberant environment, the IRM for the dereverberated source $k$ is redefined as
$$m^{\mathrm{IRM(sepr+derevb)}}_{k,t,f} = \frac{|d_{k,t,f}|}{|y_{t,f} - d_{k,t,f}| + |d_{k,t,f}|}, \tag{6}$$
where the interference signal is obtained by removing the early part $d_{k,t,f}$ of the source, i.e., it includes both the late reverberation of the target source and the other interfering signals. Similarly, the WFMs are formulated as
$$m^{\mathrm{WFM(sepr)}}_{k,t,f} = \sqrt{\frac{|y_{k,t,f}|^2}{\sum_k |y_{k,t,f}|^2 + |n_{t,f}|^2}}, \tag{7}$$
$$m^{\mathrm{WFM(sepr+derevb)}}_{k,t,f} = \sqrt{\frac{|d_{k,t,f}|^2}{|y_{t,f} - d_{k,t,f}|^2 + |d_{k,t,f}|^2}}. \tag{8}$$
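As a concrete reference point, the four masks of Eqs.(5)-(8) can be computed from STFT arrays as follows; this is an illustrative numpy sketch (array shapes and the epsilon guard are our assumptions), not the authors' code:

```python
import numpy as np

def ideal_masks(y_srcs, d_srcs, y_mix, noise, eps=1e-8):
    """Masks of Eqs.(5)-(8). All inputs are complex (T, F) STFTs:
    y_srcs/d_srcs: per-speaker reverberant images and early reflections;
    y_mix: the mixture; noise: the noise component."""
    mags = [np.abs(s) for s in y_srcs]
    den1 = sum(mags) + np.abs(noise) + eps
    den2 = sum(m ** 2 for m in mags) + np.abs(noise) ** 2 + eps
    irm_sep = [m / den1 for m in mags]                        # Eq. (5)
    wfm_sep = [np.sqrt(m ** 2 / den2) for m in mags]          # Eq. (7)
    irm_drv, wfm_drv = [], []
    for d in d_srcs:
        e, i = np.abs(d), np.abs(y_mix - d)   # target early part vs. all the rest
        irm_drv.append(e / (i + e + eps))                     # Eq. (6)
        wfm_drv.append(np.sqrt(e ** 2 / (i ** 2 + e ** 2 + eps)))  # Eq. (8)
    return irm_sep, wfm_sep, irm_drv, wfm_drv
```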
Figure 1: The architecture of DAN, where the attractor is obtained by oracle assignment and by K-Means during training and testing, respectively.

Our TD-DAN is inspired by the design of the speaker attractor and of the time convolutional network (TCN), originally proposed in DAN (Luo et al. (2018)) and Conv-TasNet (Luo & Mesgarani (2019)), respectively. We briefly introduce these 2 networks in the following.
The attractor is a speaker embedding indicating speaker information. As shown in Fig. 1, DAN accepts the log power spectrum (LPS) and generates $D$-dimensional speaker embeddings $\mathbf{a}_{t,f}$,
$$\{\mathbf{a}_{t,f}\}_{t,f} = \mathrm{DAN}(\mathrm{Enc}^{\mathrm{LPS}}_{\mathrm{DAN}}(y)), \tag{9}$$
where $\{\cdot\}_{\{\cdot\}}$ denotes the matrix form with subscripts representing the axes, and $\mathrm{Enc}^{\mathrm{LPS}}_{\mathrm{DAN}}(\cdot)$ is the LPS feature extractor. During training, the attractor vector $\mathbf{a}_k$ for speaker $k$ is obtained by averaging over the T-F bins,
$$\mathbf{a}_k = \frac{\sum_{t,f} m^{\mathrm{IBM}}_{k,t,f}\, v_{t,f}\, \mathbf{a}_{t,f}}{\sum_{t,f} m^{\mathrm{IBM}}_{k,t,f}\, v_{t,f}}, \tag{10}$$
where $v_{t,f} \in \{0, 1\}$ denotes the presence/absence of speech, calculated with a power threshold, and $m^{\mathrm{IBM}}_{k,t,f}$ is the binary speaker assignment. Here, we use the source signals to calculate $m^{\mathrm{IBM}}_{k,t,f}$,
$$m^{\mathrm{IBM}}_{k,t,f} = \begin{cases} 0, & \text{if } |d_{k,t,f}| < \sum_{q \neq k} |d_{q,t,f}|, \\ 1, & \text{if } |d_{k,t,f}| \geqslant \sum_{q \neq k} |d_{q,t,f}|, \end{cases} \tag{11}$$
since $\mathbf{a}_{t,f}$ is expected to indicate the source information and can be used to perform both separation and dereverberation.
Figure 2: The architecture of Conv-TasNet, where the separation module outputs separation masks $\{\hat{m}_{k,t,f}\}_{k,t,f}$ for a fixed number of speakers.

During the test phase, the attractors are obtained by K-Means clustering with prior knowledge of the speaker number,
$$\{\mathbf{a}_k\}_k = \mathrm{KMeans}(\{\mathbf{a}_{t,f} \mid v_{t,f} = 1\}). \tag{12}$$
The masks are estimated with a Sigmoid activation,
$$\hat{m}^{\mathrm{MRM}}_{k,t,f} = \mathrm{Sigmoid}(\mathbf{a}_k^T \mathbf{a}_{t,f}), \tag{13}$$
where $\mathbf{a}_k \in \mathbb{R}^{D \times 1}$ is the $D$-dimensional attractor of speaker $k$. The DAN is trained by minimizing the reconstruction loss for both separation and dereverberation,
$$L_{\mathrm{r}} = \sum_{k,t,f} \left(y_{t,f}\, \hat{m}^{\mathrm{MRM}}_{k,t,f} - d_{k,t,f}\right)^2. \tag{14}$$
The optimization leads to speaker embeddings in which the vectors from the same speaker get closer and those from different speakers become more discriminative. However, because $y_{t,f} \neq \sum_k d_{k,t,f}$, Eq.(14) may lead to performance degradation in clustering, which can be solved by adding an extra concentration loss (Sec. 3.4).
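The training/testing asymmetry of Eqs.(10)-(13) — oracle IBM-weighted averaging at training time versus K-Means at test time — can be sketched as follows. These are hypothetical helpers operating on flattened (T·F, D) embeddings, not the released implementation:

```python
import torch
from sklearn.cluster import KMeans

def oracle_attractors(emb, ibm, vad, eps=1e-8):
    """Eq.(10): emb (TF, D) embeddings, ibm (K, TF) binary assignment,
    vad (TF,) speech presence; returns (K, D) attractors."""
    w = ibm * vad                                             # (K, TF)
    return (w @ emb) / (w.sum(dim=1, keepdim=True) + eps)

def kmeans_attractors(emb, vad, n_spk):
    """Eq.(12): test-time attractors by clustering the active-bin embeddings."""
    active = emb[vad.bool()].detach().cpu().numpy()
    centers = KMeans(n_clusters=n_spk, n_init=10).fit(active).cluster_centers_
    return torch.as_tensor(centers, dtype=emb.dtype, device=emb.device)

def dan_masks(emb, attractors):
    """Eq.(13): sigmoid similarity between attractors and bin embeddings."""
    return torch.sigmoid(attractors @ emb.T)                  # (K, TF)
```

The concentration loss introduced later exists precisely to shrink the mismatch between what `oracle_attractors` and `kmeans_attractors` return.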
The Conv-TasNet is a fully convolutional time-domain audio separation network, composed of a 1-D convolutional encoder, a separation module, and a 1-D convolutional decoder. Multiple sequential TCN blocks with various dilation factors are stacked as the separation module. The fully convolutional architecture results in a small model size and can be deployed in a causal way. As plotted in Fig. 2, the encoder encodes the input mixture signal,
$$\{y_{t,f}\}_{t,f} = \mathrm{Enc}^{\mathrm{Free}}_{\mathrm{TasNet}}(y), \tag{15}$$
where $\mathrm{Enc}^{\mathrm{Free}}_{\mathrm{TasNet}}(\cdot)$ is the 1-D time convolutional kernel and $y_{t,f}$ is the spectro-temporal representation. We use "Free" to indicate that the kernel parameters are learnable. The TCN-based separation module is trained to predict masks,
$$\{\hat{m}^{\mathrm{TD}}_{k,t,f}\}_{k,t,f} = \mathrm{TCN}(\{y_{t,f}\}_{t,f}), \tag{16}$$
where $\hat{m}^{\mathrm{TD}}_{k,t,f}$ is the estimated mask defined on the spectro-temporal representation.
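For orientation, one TCN block of such a separation module can be sketched as below — a simplified PyTorch rendition in which the normalization layers and skip paths of the full Conv-TasNet design are omitted:

```python
import torch.nn as nn

class TCNBlock(nn.Module):
    """A dilated 1-D conv block in the Conv-TasNet style: pointwise ->
    depthwise dilated -> pointwise, with a residual connection."""
    def __init__(self, channels=128, hidden=512, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1), nn.PReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=pad,
                      dilation=dilation, groups=hidden), nn.PReLU(),
            nn.Conv1d(hidden, channels, 1),
        )

    def forward(self, x):            # x: (batch, channels, frames)
        return x + self.net(x)       # residual keeps the blocks stackable
```

Stacking such blocks with dilation factors 1, 2, 4, … widens the receptive field exponentially, which is what lets the module model long temporal context at a small model size.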
The decoder decodes the masked spectro-temporal representation and outputs the enhanced waveforms,
$$\{\hat{d}_k\}_k = \mathrm{Dec}^{\mathrm{Free}}_{\mathrm{TasNet}}(\{y_{t,f}\, \hat{m}^{\mathrm{TD}}_{k,t,f}\}_{k,t,f}), \tag{17}$$
where $\mathrm{Dec}^{\mathrm{Free}}_{\mathrm{TasNet}}$ is the 1-D time-domain kernel. The Conv-TasNet is trained to optimize the SI-SDR. Utterance-level permutation invariant training (uPIT) is deployed to address the source permutation problem in the training phase (Kolbaek et al. (2017)).

The TD-DAN is a two-stream architecture, with a speaker encoding stream (SES) for speaker embedding modeling and a speech decoding stream (SDS) for dereverberation and speaker extraction. We separate the task into these two parts and jointly train the streams with a multi-task loss. We first describe the two-stream architecture together with the hybrid waveform encoders, and then step into the fully time-domain encoders.
As plotted in Fig. 3, the SES is similar to the DAN network: it accepts the log power spectrum (LPS) via $\mathrm{Enc}^{\mathrm{LPS}}_{\mathrm{SES}}$ and calculates the masks with the speaker embeddings and attractors. The whole feed-forward procedure follows Eqs.(9)-(14).

The SDS models the input signal with a 1-D convolutional encoder and stacked TCNs, which can be viewed as a dereverberation process,
$$\{\mathbf{e}_{t,f}\}_{t,f} = \mathrm{TCN}(\mathrm{Enc}^{\mathrm{Free}}_{\mathrm{SDS}}(y)), \tag{18}$$
where $\mathbf{e}_{t,f} \in \mathbb{R}^{E \times 1}$ is the $E$-dimensional high-level representation. The SES interacts with the SDS through a linear transformation of the attractor. The SDS accepts the transformed attractor to calculate the masks and finally outputs the dereverberated and separated signal,
$$\hat{m}^{\mathrm{TD}}_{k,t,f} = \mathrm{ReLU}\big((\mathrm{Linear}(\mathbf{a}_k))^T \mathbf{e}_{t,f}\big), \tag{19}$$
$$\hat{d}_k = \mathrm{Dec}^{\mathrm{Free}}_{\mathrm{SDS}}(\{\hat{m}^{\mathrm{TD}}_{k,t,f}\, \mathbf{e}_{t,f}\}_{t,f}), \tag{20}$$
where Linear is the linear transformation connecting the SES and the SDS. The model is trained from scratch by optimizing a multi-task loss,
$$L_{\mathrm{TD\text{-}DAN}} = L_{\mathrm{SI\text{-}SDR}}(\hat{d}_k, d_k) + \alpha_r L_{\mathrm{r}}(\hat{m}^{\mathrm{MRM}}_{k,t,f}), \tag{21}$$
where $\alpha_r$ is the loss balance factor.

This TD-DAN has hybrid encoders because the SES input is encoded by the STFT transform, while the SDS input is encoded by a 1-D convolutional encoder with free kernels. Nevertheless, it is designated as a time-domain DAN since it is trained to predict the waveform directly.
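Eqs.(18)-(20) chain together as follows; this is a toy-sized PyTorch sketch in which the kernel size, stride, and dimensions are illustrative (Table 2 lists the real hyper-parameters), and a single frame-wise mask per speaker is assumed:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the paper's dimensions are given in Table 2.
D, E, T = 20, 512, 1000
link = nn.Linear(D, E)                    # the Linear of Eq.(19)
decoder = nn.ConvTranspose1d(E, 1, kernel_size=16, stride=8)  # free 1-D decoder

e = torch.randn(E, T)                     # SDS representation, Eq.(18)
a_k = torch.randn(D)                      # one speaker attractor from the SES

m = torch.relu(link(a_k) @ e)             # Eq.(19): one mask value per frame
d_hat = decoder((m * e).unsqueeze(0))     # Eq.(20): masked features -> waveform
```

The design point worth noting is that the speaker identity enters the SDS only through `link(a_k)`, so the same decoding stream serves any number of speakers by iterating over attractors.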
Figure 3: The architecture of TD-DAN, composed of the SES and the SDS. The waveform encoder of the SES can adopt the frequency-domain LPS transform, time-domain stacked STFT kernels, or free kernels.
Here, we replace the waveform encoder $\mathrm{Enc}^{\mathrm{LPS}}_{\mathrm{SES}}$ of the TD-DAN SES with time-domain convolutional kernels. The problem is then the definition of the IBMs on the spectro-temporal representations, which were originally computed from the spectrogram (Eq.(11)). In formulation, the time-domain SES encoder $\mathrm{Enc}^{\mathrm{TD}}_{\mathrm{SES}}$ encodes the mixture signal into $y_{t,f}$,
$$\{y_{t,f}\}_{t,f} = \mathrm{Enc}^{\mathrm{TD}}_{\mathrm{SES}}(y). \tag{22}$$
By setting the magnitude of the signal as $|y_{t,f}|$, its IBM is formulated similarly,
$$m^{\mathrm{IBM}}_{k,t,f} = \begin{cases} 0, & \text{if } |\mathrm{Enc}^{\mathrm{TD}}_{\mathrm{SES}}(d_k)_{t,f}| < \sum_{q \neq k} |\mathrm{Enc}^{\mathrm{TD}}_{\mathrm{SES}}(d_q)_{t,f}|, \\ 1, & \text{if } |\mathrm{Enc}^{\mathrm{TD}}_{\mathrm{SES}}(d_k)_{t,f}| \geqslant \sum_{q \neq k} |\mathrm{Enc}^{\mathrm{TD}}_{\mathrm{SES}}(d_q)_{t,f}|. \end{cases} \tag{23}$$
We introduce two time-domain kernels, the stacked time-domain STFT kernel and the free kernel:
1) The STFT convolutional encoder $\mathrm{Enc}^{\mathrm{STFT}}_{\mathrm{SES}}$ (see the sketch after this list). The STFT transform is split into the real and the imaginary parts with stacked convolutional kernels expressed as
$$K^{cos}_f[n] = w[n] \cos(2\pi n f / N), \tag{24}$$
$$K^{sin}_f[n] = w[n] \sin(2\pi n f / N), \tag{25}$$
$$\mathbf{K}_{\mathbf{STFT}} = [\mathbf{K}^{cos}_0, \dots, \mathbf{K}^{cos}_{F-1}, \mathbf{K}^{cos}_F, \mathbf{K}^{sin}_1, \dots, \mathbf{K}^{sin}_{F-1}], \tag{26}$$
where $F$ usually equals $N/2$, the columns of $\mathbf{K}_{\mathbf{STFT}}$ are 1-D convolutional kernels, $n$ is the sample index in a convolutional kernel of size $N$, $f = 0, 1, \dots, F$ is the kernel index corresponding to the frequency of the STFT, and $w$ is the pre-designed analysis window. This kernel differs from the STFT in that it stacks the real and the imaginary parts of the spectrum, so it can be computed with real-valued convolutional operations.
2) The free convolutional encoder $\mathrm{Enc}^{\mathrm{Free}}_{\mathrm{SES}}$, whose 1-D convolutional kernel $\mathbf{K}_{\mathbf{Free}}$ is trained together with the dereverberation and separation tasks.
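Item 1) can be made concrete in a few lines: a sketch, under the assumption of a Hann analysis window, that builds the stacked cos/sin kernels of Eqs.(24)-(26) into a torch Conv1d; setting `trainable=True` gives the STFT-initialized free kernel used later for fine-tuning:

```python
import numpy as np
import torch
import torch.nn as nn

def stft_conv_encoder(n_fft=512, stride=128, trainable=False):
    """A 1-D conv whose kernels stack the cos (real) and sin (imaginary)
    parts of the STFT basis, following Eqs.(24)-(26)."""
    n = np.arange(n_fft)
    w = np.hanning(n_fft)                       # analysis window (assumed Hann)
    F = n_fft // 2
    cos_k = [w * np.cos(2 * np.pi * n * f / n_fft) for f in range(F + 1)]
    sin_k = [w * np.sin(2 * np.pi * n * f / n_fft) for f in range(1, F)]
    K = torch.tensor(np.stack(cos_k + sin_k), dtype=torch.float32)
    enc = nn.Conv1d(1, K.shape[0], n_fft, stride=stride, bias=False)
    enc.weight.data.copy_(K.unsqueeze(1))       # (out_channels, 1, n_fft)
    enc.weight.requires_grad_(trainable)
    return enc
```

The sin kernels for $f = 0$ and $f = F$ vanish identically, which is why they are excluded and the total kernel count equals $N$.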
The whole procedure with fully time-domain encoders follows Fig. 3, where the attractor is obtained from the masks defined by $\mathrm{Enc}^{\mathrm{TD}}_{\mathrm{SES}}$ and is calculated by Eqs.(10)-(14), and the dereverberation and separation are conducted following Eqs.(18)-(20). The speech presence assignment $v_{t,f}$ is obtained by thresholding the magnitude of the spectro-temporal representations. The network is trained to optimize the multi-task loss (Eq.(21)).

The reconstruction loss (Eq.(14)) implies that a mask will be close to 1 if the T-F bin embeddings are close to the speaker attractor, and close to 0 otherwise. The sparsity assumption declares that the observed signal contains at most one source in each T-F bin, which ensures the clustering performance of the DAN, since most embeddings are optimized to lie close to some attractor to achieve binary-like masks.
Figure 4: The histogram of IRM values (vertical axis: percentage of bins) calculated from the clean and the reverberant multi-speaker mixtures. The clean mixture is mixed from early reflections, while the reverberant one is mixed from early and late reflections. The IRM is calculated with Eq.(6).

However, the reverberant signal may not follow the sparsity assumption. The distribution of the mixture signal with early reflections only, and with early and late reflections, is plotted in Fig. 4. Notably, approximately
…% of the T-F bins have an IRM value larger than … in the mixture of early reflections, while in the reverberant signal the percentage declines significantly to approximately …%. This occurs because the IRM of early reflections is the ratio of the target early part against the interfering ones, while for Eq.(6) it is the ratio of the target early reflection against the target late reverberation plus the interfering early and late reverberation. The lack of high-valued T-F masks indicates the difficulty of embedding clustering.

To achieve better clustering performance, we introduce a clustering loss, comprising a concentration loss and a discrimination loss. The concentration loss is designed for all DAN-based models and is expressed as
$$L_c = \sum_{k,t,f} \left\| \mathbf{a}_k - m^{\mathrm{IBM}}_{k,t,f}\, v_{t,f}\, \mathbf{a}_{t,f} \right\|^2. \tag{27}$$
Its gradient is
$$\frac{\partial L_c}{\partial \mathbf{a}_{t,f}} = \begin{cases} -2\left(1 - \frac{1}{\sum_{t,f} m^{\mathrm{IBM}}_{k,t,f} v_{t,f}}\right)(\mathbf{a}_k - \mathbf{a}_{t,f}), & \text{if } m^{\mathrm{IBM}}_{k,t,f}\, v_{t,f} = 1, \\ 0, & \text{if } m^{\mathrm{IBM}}_{k,t,f}\, v_{t,f} = 0, \end{cases} \tag{28}$$
which pushes embedding $\mathbf{a}_{t,f}$ toward attractor $\mathbf{a}_k$ when the bin is dominated by speaker $k$. The within-class concentration loss may conflict with Eq.(14) to some degree, i.e., it pushes the embeddings to concentrate around the attractors, which may lead to a suboptimal reconstruction loss. In practice, however, the joint optimization of the reconstruction and concentration losses leads to better performance, as will be illustrated in Sec. 5.

Another, inter-class discrimination loss maximizes the distance among the different attractors,
$$L_d = \sum_{k \neq q} \left\| \mathbf{a}_k - \mathbf{a}_q \right\|^2. \tag{29}$$
Table 1
The detailed numbers of the training, development (Dev.) and evaluation (Eval.) sets.

Dataset | Utterances
Train | 33561
Dev. | 982
Eval. | 1332

In fact, Eq.(14) already includes the optimization of discrimination, whereby the attractor distance is enlarged when $\hat{m}^{\mathrm{MRM}}_{k,t,f}$ is close to 1. The discrimination loss here is designed for the free convolutional kernel $\mathbf{K}_{\mathbf{Free}}$ in the SES, to avoid the degenerate point at which a small
$|\mathrm{Enc}^{\mathrm{Free}}_{\mathrm{SES}}(y)|$ results in a small $L_r$ as well as ambiguous attractors for different speakers. The training loss is updated to
$$L_{\mathrm{TD\text{-}DAN}} = L_{\mathrm{SI\text{-}SDR}} + \alpha_r L_{\mathrm{r}} + \alpha_c L_c + \alpha_d L_d, \tag{30}$$
where $\alpha_c$ and $\alpha_d$ are the factors for the concentration and discrimination losses.
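The two clustering losses translate into a few tensor operations; the sketch below restricts Eq.(27) to active bins and treats Eq.(29) as a quantity to be maximized, both of which are our readings of the text rather than the released code:

```python
import torch

def concentration_loss(emb, attractors, ibm, vad):
    """Eq.(27) over active bins: pull each speech-dominated bin embedding
    toward its speaker's attractor. emb: (TF, D), attractors: (K, D),
    ibm: (K, TF) binary, vad: (TF,) binary."""
    w = (ibm * vad).unsqueeze(-1)                        # (K, TF, 1)
    diff = attractors.unsqueeze(1) - emb.unsqueeze(0)    # (K, TF, D)
    return (w * diff.pow(2)).sum()

def discrimination_loss(attractors):
    """Eq.(29): summed squared distance among different attractors. It is to
    be maximized, so alpha_d would enter Eq.(30) with a negative sign
    (our reading)."""
    diff = attractors.unsqueeze(0) - attractors.unsqueeze(1)   # (K, K, D)
    return diff.pow(2).sum(-1).triu(1).sum() * 2               # all k != q pairs
```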
4. Experimental configuration
The experiments were conducted on the Spatialized Multi-Speaker Wall Street Journal (SMS-WSJ) corpus (https://github.com/fgnt/sms_wsj), which artificially spatializes and mixes utterances taken from WSJ (Drude et al. (2019)). It differs from the spatialized version of WSJ0-2MIX (Wang et al. (2018b)) in that it considers all WSJ0+1 utterances, and strictly separates the speakers of the training, validation and test sets. The room impulse responses were randomly sampled with different room sizes, array centers, array rotations, and source positions. The sound decay time (T60) was sampled uniformly from 200 ms to 500 ms. The simulated 6-channel audio contained early reflections (< 50 ms), late reverberation (> 50 ms), and white noise. The detailed numbers of the dataset are listed in Table 1. Meanwhile, we also simulated a 3-speaker dataset as a more challenging task, which used the same RIRs and utterance split as the SMS-WSJ dataset.

In our experiments, we only used the first channel of the multi-channel signal. The networks were trained to map the reverberant multi-speaker signal to early reflections. As demonstrated in Drude et al. (2019), the early reflections are close to the source signal in the measurement of speech intelligibility.
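The early/late split that defines the training target can be reproduced with a small helper; this is a hedged numpy/scipy sketch in which counting the 50 ms boundary from the direct-path peak and the 8 kHz rate are our assumptions about the SMS-WSJ convention:

```python
import numpy as np
from scipy.signal import fftconvolve

def split_rir(rir, sr=8000, boundary_ms=50):
    """Split an RIR into the parts generating d_k and r_k (Eqs.(3)-(4))."""
    onset = int(np.argmax(np.abs(rir)))           # direct-path arrival
    cut = onset + int(boundary_ms * sr / 1000)
    early, late = rir.copy(), rir.copy()
    early[cut:] = 0.0
    late[:cut] = 0.0
    return early, late

def reverberate(src, rir, sr=8000):
    """Early reflection d_k and late reverberation r_k of a single source."""
    early, late = split_rir(rir, sr)
    d = fftconvolve(src, early)[: len(src)]
    r = fftconvolve(src, late)[: len(src)]
    return d, r
```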
Table 2
The model architectures, with the TCN hyper-parameters B/H/P/X/R, the embedding dimensions of a_{t,f}/e_{t,f} in the SES/SDS, and the loss factors α_r and α_c.

Hyper-param. | DAN/SES | Conv-TasNet/SDS
B | 128 | 128
H | 256 | 512
P | … | …
X | … | …
R | … | …
a/e | 20 | 20
α_r | … | –
α_c | … | –

The experiments were conducted with Asteroid (https://github.com/mpariente/asteroid), an audio source separation toolkit based on PyTorch (Paszke et al. (2017)). The baseline Conv-TasNet and DAN followed the settings in Luo et al. (2018) and Luo & Mesgarani (2019), respectively. We changed the DAN architecture from a 4-layer bi-directional long short-term memory (BLSTM) network to TCN blocks, which led to smaller models and allowed for a fair comparison among the different frameworks.

The two-stream TD-DAN is composed of the SES and the SDS, which adopt the architectures of the baseline DAN and Conv-TasNet, respectively. Following the hyper-parameter notation of Luo & Mesgarani (2019), we list the architectures in Table 2. The SES of Sec. 3.3 accepts LPS features computed using the STFT with a window size of … ms and a stride of … ms. The SES of Sec. 3.3.2 utilized convolutional kernels with the same settings as the LPS, and was then fine-tuned with a window size of … ms and a stride of … ms. The power threshold was set to keep the top …% and the top …% of the mixture spectrogram bins for the LPS encoder and the time-domain encoders, respectively. The loss factor α_d = 0.… was only applied for the SES with free kernels, which was trained by fine-tuning the network with stacked STFT kernels for … epoch(s).

We used Adam (Kingma & Ba (2015)), with the learning rate starting from 10^{-3} and then halved if no better validation model was found within … epochs. The maximum number of epochs was set to …. The TD-DANs were trained on …-second segments with a batch size of … on … GPUs.
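This optimization recipe maps directly onto standard PyTorch utilities; the sketch below is minimal, and the stand-in model, patience, and epoch budget are placeholders rather than the exact values used in the paper:

```python
import torch
import torch.nn as nn

model = nn.Conv1d(1, 128, 16, stride=8)      # stand-in for any network above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate when no better validation model appears;
# the patience value here is a placeholder.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

for epoch in range(100):                     # epoch budget is a placeholder
    # ... one training pass over fixed-length segments goes here ...
    val_loss = float(torch.rand(1))          # placeholder for the real metric
    scheduler.step(val_loss)
```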
5. Results and discussion
In this section, we first present the results of the baselines and of the different TD-DANs. We then describe the experiment results on the dataset with variable numbers of speakers. Finally, the performance is discussed for further understanding and improvement.
The 1st part of Table 3 displays the results of our baseline models, Conv-TasNet and DAN. The Conv-TasNet achieved SI-SDRs of …/… dB, …/… dB better than those of the DAN without the concentration loss on the Dev./Eval. set. It is clear that the concentration loss largely bridged the gap between the oracle attractor and the attractors obtained by K-Means, which was reduced from …/… dB to …/… dB.
Figure 5: Examples of TD-DAN enhancement under the 2- and 3-speaker conditions. The estimation achieved both dereverberation and separation, but under the 3-speaker condition it did not fully remove the interfering speakers.
The TD-DAN was designed following the architectures of DAN and TasNet: the SES corresponds to the DAN while the SDS follows the Conv-TasNet, allowing a fair comparison with the baselines. As listed in the 2nd part of Table 3, the SES with the LPS encoder combined with the TasNet-based SDS gave SI-SDRs of …/… dB, exceeding the SI-SDRs of Conv-TasNet by around …/… dB on the Dev./Eval. set. The concentration loss again showed its efficiency, yielding an improvement of …/… dB. The results demonstrate the feasibility of the TD-DAN architecture, which solves the problem of dereverberation and separation with two parallel streams. Unlike tandem systems, the whole procedure requires no extra information such as anechoic multi-speaker signals.

As listed in the 3rd part of Table 3, the time-domain encoder of the SES was explored with the fixed STFT kernel and the free kernel. The fixed STFT kernel exhibited performance degradation with the same window settings as the LPS. After the window settings were adjusted, the SES encoder was able to achieve SI-SDRs of …/… dB. By setting the STFT kernel as trainable network parameters, the TD-DAN with the free waveform encoder achieved higher SI-SDRs of …/… dB, …/… dB higher than those of the TD-DANs with the LPS encoders.

The merit of the DAN is that it can deal with mixture signals with variable numbers of speakers. To validate this feature, we further trained the TD-DAN on both datasets, i.e., with both 2 and 3 speakers. The experiment results are listed in Table 4.
Table 3
The SI-SDR results of the baselines and the different TD-DANs on the Dev. and Eval. sets. Entries are K-Means/Oracle-attractor SI-SDRs. The waveform encoders marked with † used a window size of … ms and a stride of … ms; the others adopted a window size of … ms and a stride of … ms.

Model | Architecture | Waveform encoder | Concentration loss | Dev. | Eval.
TasNet | 3-block TCN | Free | – | … | …
DAN | 2-block TCN | LPS | ✗ | …/… | …/…
DAN | 2-block TCN | LPS | ✓ | …/… | …/…
TD-DAN | DAN+TasNet | LPS | ✗ | …/… | …/…
TD-DAN | DAN+TasNet | LPS | ✓ | …/… | …/…
TD-DAN | DAN+TasNet | LPS† | ✓ | …/… | …/…
TD-DAN | DAN+TasNet | Stacked STFT | ✗ | …/… | …/…
TD-DAN | DAN+TasNet | Stacked STFT | ✓ | …/… | …/…
TD-DAN | DAN+TasNet | Stacked STFT† | ✓ | …/… | …/…
TD-DAN | DAN+TasNet | Free† | ✓ | …/… | …/…

Table 4
Performance measurement of SI-SDR (dB)/STOI/PESQ under the 2- and 3-speaker conditions. The best settings are chosen according to Table 3.

Model | Speaker | Waveform encoder | 2 spk, Dev. | 2 spk, Eval. | 3 spk, Dev. | 3 spk, Eval.
TD-DAN | … | Free | …/…/… | …/…/… | …/…/… | …/…/…
TD-DAN | … | LPS | …/…/… | …/…/… | …/…/… | …/…/…
TD-DAN | … | LPS | …/…/… | …/…/… | …/…/… | …/…/…
TD-DAN | … | Stacked STFT | …/…/… | …/…/… | …/…/… | …/…/…
TD-DAN | … | Stacked STFT | …/…/… | …/…/… | …/…/… | …/…/…
TD-DAN | … | Free | …/…/… | …/…/… | …/…/… | …/…/…
Mixture | – | – | …/…/… | …/…/… | …/…/… | …/…/…
IRM (Eq.(5)) | – | – | …/…/… | …/…/… | …/…/… | …/…/…
IRM (Eq.(6)) | – | – | …/…/… | …/…/… | …/…/… | …/…/…
WFM (Eq.(7)) | – | – | …/…/… | …/…/… | …/…/… | …/…/…
WFM (Eq.(8)) | – | – | …/…/… | …/…/… | …/…/… | …/…/…

The TD-DAN trained on the 2-speaker dataset was able to separate 3-speaker mixture signals, with SI-SDR gains of …/… dB and …/… dB over the mixture for the LPS and the STFT waveform encoders on the Dev./Eval. set, respectively. After fine-tuning on the concatenated dataset, the TD-DAN achieved SI-SDRs of …/… dB and …/… dB with the LPS and the STFT waveform encoders, respectively. The TD-DAN with the free waveform encoder was also tested but brought no further improvement. On both the 2- and 3-speaker datasets, the TD-DAN employing the SES with the STFT waveform encoder achieved the best performance in all signal measurement scores.

The performance of the ideal masks is listed in the 2nd part of Table 4. It was observed that IRM (Eq.(6)) outperformed the other masks in SI-SDR, while WFM (Eq.(8)) achieved the best STOI. It is notable that the proposed TD-DAN exceeded IRM (Eq.(5)) and WFM (Eq.(7)) on the 2-speaker dataset. The SI-SDR gap of …/… dB between IRM (Eq.(6)) and the TD-DAN indicates that performing multi-speaker separation and dereverberation remains a difficult task even for time-domain techniques.

Table 5
Performance analysis (SI-SDR (dB)/STOI/PESQ) of the TD-DAN with the STFT waveform encoder under different reverberation settings.

T60 range | 2 speakers | 3 speakers
(200 ms, 300 ms] | 11.…/…/… | 5.…/…/…
(300 ms, 400 ms] | 8.…/…/… | 3.…/…/…
(400 ms, 500 ms] | 7.…/…/… | 2.…/…/…

Fig. 5 plots example spectrograms of the mixture and of the enhanced signal. The proposed TD-DAN separated and dereverberated the reverberant mixture signal, bringing it closer to the early reflections. Table 5 presents the performance under different reverberation times. With larger T60 values and more speakers, the performance of the TD-DAN became worse, implying that reverberation makes the separation task much more difficult.
6. Conclusion
In this paper, we explored the TD-DAN framework for speech separation tasks in a reverberant environment with different waveform encoders, including the LPS encoder, the stacked time-domain STFT kernels, and the free convolutional kernels. The experiment results implied that the TD-DAN with the fixed STFT encoder achieved the best performance, surpassing the baseline TasNet by SI-SDRs of …/… dB and …/… dB on the 2- and 3-speaker Dev./Eval. datasets, respectively. In future work, we anticipate further exploring the free waveform encoder. Moreover, multi-channel information is expected to be utilized for better dereverberation and separation.

Acknowledgment
This work is partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC08010300) and the National Natural Science Foundation of China (Nos. 11590772, 11590774, 11590770, 11774380).
References
Bahmaninezhad, F., Wu, J., Gu, R., Zhang, S.-X., Xu, Y., Yu, M., & Yu, D. (2019). A comprehensive study of speech separation: spectrogram vs waveform separation. In INTERSPEECH.
Bradley, J. S., Sato, H., & Picard, M. (2003). On the importance of early reflections for speech in rooms. The Journal of the Acoustical Society of America, 113(6), 3233–3244.
Drude, L., Heitkaemper, J., Böddeker, C., & Haeb-Umbach, R. (2019). SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition. ArXiv, abs/1910.13934.
Du, J., Tu, Y., Dai, L.-R., & Lee, C.-H. (2016). A regression approach to single-channel speech separation via high-resolution deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24, 1424–1437.
Fan, C., Tao, J., Liu, B., Yi, J., & Wen, Z. (2020). Simultaneous denoising and dereverberation using deep embedding features. ArXiv, abs/2004.02420.
Geetha, K., & Gayathri (2017). Learning spectral mapping for speech dereverberation and denoising.
Hershey, J. R., Chen, Z., Roux, J. L., & Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. (pp. 31–35).
Heymann, J., Drude, L., Haeb-Umbach, R., Kinoshita, K., & Nakatani, T. (2019). Joint optimization of neural network-based WPE dereverberation and acoustic model for robust online ASR. In ICASSP 2019 (pp. 6655–6659).
Ito, N., Schymura, C., Araki, S., & Nakatani, T. (2018). Noisy cGMM: Complex Gaussian mixture model with non-sparse noise model for joint source separation and denoising. (pp. 1662–1666).
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Kinoshita, K., Delcroix, M., Gannot, S., Habets, E. A. P., Haeb-Umbach, R., Kellermann, W., Leutnant, V., Maas, R., Nakatani, T., Raj, B., Sehr, A., & Yoshioka, T. (2016). A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP Journal on Advances in Signal Processing, 1–19.
Kodrasi, I., & Doclo, S. (2017). EVD-based multi-channel dereverberation of a moving speaker using different RETF estimation methods. (pp. 116–120).
Kolbaek, M., Yu, D., Tan, Z.-H., & Jensen, J. (2017). Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25, 1901–1913.
Luo, Y., Chen, Z., & Mesgarani, N. (2018). Speaker-independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26, 787–796.
Luo, Y., & Mesgarani, N. (2018). TasNet: Surpassing ideal time-frequency masking for speech separation.
Luo, Y., & Mesgarani, N. (2019). Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 1256–1266.
Maciejewski, M., Wichern, G., McQuinn, E., & Roux, J. L. (2019). WHAMR!: Noisy and reverberant single-channel speech separation. ArXiv, abs/1910.10279.
Nakatani, T., & Kinoshita, K. (2019a). Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation. (pp. 1–5).
Nakatani, T., & Kinoshita, K. (2019b). A unified convolutional beamformer for simultaneous denoising and dereverberation. IEEE Signal Processing Letters, 26, 903–907.
Nakatani, T., Takahashi, R., Ochiai, T., Kinoshita, K., Ikeshita, R., Delcroix, M., & Araki, S. (2020). DNN-supported mask-based convolutional beamforming for simultaneous denoising, dereverberation, and source separation. In ICASSP 2020.
Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., & Juang, B.-H. (2010). Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Transactions on Audio, Speech, and Language Processing, 18, 1717–1731.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., Devito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch.
Schmid, D., Malik, S., & Enzner, G. (2012). An expectation-maximization algorithm for multichannel adaptive speech dereverberation in the frequency-domain. (pp. 17–20).
Schwartz, O., Gannot, S., & Habets, E. A. P. (2016). Joint maximum likelihood estimation of late reverberant and speech power spectral density in noisy environments. (pp. 151–155).
Shi, Z., Lin, H., Liu, L., Liu, R., Hayakawa, S., & Han, J. (2019). FurcaX: End-to-end monaural speech separation based on deep gated (de)convolutional neural networks with adversarial example training. In ICASSP 2019 (pp. 6985–6989).
Wang, Y., Narayanan, A., & Wang, D. (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1849–1858.
Wang, Z.-Q., Roux, J. L., & Hershey, J. R. (2018a). Alternative objective functions for deep clustering. (pp. 686–690).
Wang, Z.-Q., Roux, J. L., & Hershey, J. R. (2018b). Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation. (pp. 1–5).
Williamson, D. S., & Wang, D. (2017). Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25, 1492–1501.
Yoshioka, T., & Nakatani, T. (2012). Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening. IEEE Transactions on Audio, Speech, and Language Processing, 20, 2707–2720.
Zhang, X.-L., & Wang, D. (2016). A deep ensemble learning method for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24, 967–977.
Preprint submitted to Elsevier
Page 9 of 10ime-domain deep attractor network
Zmolíková, K., Delcroix, M., Kinoshita, K., Ochiai, T., Nakatani, T., Burget, L., & Černocký, J. (2019). SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures. IEEE Journal of Selected Topics in Signal Processing, 13, 800–814.