Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output
Hangting Chen and Pengyuan Zhang, Member, IEEE
Abstract—The time-domain audio separation network (TasNet) has achieved remarkable performance in blind source separation (BSS). The classic multi-channel speech processing framework employs signal estimation and beamforming. For example, Beam-TasNet links a multi-channel convolutional TasNet (MC-Conv-TasNet) with minimum variance distortionless response (MVDR) beamforming, which leverages the strong modelling ability of the data-driven MC-Conv-TasNet and boosts the performance of beamforming with an accurate estimation of speech statistics. Such integration can be viewed as a directed acyclic graph that accepts multi-channel input and generates multi-source output. In this letter, we design a "multi-channel input, multi-channel multi-source output" (MIMMO) speech separation system entitled "Beam-Guided TasNet", in which MC-Conv-TasNet and MVDR can interact and promote each other more compactly under a directed cyclic flow. Specifically, the first stage uses Beam-TasNet to generate estimated single-speaker signals, which favours the separation in the second stage. The proposed framework facilitates iterative signal refinement with the guide of beamforming and seeks to reach the upper bound of the MVDR-based methods. Experimental results on the spatialized WSJ0-2MIX demonstrate that the Beam-Guided TasNet achieved an SDR of 20.7 dB, which exceeded the baseline Beam-TasNet by 4.2 dB at the same model size and narrowed the gap with the oracle signal-based MVDR to 2.9 dB.
Index Terms—Speech separation, multi-channel speech processing, MVDR, time-domain network
I. INTRODUCTION

Speech separation has achieved remarkable advances since the introduction of deep learning. When a speech signal is captured by a microphone array, spatial information can be leveraged to separate sources from different directions. A conventional framework consists of mask estimation, beamforming, and optional post-filtering for "multi-channel input, multi-source output" [1], [2]. The minimum variance distortionless response (MVDR) beamformer requires estimation of the spatial correlation matrices (SCMs), typically computed from the estimated speech and noise masks. Following the considerable separation performance achieved by the time-domain audio separation network (TasNet) [3], the recently proposed Beam-TasNet [4] uses the estimated time-domain signals to compute the SCMs, and has outperformed the MVDR based on oracle frequency-domain masks.
This work is partially supported by the National Natural Science Foundation of China (Nos. 62071461, 11774380). Hangting Chen and Pengyuan Zhang are with the Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China (email: [email protected], [email protected]). Pengyuan Zhang is the corresponding author.
Besides beamforming, purely deep learning-based methods have been developed to capture spatial information. The inter-channel phase differences (IPDs) are widely used spatial features [5], [6], [7]. A learnable kernel was proposed to derive data-driven features [8]. End-to-end learning paradigms have also been proposed to perform the separation task entirely in the time domain [8], [9], [10]. In parallel with blind source separation (BSS), speech separation assisted by directional information estimates the signals of the target speaker with an additional angle feature [11], [12], [13], [14], [15].

In this letter, we adopt "multi-channel input, multi-channel multi-source output" (MIMMO) for the first time to design a multi-channel separation framework entitled "Beam-Guided TasNet", which shows the promising potential of learning data-driven models guided by beamforming. Specifically, the framework utilizes two sequential Beam-TasNets for 2-stage processing. The first stage uses a multi-channel convolutional TasNet (MC-Conv-TasNet) and MVDR beamforming to perform BSS. In the second stage, an MC-Conv-TasNet guided by the MVDR-beamformed signals refines the separated signals iteratively. Experiments on the spatialized WSJ0-2MIX [5] exhibited significant performance improvement compared with the baseline Beam-TasNet. The contributions are as follows:
• The proposed framework guides signal refinement in the second stage by integrating the strength of MVDR into the MC-Conv-TasNet. Experiments showed that the 2-stage framework surpassed the baseline Beam-TasNet by a signal-to-distortion ratio (SDR) of 4.2 dB.
• The directed cyclic flow of multi-channel signals promotes the MC-Conv-TasNet and the MVDR iteratively and seeks to reach the upper bound of the MVDR-based methods, obtaining an SDR of 20.7 dB and narrowing the gap between the estimated and oracle signal-based MVDR to 2.9 dB.
• A causal Beam-Guided TasNet is explored for online processing, illustrating that the Beam-Guided TasNet is effective even when utterance-level information is unreachable. The performance degradation caused by causality was alleviated, with SDRs improved from . dB to . dB by replacing the Beam-TasNet with the Beam-Guided TasNet.

In the rest of the letter, we first describe the proposed Beam-Guided TasNet. The experimental setup and results are presented in Sections III and IV. Section V concludes this work.

II. THE PROPOSED BEAM-GUIDED TASNET

A. Beam-TasNet
Fig. 1. (a) Beam-Guided TasNet with a 2-stage framework for iterative refinement. (b) The signal processing routine in the Beam-TasNet, the first- and the second-stage model. The dashed lines are the additional input for the second-stage model.

Suppose that speech signals from $S$ sources are captured by $C$ microphones; the mixture observed on microphone $c$ is

$y_c = \sum_{s=1}^{S} x_{s,c}$.   (1)

The Beam-TasNet integrates the time-domain network and the beamforming to estimate the signal image $x_{s,c}$ on microphone $c$ from source $s$ given the mixture $y_c$. As plotted in Fig. 1(b), the baseline Beam-TasNet is mainly composed of an MC-Conv-TasNet [8], a permutation solver, and an MVDR beamformer. Given a multi-channel input $\{y_c\}_{c=1,...,C}$, denoting the collection of $y_c$ along channels, the MC-Conv-TasNet generates $\hat{x}_{s,c}$, the estimated image of source $s$ on channel $c$. The MC-Conv-TasNet utilizes a parallel encoder ParEnc to encode the input multi-channel signal into a temporal-spectral representation $R_c$:

$R_c = \mathrm{ParEnc}(\{y_c\}_{c=1,...,C}, c)$,   (2)

a separator to estimate the temporal-spectral masks:

$\{\hat{M}_{s,c}\}_{s=1,...,S} = \mathrm{Separator}(R_c)$,   (3)

and a decoder to recover the single-speaker waveforms:

$\hat{z}_{s,c} = \mathrm{Dec}(\hat{M}_{s,c} \odot R_c)$,   (4)

where $\odot$ is the Hadamard product and $c$ indicates the reference channel, which can be determined by the order of the input. In the testing phase, we obtain $\{\hat{z}_{s,c}\}_{s=1,...,S,\,c=1,...,C}$ by setting each channel as the reference channel with a different input order. The permutation solver determines the source order by comparing the similarity of each channel's output with the output of the first channel.¹ The MVDR beamformer accepts the reordered estimates and calculates the SCMs for each source,

$\hat{\Phi}^{\mathrm{Target}_s}_{f} = \frac{1}{T}\sum_{t=1}^{T} \hat{Z}_{s,t,f}\hat{Z}^{H}_{s,t,f}$,   (5)

$\hat{\Phi}^{\mathrm{Interfer}_s}_{f} = \frac{1}{T}\sum_{t=1}^{T} (Y_{t,f}-\hat{Z}_{s,t,f})(Y_{t,f}-\hat{Z}_{s,t,f})^{H}$,   (6)

where $\hat{\Phi}^{\mathrm{Target}_s/\mathrm{Interfer}_s}_{f}$ denotes the speech/interference SCM for source $s$, $Y$ and $\hat{Z}$ denote the short-time Fourier transform (STFT) spectra of $\{y_c\}_{c=1,...,C}$ and $\{\hat{z}_{s,c}\}_{c=1,...,C}$, and $\cdot^{H}$ denotes the Hermitian transpose. The signal enhanced by the MVDR beamforming is calculated by

$\hat{x}_{s,c} = \mathrm{MVDR}(\hat{\Phi}^{\mathrm{Target}_s}_{f}, \hat{\Phi}^{\mathrm{Interfer}_s}_{f}, c)^{H}\, Y_{t,f}$,   (7)

where the reference channel $c$ can be indicated by a one-hot vector [16].

In summary, the Beam-TasNet uses the MC-Conv-TasNet to estimate the multi-channel image signals $\{\hat{z}_{s,c}\}_{s=1,...,S,\,c=1,...,C}$, with each channel served as the reference channel, and then performs beamforming on the reference channel $c$, which can be formulated as

$\hat{x}_{s,c} = \text{Beam-TasNet}(\{y_c\}_{c=1,...,C}, c)$.   (8)

¹ Since the permutation solver keeps the estimated sources from different channels in the same order, other channels can also be chosen as the permutation reference.
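To make Eqs. (5)-(7) concrete, the following minimal NumPy sketch computes the per-source SCMs from the TasNet-estimated images and applies a Souden-style MVDR filter with a one-hot reference vector [16]. The function name, the array layout, and the diagonal-loading term `eps` are our own assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def mvdr_from_estimates(Y, Z_hat, ref_channel=0, eps=1e-8):
    """Compute per-source SCMs (Eqs. 5-6) from estimated source images
    and apply an MVDR beamformer (Eq. 7).

    Y:     mixture STFT, shape (C, T, F), complex
    Z_hat: estimated source images, shape (S, C, T, F), complex
    Returns beamformed STFTs, shape (S, T, F).
    """
    S, C, T, F = Z_hat.shape
    out = np.zeros((S, T, F), dtype=complex)
    u = np.zeros(C)
    u[ref_channel] = 1.0                             # one-hot reference vector
    for s in range(S):
        Zs = Z_hat[s]                                # (C, T, F) target estimate
        Ns = Y - Zs                                  # interference = mixture - target
        for f in range(F):
            phi_t = Zs[:, :, f] @ Zs[:, :, f].conj().T / T   # Eq. (5): target SCM
            phi_i = Ns[:, :, f] @ Ns[:, :, f].conj().T / T   # Eq. (6): interference SCM
            # Souden-style filter: w = (phi_i^{-1} phi_t / tr(phi_i^{-1} phi_t)) u
            num = np.linalg.solve(phi_i + eps * np.eye(C), phi_t)
            w = num @ u / (np.trace(num) + eps)
            out[s, :, f] = w.conj() @ Y[:, :, f]             # Eq. (7): apply filter
    return out
```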
B. Beam-Guided TasNet

As plotted in Fig. 1(a), the first stage of the Beam-Guided TasNet employs the original Beam-TasNet, which performs BSS with MVDR beamforming. In the second stage, the network performs source separation additionally guided by the beamformed signals. The parallel encoder of the MC-Conv-TasNet in the second stage accepts $(C + S)$ channels, including the $C$-channel mixture and the $S$-speaker beamformed signals from channel $c$.

As shown in Fig. 1(b), we first feed the mixture signal $y_c$ through the MC-Conv-TasNet and perform beamforming to obtain the enhanced single-speaker signals $\hat{x}^{(1:1)}_{s,c}$,

$\hat{x}^{(1:1)}_{s,c} = \text{Beam-TasNet}^{(1)}(\{y_c\}_{c=1,...,C}, c)$,   (9)

where the superscript $(1\!:\!1)$ indicates that the signal is generated by the first iteration and the first stage. The second stage then uses a second Beam-TasNet that accepts $\hat{x}^{(1:1)}_{s,c}$ and $y_c$ and generates $\hat{x}^{(1:2)}_{s,c}$,

$\hat{x}^{(1:2)}_{s,c} = \text{Beam-TasNet}^{(2)}(\{y_c\}_{c=1,...,C}, \{\hat{x}^{(1:1)}_{s,c}\}_{s=1,...,S}, c)$.   (10)

In this way, the second Beam-TasNet integrates the strength of the MVDR beamforming into the data-driven model. Different from target speaker extraction [17] and neural spatial filtering [11], [15], we deduce the source information from the enhanced signal calculated by the MVDR beamforming.

The framework leads to a directed cyclic flow of multi-channel signals with iterative refinement implemented on the second stage (Fig. 1(a)). MIMMO is achieved by separately setting each channel as the reference channel in both the MC-Conv-TasNet and the MVDR beamforming. In this way, the second stage can iteratively accept $\hat{x}^{(n-1:2)}_{s,c}$ and generate $\hat{x}^{(n:2)}_{s,c}$,

$\hat{x}^{(n:2)}_{s,c} = \text{Beam-TasNet}^{(2)}(\{y_c\}_{c=1,...,C}, \{\hat{x}^{(n-1:2)}_{s,c}\}_{s=1,...,S}, c)$,   (11)

where $n = 2, 3, ...$ denotes the iteration number. The whole procedure can be viewed as an "EM-like algorithm", i.e., the "E-step" uses MVDR beamforming to estimate the distortionless signals with the given SCMs, $\mathrm{MVDR}(\hat{x}^{(n:2)}_{s,c} \mid y_c, \hat{\Phi}^{(n-1)})$, and the "M-step" uses the MC-Conv-TasNet to find an optimal set of SCMs with the given distortionless signals, $\text{MC-Conv-TasNet}(\hat{\Phi}^{(n:2)} \mid y_c, \hat{x}^{(n:2)}_{s,c})$.
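As a rough illustration of the directed cyclic flow in Eqs. (9)-(11), the sketch below assumes two hypothetical callables, `stage1` and `stage2`, wrapping the first- and second-stage Beam-TasNet (MC-Conv-TasNet plus MVDR, applied with every channel serving as the reference channel); only the iteration logic is shown.

```python
def beam_guided_separation(y, stage1, stage2, n_iters=3):
    """Iterative refinement with two Beam-TasNets (Eqs. 9-11).

    y:       multi-channel mixture, shape (C, samples)
    stage1:  callable for Beam-TasNet^(1), y -> (S, C, samples)
    stage2:  callable for Beam-TasNet^(2), (y, x_hat) -> (S, C, samples)
    Both callables are assumed wrappers that run MC-Conv-TasNet + MVDR
    with each channel taken as the reference channel in turn (MIMMO).
    """
    x_hat = stage1(y)                   # Eq. (9): first stage, plain Beam-TasNet
    for _ in range(1, n_iters):         # Eqs. (10)-(11): second stage reused iteratively
        x_hat = stage2(y, x_hat)        # guided by the previous beamformed signals
    return x_hat                        # (S, C, samples) multi-channel multi-source output
```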
C. The causal variant

A causal Beam-Guided TasNet is explored with a causal MC-Conv-TasNet and a frame-by-frame updated MVDR. We use channel-wise layer normalization to replace global layer normalization in the causal MC-Conv-TasNet [3], [18], as sketched below. The permutation solver and the MVDR are updated frame by frame; the formulas can be found in Appendix A.
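A minimal PyTorch sketch of this causal-safe normalization is given below: statistics are computed per frame over the feature dimension only, so no future context is required. The function name and parameter shapes are illustrative assumptions, not the authors' code.

```python
import torch

def channelwise_layer_norm(x, gamma, beta, eps=1e-8):
    """Channel-wise layer norm usable in a causal Conv-TasNet.

    x:     (batch, features, frames)
    gamma, beta: learnable parameters of shape (1, features, 1)
    """
    mean = x.mean(dim=1, keepdim=True)                    # per-frame mean over features
    var = x.var(dim=1, keepdim=True, unbiased=False)      # per-frame variance over features
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta
```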
III. EXPERIMENTAL SETUP
This section describes our experimental setup, including the dataset, the settings of the proposed method, and the evaluation metrics.

We evaluate the proposed framework on the spatialized version of the WSJ0-2MIX corpus [5]. The reverberant mixtures were generated by convolving room impulse responses (RIRs) with the clean single-speaker utterances. The RIRs were randomly sampled with a sound decay time (T60) from 0.2 s to 0.6 s. The signal-to-interference ratio was sampled from − dB to +5 dB. The dataset contains 20,000 (~30 h), 5,000 (~10 h), and 3,000 (~5 h) multi-channel two-speaker mixtures in the training, development, and evaluation sets, respectively. The training and development sets were generated with a sample rate of 8 kHz in "min" mode; the test set was generated with a sample rate of 8 kHz in "max" mode, since the evaluation of word error rates (WERs) needs complete sentences.

We used the mixtures in the training set to train the models, conducted early stopping and model selection on the development set, and report the enhancement results on the test set. Only the first channels of the array were used to train and evaluate the models. The channel order was randomly shuffled during training, where the reverberant single-speaker signal captured by the first channel was regarded as the reference signal and used as the learning target. In evaluation, the default first channel was chosen as the reference.

The experiments were conducted using the PyTorch toolkit [19] and the Asteroid toolkit [20]. The Beam-TasNet is composed of two modules, MC-Conv-TasNet and MVDR beamforming.

TABLE I
The settings of the hyper-parameters of MC-Conv-TasNet in the baseline Beam-TasNet and the proposed Beam-Guided TasNet with 2 stages. The notations follow [3].

Hyper-parameter | Baseline | First/Second stage
N               | 512      | 256
L               | 16       | 16
B               | 128      | 128
S_c             | 128      | 128
H               | 512      | 256
P               |          |
X               |          |
R               |          |
Model size      |          |

Unlike [4], we did not use voice activity detection-based refinement, for simplicity and a fair comparison. The two models in Fig. 1(a) were trained sequentially with permutation invariant training (PIT) and the source-to-noise ratio (SNR) loss [21]. The detailed model architecture is listed in Table I, where the Beam-Guided TasNet has a roughly equal number of parameters to the baseline Beam-TasNet. The learning rate of the Adam [22] optimizer was halved if a lower validation loss was not found within a given number of epochs. The models were trained on fixed-length segments for a maximum number of epochs. After training the first-stage model, we used it to generate the MVDR-beamformed signals with each channel regarded as the reference channel, i.e., multi-channel signal input, multi-channel signal output. A stacked $(C + S)$-channel signal (the $S$-speaker beamformed signals and the $C$-channel mixture signal) was fed into the second model to train the second-stage enhancement.

The STFT in the MVDR used a ms frame length and a ms hop size due to the considerable reverberation time. In the frame-by-frame processing, the MVDR calculation was performed frame-wise to obtain the SCMs, the MVDR filters, and the enhanced signals.

We used BSS-Eval SDR [23] and WERs as the evaluation metrics. The SDR was calculated by comparing the estimated $\hat{x}_{s,1}$ or $\hat{z}_{s,1}$ with the reference signal $x_{s,1}$. The automatic speech recognition (ASR) system was trained following the scripts offered by the spatialized multi-speaker WSJ (SMS-WSJ) dataset [24] to make the WER results reproducible. The acoustic model was trained on single-speaker reverberant utterances. The speakers in the SMS-WSJ training set do not overlap with those in evaluation.
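For reference, a minimal sketch of how the BSS-Eval SDR could be computed is given below, using the mir_eval package as one possible implementation; the paper itself only specifies BSS-Eval [23], so the package choice and the function name `evaluate_sdr` are our assumptions.

```python
import numpy as np
from mir_eval.separation import bss_eval_sources  # one possible BSS-Eval implementation

def evaluate_sdr(reference, estimate):
    """Compare each estimated single-speaker signal (z_hat_{s,1} or x_hat_{s,1})
    against the reverberant reference image x_{s,1} on the first channel.

    reference, estimate: arrays of shape (S, samples)
    Returns the mean SDR in dB over sources (best permutation).
    """
    sdr, sir, sar, perm = bss_eval_sources(np.asarray(reference), np.asarray(estimate))
    return float(np.mean(sdr))
```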
IV. RESULTS AND DISCUSSION

In this section, we first perform an ablation study of the Beam-Guided TasNet and compare its performance with the baseline Beam-TasNet and the oracle MVDR. Then, a causal framework is explored to illustrate the effectiveness of the framework without future information. Finally, we visualize the iterative processing to demonstrate how the framework boosts the performance with the guide of MVDR.
TABLE II
Comparison of Beam-TasNet and Beam-Guided TasNet with different additional input for the 2nd-stage model under the non-causal condition. The angle feature (AF) [11] was obtained from the source direction calculated by SRP-PHAT [25]. The gray cells share the same results with those in Table III.

Model             | Additional input          | SDR ↑ (dB) [$\hat{z}_{s,1}$ / $\hat{x}_{s,1}$] | WER ↓ (%) [$\hat{z}_{s,1}$ / $\hat{x}_{s,1}$]
Beam-TasNet       | -                         |        |
(n = 1)           | $\hat{x}^{(1:1)}_{s,c}$   |        |
(n = 1)           | $\hat{z}^{(1:1)}_{s,c}$   |        |
(n = 1)           | AF                        |        |
Iterative (n = 3) | $\hat{x}^{(n:1)}_{s,c}$   |        |
Oracle mask       | -                         |        |
Oracle signal     | -                         | ∞ /    |

TABLE III
The performance of the causal systems. The gray cells share the same results with those in Table II.
Model             | Causal TasNet | Causal MVDR | SDR ↑ (dB) [$\hat{z}_{s,1}$ / $\hat{x}_{s,1}$] | WER ↓ (%) [$\hat{z}_{s,1}$ / $\hat{x}_{s,1}$]
Beam-TasNet       | ✓             | ✓           |        |
                  | ✗             | ✗           |        |
                  | ✗             | ✓           |        |
                  | ✓             | ✗           |        |
                  | ✓             | ✓           |        |
(n = 1)           | ✓             | ✓           |        |
Iterative (n = 3) | -             | ✓           |        |
Oracle mask       | -             | ✓           |        |
Oracle signal     | -             | ✓           | ∞ /    |

Table II lists the SDR and WER results of the baseline Beam-TasNet and the proposed Beam-Guided TasNet under the non-causal condition. Compared with the baseline Beam-TasNet, the first-stage model adopted a smaller model and achieved an SDR degradation of . dB and a WER degradation of . . Using the second stage yielded SDR improvement and WER reduction with the extra input of $\hat{x}_{s,c}$, $\hat{z}_{s,c}$, or the angle feature. The one with $\hat{x}_{s,c}$ obtained the best performance, with the SDR improved by . dB and the WER reduced by . compared with the first stage. The MVDR beamformer is thought to play a crucial role in the performance improvement, since its output $\hat{x}^{(1:1)}_{s,c}$ presented a much higher SDR than $\hat{z}^{(1:1)}_{s,c}$. With the Beam-Guided TasNet and iterative processing, the SDR and the WER were optimized to 20.7 dB and . . On the other hand, for the oracle MVDR, $\hat{z}_{s,1}$ equals $x_{s,1}$ for the oracle signal, and $\hat{z}_{s,1}$ was calculated based on the ideal ratio masks for the oracle mask. The proposed Beam-Guided TasNet dramatically narrowed the SDR and WER gaps with the oracle signal-based MVDR to 2.9 dB and . , and exceeded those of the oracle mask-based MVDR by . dB and . , respectively.

Table III lists the experimental results with the causal model and MVDR. Introducing causality into MC-Conv-TasNet and MVDR degraded the performance: the SDR and the WER were . dB and . worse than the non-causal model in the first stage. With the Beam-Guided TasNet and iterative processing, the SDR and the WER were optimized from . dB and . to . dB and . . Again, the Beam-Guided TasNet exceeded the oracle mask-based MVDR and the baseline Beam-TasNet by . dB and . dB, respectively. Besides, we found that the oracle signal exhibited a narrow SDR gap of . dB between the non-causal ( . dB) and causal MVDR ( . dB). The reason might be that the signals estimated by MC-Conv-TasNet and the oracle masks presented a more variable inter-channel pattern, which led to inaccurate computation in the causal MVDR, i.e., the oracle mask removed the phase information, while the MC-Conv-TasNet was trained without an inter-channel constraint. A similarity measurement between the causal and non-causal filters can be found in Appendix B.

Fig. 2. SDR (dB)/WER (%) vs. iteration:stage (n:1/2) under the causal/non-causal condition. The dashed lines are the results of the baseline Beam-TasNet.

The iterative processing is visualized in Fig. 2, where the SDR and WER curves exhibit nearly the same trend under the non-causal and causal settings. We explain the following phenomena. First, the SDR curves rise and intersect, indicating that the Beam-Guided TasNet leveraged the strengths of MC-Conv-TasNet and MVDR to optimize each other. With a more accurate estimation of the SCMs, the MVDR beamforming gradually got closer to its upper bound. However, the output of the MC-Conv-TasNet in the current iteration could always achieve a better SDR than the output of the MVDR in the previous iteration, which made $\hat{z}^{(n:2)}_{s,1}$ surpass $\hat{x}^{(n:2)}_{s,1}$ at some point. Second, we found that after 2 or 3 iterations, the Beam-Guided TasNet achieved the best performance. More iterations might lead to performance degradation due to negative feedback. This trend was also observed with other extra inputs, whose results can be found in Appendix C.
Third, the WER gap between $\hat{z}^{(n:2)}_{s,1}$ and $\hat{x}^{(n:2)}_{s,1}$ was eliminated after a few iterations. Under the non-causal condition, the distortionless $\hat{x}^{(n:2)}_{s,1}$ exhibited slightly lower WERs. Under the causal condition, however, the WER curve indicated that $\hat{z}^{(n:2)}_{s,1}$ obtained better signal quality due to the inaccurate MVDR filter.

V. CONCLUSION
In this letter, we propose the Beam-Guided TasNet with a 2-stage framework, which refines the multi-channel BSS iteratively with the guide of beamforming. The framework accepts multi-channel input and generates multi-channel signals for the different speakers. The experiments presented considerable SDR improvements of 4.2 dB and . dB over the baseline Beam-TasNet under the non-causal and causal conditions, respectively. In future work, we will further explore the design of MIMMO with novel network architectures and loss constraints.

REFERENCES

[1] H. Erdogan, J. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," in INTERSPEECH, 2016.
[2] N. Kanda, C. Böddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, and R. Haeb-Umbach, "Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party ASR," in INTERSPEECH, 2019.
[3] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1256-1266, 2019.
[4] T. Ochiai, M. Delcroix, R. Ikeshita, K. Kinoshita, T. Nakatani, and S. Araki, "Beam-TasNet: Time-domain audio separation network meets frequency-domain beamformer," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6384-6388, 2020.
[5] Z.-Q. Wang, J. Le Roux, and J. Hershey, "Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation," pp. 1-5, 2018.
[6] Z. Wang and D. Wang, "Combining spectral and spatial features for deep learning based blind speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 457-468, 2019.
[7] L. Chen, M. Yu, D. Su, and D. Yu, "Multi-band PIT and model integration for improved multi-channel speech separation," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 705-709, 2019.
[8] R. Gu, J. Wu, S. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, "End-to-end multi-channel speech separation," ArXiv, vol. abs/1905.06286, 2019.
[9] J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, "On end-to-end multi-channel time domain speech separation in reverberant environments," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6389-6393, 2020.
[10] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, "End-to-end microphone permutation and number invariant multi-channel speech separation," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6394-6398, 2020.
[11] R. Gu, L. Chen, S. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, "Neural spatial filter: Target speaker speech separation assisted with directional information," in INTERSPEECH, 2019.
[12] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, "Multi-channel overlapped speech recognition with location guided speech extraction network," pp. 558-565, 2018.
[13] Z. Zhang, Y. Xu, M. Yu, S. Zhang, L. Chen, and D. Yu, "ADL-MVDR: All deep learning MVDR beamformer for target speech separation," arXiv: Audio and Speech Processing, 2020.
[14] R. Gu, S. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu, "Multi-modal multi-channel target speech separation," IEEE Journal of Selected Topics in Signal Processing, vol. 14, pp. 530-541, 2020.
[15] R. Gu and Y. Zou, "Temporal-spatial neural filter: Direction informed end-to-end multi-channel target speech separation," ArXiv, vol. abs/2001.00391, 2020.
[16] M. Souden, J. Benesty, and S. Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, pp. 260-276, 2010.
[17] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, "SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures," IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 800-814, 2019.
[18] S. Sonning, C. Schüldt, H. Erdogan, and S. Wisdom, "Performance study of a convolutional time-domain audio separation network for real-time speech denoising," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 831-835, 2020.
[19] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[20] M. Pariente, S. Cornell, J. Cosentino, S. Sivasankaran, E. Tzinis, J. Heitkaemper, M. Olvera, F.-R. Stöter, M. Hu, J. M. Martín-Doñas, D. Ditter, A. Frank, A. Deleforge, and E. Vincent, "Asteroid: The PyTorch-based audio source separation toolkit for researchers," in Proc. Interspeech, 2020.
[21] J. Le Roux, S. Wisdom, H. Erdogan, and J. Hershey, "SDR - half-baked or well done?" in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626-630, 2019.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2015.
[23] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1462-1469, 2006.
[24] L. Drude, J. Heitkaemper, C. Böddeker, and R. Haeb-Umbach, "SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition," ArXiv, vol. abs/1910.13934, 2019.
[25] M. Brandstein and H. Silverman, "A robust method for speech signal time-delay estimation in reverberant rooms," vol. 1, pp. 375-378, 1997.
APPENDIX
A. Frame-by-frame processing
For online frame-by-frame processing, the permutation solver calculates metrics based on the received signal to reorder the sources frame by frame. In our practice, distance measures such as the Euclidean norm and the correlation achieve similar performance. Here we use the SNR to reorder the sources, which corresponds to the Euclidean norm. The causal permutation solver obtains the order $\hat{\pi}_{c,t}$, which can be expressed as

$\hat{\pi}_{c,t} = \arg\max_{\pi_{c,t}} \sum_{s=1}^{S} \mathrm{SNR}\big(\hat{x}_{s,1}[0:n_t],\, \hat{x}_{\pi_{c,t}(s),c}[0:n_t]\big)$,   (12)

where $n_t$ denotes the number of received samples up to frame $t$. The SCMs are updated as follows:

$\hat{\Phi}^{\mathrm{Target}_s}_{t,f} = \frac{t-1}{t}\hat{\Phi}^{\mathrm{Target}_s}_{t-1,f} + \frac{1}{t}\hat{Z}_{s,t,f}\hat{Z}^{H}_{s,t,f}$,   (13)

$\hat{\Phi}^{\mathrm{Interfer}_s}_{t,f} = \frac{t-1}{t}\hat{\Phi}^{\mathrm{Interfer}_s}_{t-1,f} + \frac{1}{t}(Y_{t,f}-\hat{Z}_{s,t,f})(Y_{t,f}-\hat{Z}_{s,t,f})^{H}$,   (14)

where $\hat{Z}_{s,t,f}$ is reordered by $\hat{\pi}_{c,t}$.
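The recursive averages in Eqs. (13)-(14) can be implemented per frame as in the NumPy sketch below; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def update_scms(phi_target, phi_interfer, Y_t, Z_t, t):
    """Frame-by-frame SCM update for one source (Eqs. 13-14).

    phi_target, phi_interfer: running SCMs, shape (F, C, C), complex
    Y_t: mixture STFT frame, shape (F, C); Z_t: reordered source estimate frame, (F, C)
    t:   1-based frame index
    """
    N_t = Y_t - Z_t                                     # interference component of this frame
    zz = Z_t[:, :, None] * Z_t[:, None, :].conj()       # per-frequency outer products, (F, C, C)
    nn = N_t[:, :, None] * N_t[:, None, :].conj()
    phi_target = (t - 1) / t * phi_target + zz / t      # Eq. (13)
    phi_interfer = (t - 1) / t * phi_interfer + nn / t  # Eq. (14)
    return phi_target, phi_interfer
```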
The angle feature depicts the similarity of the inter-channel time delay between the multi-channel signal and the steering vector through the phase difference [11]. The similarity between the non-causal and causal MVDR filters is calculated in a similar way:

$\alpha_{t,f,s,c} = \frac{\mathrm{MVDR}(\hat{\Phi}^{\mathrm{Target}_s}_{t,f}, \hat{\Phi}^{\mathrm{Interfer}_s}_{t,f}, c)}{\mathrm{MVDR}(\hat{\Phi}^{\mathrm{Target}_s}_{t,f}, \hat{\Phi}^{\mathrm{Interfer}_s}_{t,f}, 1)}$,   (15)

$\mathrm{SM}_t = \frac{1}{F \times S \times (C-1)} \sum_{f,s,c \neq 1} \frac{\alpha^{H}_{T,f,s,c}\,\alpha_{t,f,s,c}}{|\alpha_{T,f,s,c}|\,|\alpha_{t,f,s,c}|}$,   (16)

where $\alpha_{t,f,s,c}$ denotes the difference between the 1st and the $c$-th channel of the causal MVDR filter calculated from $t$ frames for source $s$ and frequency $f$, $|\cdot|$ takes the gain information, and $\mathrm{SM}_t$ is the averaged similarity at time frame $t$.

We present the growth curve of the similarity of the filters calculated under the causal and non-causal conditions (Fig. 3). We use the real part of $\mathrm{SM}_t$, where a high value represents high similarity. The filters obtained from the oracle signal reached a high similarity with less than 3 seconds of signal, whereas the MVDR using the signals inferred from the Beam-Guided TasNet and the oracle masks took more than 5 seconds. We think that high similarity represents a stable pattern of the inter-channel time delay, while low similarity indicates the lack of an inter-channel constraint. The curves reflect the large performance gap between the non-causal and causal MVDR for the causal Beam-Guided TasNet ( . dB vs. . dB) and the oracle frequency-domain mask ( . dB vs. . dB), and the small gap for the oracle signal ( . dB vs. . dB).
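A possible NumPy rendering of Eqs. (15)-(16) is sketched below, assuming the causal and utterance-level MVDR filters are already available as arrays; the small `eps` term and the function name are our additions for numerical safety and illustration.

```python
import numpy as np

def filter_similarity(w_causal, w_full, eps=1e-8):
    """Cosine-style similarity between causal MVDR filters (first t frames)
    and the utterance-level (non-causal) filters (Eqs. 15-16).

    w_causal, w_full: MVDR filters, shape (S, F, C), complex
    Returns the real part of SM_t averaged over sources, frequencies,
    and channels c != 1.
    """
    # Eq. (15): inter-channel ratios relative to the first channel
    a_c = w_causal[:, :, 1:] / (w_causal[:, :, :1] + eps)   # (S, F, C-1)
    a_f = w_full[:, :, 1:] / (w_full[:, :, :1] + eps)
    # Eq. (16): normalized inner product, averaged over s, f, c
    sm = (a_f.conj() * a_c) / (np.abs(a_f) * np.abs(a_c) + eps)
    return float(np.mean(sm.real))
```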
Fig. 3. The growth curves of the similarity between the causal and non-causal MVDR filters with different signals. The curves are obtained by averaging the growth curves over the test set.

Fig. 4. SDRs/WERs vs. iteration:stage with different input under the non-causal condition. The dashed lines are the results of the baseline Beam-TasNet.
C. SDR and WER trends with additional input
Fig. 4 plots the SDR and WER trends with different extra input under the non-causal condition. With $\hat{z}_{s,c}$ and the angle feature, the second-stage model can achieve better performance than the first-stage model, but fails to surpass the signals estimated by the MVDR. Meanwhile, the iterative processing led to little further improvement. Performance degradation was noticed in Fig. 4(a)-(b) after more iterations. It was expected that the second-stage model would work as a positive feedback system. However, as the output presented less SDR improvement, the model turned into a negative feedback system, which was more evident in Fig. 4(b). Since $\hat{x}_{s,c}$