Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism
Jisi Zhang, Cătălin Zorilă, Rama Doddipatla and Jon Barker
University of Sheffield, Department of Computer Science, Sheffield, UK
Toshiba Cambridge Research Laboratory, Cambridge, UK
ABSTRACT
In this paper, we present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network which employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To efficiently inform the extraction model of the speaker information, we propose a new speaker conditioning mechanism by designing an additional speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves the source separation performance by 9% relative over a strong multi-channel baseline, and it increases the speech recognition accuracy by more than 16% relative over the same baseline.
Index Terms — multi-channel source separation, multi-speaker extraction, noise, reverberation
1. INTRODUCTION
Speech separation aims to segregate individual speakers from a mixture signal, and it can be used in many applications, such as speaker diarization, speaker verification or multi-talker speech recognition. Deep learning has allowed an unprecedented separation accuracy compared with traditional signal processing based methods; however, there are still challenges to address. For instance, in blind source separation, the order of the output speakers is arbitrary and unknown in advance, which forms a speaker label permutation problem during training. Clustering based methods [1] or, more recently, the Permutation Invariant Training (PIT) technique [2] have been proposed to alleviate this issue. Although PIT forces the frames belonging to the same speaker to be aligned with the same output stream, frames inside one utterance can still flip between different sources, leading to poor separation performance. Alternatively, the initial PIT-based separation model can be further trained with a fixed label training strategy [3], or a long term dependency can be imposed on the output streams by adding an additional speaker identity loss [4, 5]. Another issue in blind source separation is that the speaker order of the separated signals during inference is also unknown, and needs to be identified by a speaker recognition system.

An alternative solution to the label permutation problem is to perform target speaker extraction [6–8]. In this case, the separation model is biased with information about the identity of the target speaker to extract from the mixture. Typically, a speech extraction system consists of two networks, one to generate speaker embeddings, and another one to perform speech extraction. The speaker embedding network outputs a speaker representation from an enrollment signal uttered by the target. The speaker embedding network can be either jointly trained with the speech extraction model to minimise the enhancement loss, or trained on a different task, i.e., a speaker recognition task, to access larger speaker variations [9]. The target speaker embedding is usually inserted into the middle-stage features of the extraction network by using multiplication [7] or concatenation operations [8, 10]; however, the shared middle-features in the extraction model may not be optimal for both tasks of speaker conditioning and speech reconstruction.

Most of the existing speech extraction models enhance only one target speaker each time and ignore speech from other speakers. When multiple speakers are of interest, the extraction model has to be applied several times, which is inconvenient and requires more computational resources. Therefore, a system capable of simultaneously extracting multiple speakers from a mixture is of practical importance. Recently, a speaker-conditional chain model (SCCM) has been proposed that firstly infers speaker identities, then uses the corresponding speaker embeddings to extract all sources [11]. However, SCCM is still trained with the PIT criterion, and the output order of the separated signals is arbitrary. Lastly, when multiple microphones are available, the spatial information has been shown to improve the performance of both separation and extraction [7, 12] systems in clean and reverberant environments.
So far, the spatial information has not been tested with a multi-speaker extraction system, nor has it been evaluated in noisy and reverberant environments.

In this paper, we reformulate our previous multi-channel speech separation design in [12] as a multi-talker speech extraction system. The proposed system uses embeddings from all speakers in the mixture to simultaneously extract all sources, and does not require PIT to solve the label permutation problem. There are three main contributions in this work. Firstly, we improve our previous multi-channel system in [12] by swapping the Temporal fully-Convolutional Network (TCN) blocks with U-Convolutional blocks, which yielded promising results for a recent single-channel speech separation model [13]. Secondly, the modified system is reformulated to perform multi-speaker extraction, and, lastly, a novel speaker conditioning mechanism is proposed that exploits the speaker embeddings more effectively. The evaluation is performed with multi-channel noisy and reverberant 2-speaker mixtures. We show that combining the updated multi-channel structure and the proposed speaker conditioning mechanism leads to a significant improvement in terms of both the separation metric and speech recognition accuracy.

The rest of the paper is organised as follows. In Section 2, we introduce the proposed multi-channel speech extraction approach. Section 3 presents implementation details and the experiment setup. Results and analysis are presented in Section 4. Finally, the paper is concluded in Section 5.
2. MULTI-CHANNEL END-TO-END EXTRACTION
Recently, neural network based multi-channel speech separation approaches have achieved state-of-the-art performance by directly processing time-domain speech signals [12, 14]. These systems incorporate a spectral encoder, a spatial encoder, a separator, and a decoder. In [12], spatial features are input to the separator only. In this work, we simplify the previous framework by combining the spatial and spectral features as depicted in Figure 1. We found the proposed approach to be beneficial for the speech extraction task. The spectral encoder and spatial encoder independently generate N-dimensional single-channel representations and S-dimensional multi-channel representations, respectively. The spectral encoder is a 1-D convolutional layer, and the spatial encoder is a 2-D convolutional layer. The encoded single-channel spectral features and two-channel spatial features are concatenated together to form multi-channel representations with a dimension of (N + S), which are accessed by both the separation module and the decoder. The separator estimates linear weights for combining the multi-channel representations to generate separated representations for each source. Finally, the decoder (a 1-D convolutional layer) reconstructs the estimated signals by inverting the separated representations back to time-domain signals.

Fig. 1. Updated multi-channel model structure

Compared with our previous work [12], we also upgrade the separator by replacing the original TCN [15] blocks with U-Convolutional blocks (U-ConvBlock), which have proven to be more effective at modelling sequential signals in the single-channel speech separation task [13]. Furthermore, a system built on U-ConvBlock requires fewer parameters and floating point operations compared with systems built on TCN or recurrent neural network architectures [16]. The U-ConvBlock (Figure 2) extracts information from multiple resolutions using Q successive temporal downsampling and Q upsampling operations, similar to a U-Net structure [17]. The channel dimension of the input to each U-ConvBlock is expanded from C to C_U before downsampling, and is contracted to the original dimension after upsampling. The updated separation module is shown in Figure 3 and consists of an instance normalisation layer, a bottleneck layer, B stacked U-ConvBlocks and a 1-D convolutional layer with a non-linear activation function. We choose to use an instance normalisation layer [18] rather than global layer normalisation for the first layer normalisation, as the latter would normalise over the channel dimension, which is inappropriate given the heterogeneous nature of the concatenated features.
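To make the front-end concrete, the following is a minimal PyTorch-style sketch of the spectral/spatial encoder concatenation described above. The kernel size, stride, feature dimensions and the use of the first microphone as the reference channel are illustrative assumptions, not the exact configuration used in the experiments.

```python
# Sketch (not the authors' code) of the encoder front-end: a 1-D convolutional
# spectral encoder on one reference channel and a 2-D convolutional spatial
# encoder spanning both channels, concatenated into (N + S)-dimensional features.
import torch
import torch.nn as nn


class MultiChannelEncoder(nn.Module):
    def __init__(self, n_spectral=256, s_spatial=128, kernel=21, stride=10):
        super().__init__()
        # Spectral encoder: 1-D convolution applied to the reference channel.
        self.spectral = nn.Conv1d(1, n_spectral, kernel, stride=stride, bias=False)
        # Spatial encoder: 2-D convolution whose kernel spans both microphones.
        self.spatial = nn.Conv2d(1, s_spatial, (2, kernel), stride=(1, stride), bias=False)

    def forward(self, mixture):
        # mixture: (batch, channels=2, samples)
        spec = torch.relu(self.spectral(mixture[:, :1]))       # (B, N, T)
        spat = torch.relu(self.spatial(mixture.unsqueeze(1)))  # (B, S, 1, T)
        spat = spat.squeeze(2)                                  # (B, S, T)
        return torch.cat([spec, spat], dim=1)                   # (B, N + S, T)
```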
Fig. 2. U-Conv block structure
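For reference, a simplified sketch of the U-ConvBlock idea (channel expansion from C to C_U, Q successive temporal downsamplings, upsampling with skip connections, and a residual output) is given below. This is an assumption-laden approximation of the SuDoRM-RF building block of [13], not the authors' implementation; kernel sizes and the use of depthwise convolutions are our choices.

```python
# Rough sketch of a U-ConvBlock-style multi-resolution block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UConvBlockSketch(nn.Module):
    def __init__(self, c=256, c_u=512, q=4):
        super().__init__()
        self.expand = nn.Conv1d(c, c_u, 1)            # expand C -> C_U
        self.down = nn.ModuleList(
            [nn.Conv1d(c_u, c_u, 5, stride=2, padding=2, groups=c_u) for _ in range(q)]
        )
        self.contract = nn.Conv1d(c_u, c, 1)          # contract C_U -> C

    def forward(self, x):
        # x: (batch, C, T)
        y = self.expand(x)
        skips = []
        for down in self.down:                        # Q successive downsamplings
            skips.append(y)
            y = F.relu(down(y))
        for skip in reversed(skips):                  # Q upsamplings with skip additions
            y = F.interpolate(y, size=skip.shape[-1], mode="nearest") + skip
        return x + self.contract(y)                   # residual output, (batch, C, T)
```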
Fig. 3. Improved separator with U-Conv blocks

Building on the modified system described above, in this section we introduce a novel multi-channel speech extraction system which simultaneously tracks multiple sources in the mixture. In general, the system uses embeddings from multiple speakers as input, which are used to condition single-source outputs with a consistent speaker order. Common strategies for supplying speaker information to the extraction model are to modulate the speaker features on middle-level features inside the separation model [6, 19] or to concatenate the speaker features with the mixture speech representations [8]. However, it is not trivial to find a single optimal layer at which to insert the speaker features. For instance, the shared middle-features in the extraction model may not be optimal for both speaker conditioning and speech reconstruction.

To address this issue, we propose a new 'speaker stack' for processing the input speaker representations to coordinate with the main separation stack, as shown in Figure 4. The speaker stack takes the encoded multi-channel features and generates two high-level sequential features, which are suitable to receive speaker information from externally computed speaker embeddings. The output of the speaker branch containing speaker information is encouraged to learn similar characteristics as the original multi-channel features, and the two can be concatenated together as input to the separation stack. Note that the encoder is shared by both the speaker stack and the separation stack.
Fig. 4. Proposed multi-channel speech extractor with dedicated speaker stack
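The overall dataflow of Figure 4 can be sketched as a hypothetical forward pass, shown below. All modules (encoder, speaker_stack, separation_stack, decoder) are placeholders for the components described in this section, and the tensor shapes are indicative only; this is not the authors' implementation.

```python
# High-level dataflow sketch of the proposed extractor (Figure 4).
import torch


def extract_all_speakers(mixture, speaker_embeddings,
                         encoder, speaker_stack, separation_stack, decoder):
    # mixture: (batch, 2, samples); speaker_embeddings: list of (batch, emb_dim)
    mix_feats = encoder(mixture)                               # (B, N + S, T), shared encoder
    spk_feats = speaker_stack(mix_feats, speaker_embeddings)   # (B, E, T) speaker features
    # Separation stack sees the concatenated mixture and speaker features and
    # is assumed to return one set of weights per input embedding.
    masks = separation_stack(torch.cat([mix_feats, spk_feats], dim=1))
    # Weights are applied to the encoded mixture and inverted by the decoder,
    # so the output order follows the order of the input embeddings.
    return [decoder(m * mix_feats) for m in masks]             # time-domain estimates
```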
Fig. 5. Internal structure of proposed speaker stack

The speaker stack, illustrated in Figure 5, first employs an instance normalisation, a bottleneck 1-D CNN and a single TCN block to receive the multi-channel features. Then, the output of the TCN block is factorised by an adaptation layer into multiple features for modulation with multiple speaker embeddings, which are transformed with a 1 × 1 convolutional layer to the same feature dimension. The modulated signals from each speaker embedding are concatenated together and processed with a 1-D convolutional layer and a ReLU non-linear activation function to form E-dimensional speaker information features, which have the same time length as the multi-channel features.

The speaker stack and the separation stack are jointly trained to directly optimise the scale-invariant signal-to-noise ratio (SI-SNR) metric [20]:

$$
\text{SI-SNR} = 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{noise}} \rVert^2}, \qquad
s_{\text{target}} = \frac{\langle \hat{s}, s \rangle\, s}{\lVert s \rVert^2}, \qquad
e_{\text{noise}} = \hat{s} - s_{\text{target}}, \tag{1}
$$

where $\hat{s}$ and $s$ denote the estimated and clean source, respectively, and $\lVert s \rVert^2 = \langle s, s \rangle$ denotes the signal power. In contrast with PIT, we condition the decoded signals on the speaker representations and keep the output speaker order consistent with the order of the input speaker embeddings.
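A direct transcription of Eq. (1) as a small PyTorch function is given below (a sketch with our own variable names; the zero-mean step is common practice for SI-SNR rather than something stated above). Since training maximises SI-SNR, the loss is its negative.

```python
# SI-SNR of Eq. (1), batched over (batch, samples) tensors.
import torch


def si_snr(est, src, eps=1e-8):
    # Zero-mean both signals (common practice before computing SI-SNR).
    est = est - est.mean(dim=-1, keepdim=True)
    src = src - src.mean(dim=-1, keepdim=True)
    # s_target = <est, src> * src / ||src||^2
    dot = torch.sum(est * src, dim=-1, keepdim=True)
    s_target = dot * src / (torch.sum(src * src, dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)   # maximise this; use -si_snr(...) as the loss
```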
3. EXPERIMENT SETUP

3.1. Data simulation
The evaluation is performed on the WHAMR! dataset [21], which consists of simulated noisy and reverberant 2-speaker mixtures. WHAMR! is based on Wall Street Journal (WSJ) data, mixed with noise recorded in various urban environments [22], and artificial room impulse responses generated using pyroomacoustics [23] to approximate domestic and classroom environments. There are 20k sentences from 101 speakers for training, and 3k sentences from 18 speakers for testing. The speakers in the test set appear neither during training of the speaker recognition model nor during training of the speaker extraction system. All data are binaural (2-channel) and have an 8 kHz sampling rate.
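For illustration, a WHAMR!-style reverberant two-channel mixture could be simulated with pyroomacoustics roughly as follows. The room size, source and microphone positions, reflection order and placeholder signals are made-up values for this sketch, not the actual WHAMR! configuration.

```python
# Rough illustration of reverberant 2-channel mixture simulation with pyroomacoustics.
import numpy as np
import pyroomacoustics as pra

fs = 8000
speech1 = np.random.randn(4 * fs)   # placeholder for a WSJ utterance
speech2 = np.random.randn(4 * fs)   # placeholder for the interfering utterance

room = pra.ShoeBox([8.0, 6.0, 3.0], fs=fs, max_order=17)   # simple shoebox room
room.add_source([2.0, 3.0, 1.5], signal=speech1)
room.add_source([6.0, 2.5, 1.5], signal=speech2)

# Two-microphone (binaural-like) array; positions are illustrative.
mics = np.array([[3.95, 4.05], [3.0, 3.0], [1.5, 1.5]])    # shape (3, n_mics)
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

room.simulate()                      # convolve sources with the simulated RIRs and sum
mixture = room.mic_array.signals     # (n_mics, n_samples) reverberant noisy-free mixture
```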
The multi-channel separation network in [12] trained with PIT is set as the baseline for comparison. The hyper-parameters of the baseline model are the same as those of the best model in the original paper, namely N = 256, S = 36, R = 3, X = 7, L = 20, and the batch size M = 3. For the U-ConvBlock based separation module, the hyper-parameters are set as for SuDoRM-RF 1.0x in [13], namely L = 21, B = 16, Q = 4, C = 256, C_U = 512, and the training batch size M = 4. Each utterance is split into multiple segments with a fixed length of 4 seconds. The dimension of speaker features, E, in the speaker stack is set to 128. The ADAM optimizer [24] is used for training, and the learning rate is halved if the loss on the validation set does not decrease for 3 consecutive epochs. All models are trained for 100 epochs. The input for all the models is the reverberant mixture with noise, and the targets are the clean individual sources.

We retrained the time-domain speaker recognition model SincNet [25] for speaker embedding generation. Employing the same configuration as in the original paper, SincNet is trained on the clean training set of WSJ0 (101 speakers), using speech segments of 200 ms with 10 ms overlap. The output of the last hidden layer of the final SincNet model represents one frame-level speaker embedding for each 200 ms segment, and an utterance-level embedding is derived by averaging all the frame predictions.

Randomly selecting a single enrollment utterance for generating the speaker embedding leads to poor extraction performance. Therefore, to increase robustness, we follow an averaging strategy to obtain one global embedding for each speaker [26], as sketched below. Specifically, each global speaker embedding is obtained by averaging several embeddings generated from multiple randomly selected utterances belonging to the same speaker. During training, one global speaker embedding is generated by averaging all the utterance-level embeddings from the training utterances belonging to the corresponding speaker. During evaluation, 3 utterances are randomly selected for each speaker, and the utterance-level embeddings from the selected utterances are averaged to form one global embedding. Experiments showed that increasing the number of utterances beyond 3 does not improve performance.
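The averaging strategy can be summarised by the following sketch, where `sincnet_hidden` is a hypothetical callable standing in for the last hidden layer of the trained SincNet model (not an actual API).

```python
# Sketch of frame-level -> utterance-level -> global speaker embedding averaging.
import torch


def utterance_embedding(frames, sincnet_hidden):
    # frames: (n_frames, frame_samples) 200 ms segments from one utterance.
    with torch.no_grad():
        frame_embs = sincnet_hidden(frames)      # (n_frames, emb_dim) frame-level embeddings
    return frame_embs.mean(dim=0)                # utterance-level embedding


def global_embedding(utterances, sincnet_hidden):
    # Average the utterance-level embeddings of several enrollment utterances
    # (3 at evaluation time) into a single global speaker embedding.
    embs = [utterance_embedding(u, sincnet_hidden) for u in utterances]
    return torch.stack(embs).mean(dim=0)
```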
To evaluate the speech recognition performance, two acoustic models have been trained using the WSJ corpus. One model (AM1) was trained on roughly 80 hrs of clean WSJ-SI284 data plus the WHAMR! single-speaker noisy reverberant speech, and the other one (AM2) was trained on the data used for AM1 plus the separated signals from the WHAMR! mixtures in the training set processed by the proposed model. The audio data is downsampled to 8 kHz to match the sampling rate of the data used for the separation experiments. The acoustic model topology is a 12-layer Factorised TDNN [27], where each layer has 1024 units. The input to the acoustic model is 40-dimensional MFCCs and a 100-dimensional i-vector. A 3-gram language model is used during recognition. The acoustic model is implemented with the Kaldi speech recognition toolkit [28]. With our set-up, the ASR results obtained with AM1 on the standard clean WSJ Dev93 and Eval92 sets are 7.2% and 5.0% WER, respectively.
4. RESULTS AND ANALYSIS

4.1. Improved multi-channel separation network
Table 1 reports the separation performance of the improved multi-channel separation network with various configurations. The first observation is that the dimension of the spatial features does not have to be fixed to a small value (typically 36) as mentioned in the previous work. The results show that when the dimension increases, more useful spatial information is extracted and the model benefits more from the multi-channel signals. Replacing the TCN blocks with the stacked U-ConvBlocks provides a larger receptive field due to the successive downsampling operations, and the latter model yields a 0.5 dB SI-SNR improvement. The configuration depicted in the last row of Table 1 is used for the rest of the experiments.
Table 1. Speech separation performance of the improved multi-channel structure on the WHAMR! test set
Model                   S     SI-SNRi
Multi-TasNet (TCN)      36    12.1
Multi-TasNet (TCN)      64    12.2
Multi-TasNet (TCN)      128   12.4
Multi-TasNet (U-Conv)   128   12.9

4.2. Results of speech extraction system
Three subsets of experiments with different speaker information conditioning strategies are performed. The first experiment uses the multiplication strategy applied in SpeakerBeam [7], which modulates the speaker embedding on the middle-stage representations in the separation module, denoted as Multiply. The second experiment repeats and concatenates the speaker embeddings with the spectral and spatial representations before they are fed into the separation module, denoted as Concat. Lastly, the third experiment uses the proposed conditioning mechanism, denoted as Split.
Table 2. Speech extraction performance with the improved multi-channel structure on the WHAMR! test set (columns: Model, PIT, SI-SNRi; rows: the improved PIT-based separation system and the Multiply, Concat and Split extraction variants)
Table 3. Results on different and same gender mixtures (columns: Model, PIT, SI-SNRi for different- and same-gender mixtures; rows include single-channel SuDoRM-RF [13])
Table 4. Comparative results of single- and multi-channel speech separation/extraction on WHAMR! data (columns: Model, Building Unit, PIT, SI-SNRi; rows include single-channel Conv-TasNet [29])
Table 5. Speech recognition results

System                #Ch.   WER(%) AM1   WER(%) AM2
Mixture                -        79.1         77.0
Multi-TasNet [12]      2        37.7          -
Extraction (Split)     2
Noisy Oracle           -        19.8         20.0

Table 5 reports the ASR results. The proposed speech extraction model yields a significant WER reduction over the noisy reverberant mixture and outperforms the strong multi-channel separation baseline. The extraction system can introduce distortions to the separated signals (causing a mismatch problem between training and testing of the acoustic model); therefore, by decoding the data with AM2, the WER is further reduced by 34% relative, which is close to the result obtained with oracle single-speaker noisy reverberant speech (last row in Table 5).

In future work, we plan to exploit other speaker recognition models for embedding generation, and to train these models with larger and more challenging datasets, such as VoxCeleb [30]. Moreover, we will investigate joint training of the speaker embedding and the proposed speech extraction networks, which is expected to benefit both tasks [10].
5. CONCLUSIONS
In this paper, we have presented a multi-channel speech extractionsystem with a novel speaker conditioning mechanism. By introduc-ing an additional speaker branch for receiving external speaker fea-tures, this mechanism solves the problems caused by feature shar-ing from contradicting tasks and difference between multiple inputs,providing a more effective way to use the speaker information toimprove separation performance. Informed by multiple speaker em-beddings, the proposed system is able to simultaneously output cor-responding sources from a noisy and reverberant mixture, without alabel permutation ambiguity. Experiments on WHAMR! simulated2-speaker mixtures have shown that the proposed multi speaker ex-traction approach outperforms a strong blind speech separation base-line based on PIT. . REFERENCES [1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deepclustering: Discriminative embeddings for segmentation andseparation,” in
[2] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
[3] G.-P. Yang, S.-L. Wu, Y.-W. Mao, H.-y. Lee, and L.-s. Lee, "Interrupted and cascaded permutation invariant training for speech separation," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020, pp. 6369–6373.
[4] L. Drude, T. von Neumann, and R. Haeb-Umbach, "Deep attractor networks for speaker re-identification and blind source separation," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2018, pp. 11–15.
[5] E. Nachmani, Y. Adi, and L. Wolf, "Voice separation with an unknown number of multiple speakers," arXiv preprint arXiv:2003.01531, 2020.
[6] K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Černocký, "SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, 2019.
[7] M. Delcroix, T. Ochiai, K. Zmolikova, K. Kinoshita, N. Tawara, T. Nakatani, and S. Araki, "Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020.
[8] M. Ge, C. Xu, L. Wang, C. E. Siong, J. Dang, and H. Li, "SpEx+: A complete time domain speaker extraction network," arXiv preprint arXiv:2005.04686, 2020.
[9] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Proc. Interspeech 2019, 2019.
[10] X. Ji, M. Yu, C. Zhang, D. Su, T. Yu, X. Liu, and D. Yu, "Speaker-aware target speaker enhancement by jointly learning with speaker embedding extraction," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020, pp. 7294–7298.
[11] J. Shi, J. Xu, Y. Fujita, S. Watanabe, and B. Xu, "Speaker-conditional chain model for speech separation and extraction," arXiv preprint arXiv:2006.14149, 2020.
[12] J. Zhang, C. Zorilă, R. Doddipatla, and J. Barker, "On end-to-end multi-channel time domain speech separation in reverberant environments," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020, pp. 6389–6393.
[13] E. Tzinis, Z. Wang, and P. Smaragdis, "Sudo rm -rf: Efficient networks for universal audio source separation," arXiv preprint arXiv:2007.06833, 2020.
[14] R. Gu, S.-X. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, "Enhancing end-to-end multi-channel speech separation via spatial feature learning," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020, pp. 7319–7323.
[15] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks: A unified approach to action segmentation," in Proc. European Conference on Computer Vision, 2016, pp. 47–54.
[16] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020, pp. 46–50.
[17] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[18] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[19] N. Zeghidour and D. Grangier, "Wavesplit: End-to-end speech separation by speaker clustering," arXiv preprint arXiv:2002.08933, 2020.
[20] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?" in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2019, pp. 626–630.
[21] M. Maciejewski, G. Wichern, E. McQuinn, and J. Le Roux, "WHAMR!: Noisy and reverberant single-channel speech separation," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020, pp. 696–700.
[22] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, "WHAM!: Extending speech separation to noisy environments," in Proc. Interspeech 2019, 2019.
[23] R. Scheibler, E. Bezzam, and I. Dokmanić, "Pyroomacoustics: A Python package for audio room simulation and array processing algorithms," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2018, pp. 351–355.
[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[25] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018.
[26] W. Li, P. Zhang, and Y. Yan, "Target speaker recovery and recognition network with average x-vector and global training," in Proc. Interspeech 2019, 2019, pp. 3233–3237.
[27] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Proc. Interspeech 2018, 2018, pp. 3743–3747.
[28] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
[29] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, 2019.
[30] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech 2018, 2018.