TransMask: A Compact and Fast Speech Separation Model Based on Transformer
Zining Zhang, Bingsheng He, Zhenjie Zhang
PVoice Technology, Singapore
School of Computing, National University of Singapore
ABSTRACT
Speech separation is an important problem in speech processing, which aims to separate and generate clean speech from a mixed audio recording containing speech from different speakers. Empowered by deep learning technologies for sequence-to-sequence modeling, recent neural speech separation models are now capable of generating highly clean speech audio. To make these models more practical by reducing the model size and inference time while maintaining high separation quality, we propose a new transformer-based speech separation approach, called TransMask. By fully unleashing the power of self-attention in capturing long-term dependencies, we demonstrate that TransMask is more than 60% smaller and more than 2 times faster at inference than state-of-the-art solutions. TransMask fully utilizes parallelism during inference, and achieves nearly linear inference time within reasonable input audio lengths. It also outperforms existing solutions in output speech quality, achieving SDR above 16 on the LibriMix benchmark.
Index Terms — speech separation, transformer, deep learning
1. INTRODUCTION
It is usually difficult to separate clean speech from real-world audio recordings. In practice, speech processing systems are expected to handle noisy audio in which background music or even speech from other speakers is present in the clip. It is therefore crucial to separate the speech of different speakers before proceeding to further processing and analysis, such as automatic speech recognition (ASR).

With the explosive development of deep learning, recent neural speech separation models, such as TasNet [1], Conv-TasNet [2], Deep CASA [3] and DPRNN [4], have achieved significant quality improvements over traditional approaches. The common strategy in these approaches is to mask the time or frequency domain, with the masks produced by a deep neural network over the audio signals. The clean audio is then generated by reconstructing the signals based on the masks corresponding to individual speakers. As two common types of neural networks, both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have been employed for generating the masks. Recently, self-attention models, such as the Transformer and its variants [5, 6, 7], have performed particularly well in sequence-to-sequence domains, in applications such as neural machine translation and speech recognition, because of the power of the self-attention structure in maintaining long-term dependencies. Such advantages are believed to be beneficial to speech separation as well, since common challenges in speech separation, such as channel swap, can be resolved when long-term dependencies are well captured.

Moreover, existing RNN-based speech separation models do not scale well with audio length in model training and inference. The theoretical bound on inference time, for example, is linear in the length of the audio representation. CNN-based speech separation models are faster, but their output quality is outperformed by RNN-based models because of their limited receptive fields. Transformers, on the other hand, have the potential to overcome the limitations of both types of models, because 1) self-attention naturally has the maximal receptive field, and 2) the inference can be easily parallelized. This means that the cost of sequential operations is nearly constant given sufficient resources, and the quality of Transformer-based models is expected to be comparable to or even better than RNN models.

However, as self-attention does not impose auto-regressive regularization as RNNs do, it may suffer from a lack of short-term dependencies. Therefore, it remains difficult to design a Transformer-based speech separation model meeting all the expectations above. Some attempts at transformer-based speech separation models [8, 9, 10] try to solve the problem by either introducing an RNN in every transformer layer or by using a large transformer model, which sacrifices efficiency. This motivates us to develop a compact, fast, yet effective speech separation model with the help of the following designs:
STRNN, Sandwich-Norm Transformer Layer, and Dual-Temporal Convolutional Encoding. In summary, the core contributions of this paper include:

1. We propose a new model, called TransMask, incurring nearly constant inference time cost;
2. We design TransMask to be the smallest deep learning model for speech separation in the literature, 60% smaller than the state-of-the-art solution;
3. We evaluate TransMask on the LibriMix benchmark and demonstrate outstanding SDR performance.
2. RELATED WORK
The speaker permutation problem is a significant problem in speech separation. Based on the solution to this problem, we categorize speech separation neural models into two categories: deep clustering and permutation invariant training (PIT). We focus on the prevailing PIT-based models in this paper.

PIT calculates the loss over all permutations of the outputs for the different speakers, and uses the smallest one as the objective for optimization. There are frame-level PIT (tPIT) and utterance-level PIT (uPIT) [11]. uPIT is prevailing since tPIT is prone to the problem of channel swap. TasNet [1] is one of the successful models using uPIT. It deploys stacked LSTMs as the separation module. Conv-TasNet [2] adopts a similar architecture, but it makes use of a Temporal Convolutional Network (TCN) instead of an RNN. It achieves better performance by making the neural network deeper with the help of the TCN. DPRNN [4] uses RNN modules, but with a dual-path procedure, which helps the model achieve better performance than Conv-TasNet.

Most of the existing studies of speech separation are based on the time-frequency representation of the original audio. Instead, the TasNet model and its variants apply a trainable encoder and decoder directly over the time domain. The encoder produces a 2D representation from the temporal sequence, and the decoder converts this time-frequency-like representation back to a temporal sequence. The separation module operates on this 2D representation and produces the masks of the clean sources.

There are also other applications of transformers. DPTNet [8] uses a similar structure to DPRNN, but replaces the RNN modules with transformers. Note that the transformers in DPTNet utilize an RNN as the feed-forward network, which means that DPTNet is still an auto-regressive model. Thus the performance improvement of DPTNet may come from the increased model complexity. Sepformer [9] uses pure transformers, but its model size is at least 10 times larger than TransMask. Conformer [10] adopts a structure which has proved successful in speech recognition; this work mainly focuses on the Continuous Speech Separation (CSS) scenario.
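To make the uPIT objective concrete, the following is a minimal sketch (our own illustration, not code from any of the cited papers) of permutation-invariant training with a generic pairwise loss: every assignment of estimated sources to reference sources is scored, and the best assignment per utterance defines the training loss.

```python
import itertools
import torch

def upit_loss(estimates, targets, pairwise_loss):
    """Utterance-level PIT: try every speaker permutation of the estimated
    sources and keep the permutation with the lowest loss per utterance.
    estimates, targets: tensors of shape (batch, n_src, time)."""
    n_src = estimates.shape[1]
    best = None
    for perm in itertools.permutations(range(n_src)):
        # average the pairwise loss over sources for this assignment
        loss = torch.stack(
            [pairwise_loss(estimates[:, i], targets[:, j])
             for i, j in enumerate(perm)]).mean(dim=0)
        best = loss if best is None else torch.minimum(best, loss)
    return best.mean()

# toy usage with a simple per-utterance MSE as the pairwise loss
est = torch.randn(4, 2, 16000)
ref = torch.randn(4, 2, 16000)
mse = lambda e, t: ((e - t) ** 2).mean(dim=-1)
print(upit_loss(est, ref, mse))
```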
3. PRELIMINARIES

Transformer:
The original transformer [5] is mainly used in natural language processing tasks such as language modeling and neural machine translation. The core component of the transformer is the self-attention module followed by a feed-forward network. Each element of the sequence attends to all the elements of the same sequence.

Thanks to this self-attention mechanism, the transformer is able to capture dependencies over arbitrarily long distances. Although it is better than CNNs or RNNs at capturing long-term dependencies, it may overlook local dependencies because of the lack of positional information. It is therefore important to introduce positional encoding into the Transformer to enhance the model with local dependency information. Even with positional encoding, the model still needs deep transformer layers and plenty of training to fully capture local dependencies. To overcome this drawback, some studies, e.g., the R-transformer [12], inject RNN modules into transformers; others [7, 6] attend only to a local window or to elements at strided distances. Different from such methods, we use a strided sparse transformer [7] only for handling the long-term dependencies.
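As an illustration of the strided attention pattern mentioned above, the sketch below builds a boolean attention mask in which each position may only attend to positions a multiple of the stride away. This is our own simplified strided-only illustration, not the exact sparse pattern of [7], which also includes a local component.

```python
import torch

def strided_attention_mask(seq_len: int, stride: int) -> torch.Tensor:
    """Boolean mask where True means 'may attend': position i attends to
    every position j with (i - j) divisible by the stride."""
    idx = torch.arange(seq_len)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)) % stride == 0

print(strided_attention_mask(8, 4).int())
```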
DPRNN:
DPRNN [4] is a variant of TasNet [1]. They share a similar architecture, consisting of an encoder, a separator and a decoder. The encoder and the decoder can be viewed as trainable substitutes for the STFT and inverse STFT, as discussed in the last section. DPRNN uses a dual-path process to integrate RNNs into the separator. The model first splits the sequence into overlapping chunks, and then performs two passes of RNN: within the chunks first, and across the chunks later. The input audio with L frames is first padded so that it can be evenly divided into chunks; let P be the length of a chunk (consecutive chunks overlap by a fixed number of frames) and S the number of chunks. The intra-chunk RNN captures local dependencies and is applied along the dimension of length P. The inter-chunk RNN handles long-term dependencies and is applied along the dimension of length S. After applying this dual-path RNN multiple times, DPRNN obtains promising results on speech separation tasks. For DPRNN, the inference efficiency depends on the length of the audio, denoted by N: if the chunk size is fixed, the time complexity of DPRNN's inference is linear in the audio length, i.e., O(N).
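A minimal sketch of the dual-path segmentation described above (our own illustration; the 50% chunk overlap and the shapes are assumptions): the frame sequence is padded, split into overlapping chunks of length P, and stacked so that intra-chunk and inter-chunk modules can run along the two resulting time axes.

```python
import torch
import torch.nn.functional as F

def segment(x: torch.Tensor, P: int) -> torch.Tensor:
    """Split (batch, channels, length) into overlapping chunks of length P
    with hop P // 2 (50% overlap), returning (batch, channels, P, S).
    Assumes length >= P."""
    hop = P // 2
    B, C, L = x.shape
    # pad the time axis so the last chunk is complete
    pad = (hop - (L - P) % hop) % hop
    x = F.pad(x, (0, pad))
    chunks = x.unfold(2, P, hop)        # (B, C, S, P)
    return chunks.permute(0, 1, 3, 2)   # (B, C, P, S)

x = torch.randn(2, 64, 16000)
print(segment(x, P=250).shape)          # intra-chunk axis P, inter-chunk axis S
```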
4. TRANSMASK
In order to utilize the capability of transformers to handle long-term dependencies, while keeping the auto-regressive regularization which provides important local dependency information, TransMask adopts and enhances the ideas of the strided sparse transformer and the dual-path process from DPRNN.
STRNN:
The mixture audio first goes through a chunk-level process in a bidirectional LSTM layer, and is then passed to a strided transformer structure for an inter-chunk attention process. This architecture connects each frame of the sequence to two kinds of contexts: a local context and a strided context. The local context is processed by the RNN, and the strided context is handled by the transformer. We call this structure a strided transformer with RNN, or STRNN in short. The difference among STRNN, DPRNN and the strided sparse transformer is illustrated in Figure 2. The grey cell in each diagram is the frame the model is currently processing. The orange cells are the frames connected to the current frame by RNN modules, and the green cells are the frames connected to the current frame by self-attention modules. Figure 2(a) shows that DPRNN connects both the local context and the strided context to the current frame with RNN modules. Figure 2(b) shows that the strided sparse transformer connects both contexts using only self-attention. In Figure 2(c), our STRNN strategy works in a different way: it connects the strided context using self-attention while using an RNN for the local context. The overall architecture is shown in Figure 1; it is similar to DPRNN except that the separator module is replaced by a stack of STRNN layers.

Due to the transformer's strong capability of capturing dependencies over the whole sequence, the proposed model is expected to achieve better results than DPRNN while using fewer parameters. Regarding inference, the self-attention module of STRNN makes the calculation easy to parallelize, and the RNN module runs over sequences with a fixed chunk size. Therefore, the cost of sequential operations (mainly the RNN) of STRNN is O(1), which ensures the promising inference efficiency of TransMask.

Fig. 1. The overall architecture of TransMask: a Transformer is run over a group of RNNs in order to better capture both local and global dependencies over the time domain. [Diagram: mixture waveform (1 channel, length L) → Encoder → internal 2D representation (C channels) → Separator (stack of STRNN layers producing masks) → Decoder → separated waveforms.]
Fig. 2. Examples demonstrating the different processing strategies in (a) DPRNN, (b) the sparse transformer, and (c) the STRNN of TransMask.
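To make the STRNN idea concrete, below is a minimal sketch of one STRNN layer (our own simplified reading of the description above and of Figures 1-2, not the released implementation): a bidirectional LSTM runs along the intra-chunk (local) axis, and a transformer encoder layer attends along the inter-chunk (strided) axis. The dimensions follow Section 5; for brevity the stock pre-norm transformer layer is used here instead of the sandwich-norm variant described below.

```python
import torch
import torch.nn as nn

class STRNNLayer(nn.Module):
    """Sketch of a strided transformer with RNN (STRNN) layer.
    Input: (batch, channels, P, S), where P is the chunk length (local axis)
    and S is the number of chunks (strided axis)."""

    def __init__(self, channels: int = 64, ff_dim: int = 256, heads: int = 4):
        super().__init__()
        self.intra_rnn = nn.LSTM(channels, channels // 2, batch_first=True,
                                 bidirectional=True)
        self.intra_norm = nn.GroupNorm(1, channels)  # global layer norm
        self.inter_attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=ff_dim,
            activation="gelu", norm_first=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, P, S = x.shape
        # local context: BiLSTM over the intra-chunk axis (length P)
        intra = x.permute(0, 3, 2, 1).reshape(B * S, P, C)
        intra, _ = self.intra_rnn(intra)
        intra = intra.reshape(B, S, P, C).permute(0, 3, 2, 1)
        x = x + self.intra_norm(intra)               # residual connection
        # strided context: self-attention over the inter-chunk axis (length S)
        inter = x.permute(0, 2, 3, 1).reshape(B * P, S, C)
        inter = self.inter_attn(inter)               # has internal residuals
        return inter.reshape(B, P, S, C).permute(0, 3, 1, 2)

layer = STRNNLayer()
print(layer(torch.randn(2, 64, 100, 50)).shape)
```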
Fig. 3. The sandwich-norm (pre-norm) structure used in the proposed model: LN1 → MHA → LN2 → FC → GELU → FC → LN3. LN denotes layer normalization, MHA multi-head attention, and FC a fully connected layer. We use GELU [15] as the activation function.
Sandwich-Norm Transformer Layer:
To train a good Transformer model, it is important to set up the warm-up steps carefully, during which the learning rate gradually increases. The number of warm-up steps needs to be tuned, which makes transformers hard to train. However, recent work [13] shows that the warm-up may not be necessary, since changing the order of the normalization layers can solve this problem. When using pre-norm, warm-up steps become unnecessary. Our pre-norm transformer layer is shown in Figure 3. An additional layer normalization is added at the end to avoid the whole transformer layer being bypassed, as in [14]. We call this a sandwich-norm transformer layer. We empirically found that this structure significantly improves convergence speed.
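A minimal sketch of the sandwich-norm layer of Figure 3 (our own rendering of the description; the model and feed-forward dimensions are taken from Section 5): pre-norm before the attention and feed-forward blocks, plus a final layer normalization so the layer cannot be entirely bypassed through the residual path.

```python
import torch
import torch.nn as nn

class SandwichNormTransformerLayer(nn.Module):
    """Pre-norm transformer layer with an extra output LayerNorm:
    LN1 -> MHA -> residual -> LN2 -> FC -> GELU -> FC -> residual -> LN3."""

    def __init__(self, d_model: int = 64, heads: int = 4, ff_dim: int = 256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, ff_dim), nn.GELU(),
                                nn.Linear(ff_dim, d_model))
        self.ln3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.mha(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.ln2(x))
        return self.ln3(x)

layer = SandwichNormTransformerLayer()
print(layer(torch.randn(2, 50, 64)).shape)   # (batch, sequence, d_model)
```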
Dual-Temporal Convolutional Encoding:
When using transformers, the self-attention module treats all the keys equally, and the attention process is order-agnostic. To inject position information into the self-attention process, Transformers combine a positional encoding with the model inputs. There are different options for positional encoding: 1) Sinusoid positional encoding: it uses sinusoid functions to represent different positions in the sequence [5]; this encoding can be used for either absolute or relative positions. 2) Frame stacking: instead of using only one frame per position in the input, this method stacks n contextual frames together and creates a new frame from them [14]; this is mainly used as a relative positional encoding. 3) Convolutional encoding: it uses a convolutional neural network to encode the input [16]. This is similar to frame stacking, since after the encoding each frame contains contextual information; the difference is that this scheme makes the positional encoding trainable. A recent study on speech recognition with transformers shows that convolutional encoding works better than the other two methods [14]. Thus, TransMask chooses the convolutional encoding.

Different from the 2D CNN modules used in [14], our method does not perform the convolution over the time dimension and the filter-bank dimension. Instead, it splits the input sequence into overlapping chunks, and performs 2D convolution over the intra-chunk dimension and the inter-chunk dimension. Since the 2D convolution is performed over two temporal dimensions, we call this a dual-temporal convolutional encoding. Each frame of the positional encoding contains not only positional information about the local context, but also positional information about the strided context.
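The following is a minimal sketch of the dual-temporal convolutional encoding (our own interpretation of the description; the kernel size, channel count, and number of blocks follow the specification in Section 5, and the chunked 4D input layout is an assumption): each block applies a 2D convolution over the intra-chunk and inter-chunk time axes, followed by global layer normalization and GELU.

```python
import torch
import torch.nn as nn

class DualTemporalConvEncoding(nn.Module):
    """Trainable positional encoding computed by 2D convolutions over the
    two temporal axes of the chunked input (batch, channels, P, S)."""

    def __init__(self, channels: int = 64, n_blocks: int = 3, kernel: int = 3):
        super().__init__()
        blocks = []
        for _ in range(n_blocks):
            blocks += [
                nn.Conv2d(channels, channels, kernel, padding=kernel // 2),
                nn.GroupNorm(1, channels),   # global layer normalization
                nn.GELU(),
            ]
        self.net = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # each output frame now carries intra-chunk and inter-chunk context
        return self.net(x)

enc = DualTemporalConvEncoding()
x = torch.randn(2, 64, 100, 50)   # (batch, channels, intra axis P, inter axis S)
print(enc(x).shape)
```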
Loss Function:
For the training objective, we use the SI-SNR with utterance-level Permutation Invariant Training (uPIT), as used in [4].
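For reference, below is a minimal sketch of the standard SI-SNR definition (our own implementation, not necessarily the exact code used for the paper). With uPIT, the permutation with the highest SI-SNR per utterance is selected, e.g., by plugging the negative SI-SNR into the upit_loss sketch given in Section 2.

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8):
    """Scale-invariant SNR in dB for tensors of shape (..., time)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # project the estimate onto the target to remove any scaling
    s_target = (estimate * target).sum(-1, keepdim=True) * target \
               / (target.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    return 10 * torch.log10(
        s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

# training minimizes the negative SI-SNR, e.g. loss = -si_snr(est, ref).mean()
print(si_snr(torch.randn(2, 16000), torch.randn(2, 16000)).shape)
```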
5. EXPERIMENTS

Dataset:
Existing studies on speech separation usually use a mixture version of the Wall Street Journal corpus (WSJ0), known as WSJ0-2mix [17]. However, WSJ0 only contains 101 different speakers and 25 hours of training data; the number of speakers is too small for evaluating the generalization ability of the models. LibriMix [18] has recently been proposed as a new benchmark built on the open-source LibriSpeech dataset [19]. It contains 1,172 speakers and 465 hours of training data. In this paper, we use the version of LibriMix with 2-speaker mixtures, Libri2Mix, and we only use the train-360 part of the LibriSpeech dataset. The code for mixture generation is available on GitHub at https://github.com/JorisCos/LibriMix.

Model specifications:
We test the proposed model with 4 and 6 layers of STRNN modules, denoted as TransMask-4 and TransMask-6, both with a learnable time-frequency basis. The dimension of the strided transformer is set to 64 for self-attention, and the feed-forward network has 256 nodes in its hidden layer. The convolutional encoding contains 3 blocks, each consisting of a convolutional layer with kernel size 3, global layer normalization and GELU activation; the numbers of input and output channels are both 64. The output of the separation module is passed through a sigmoid activation function, and the result is multiplied with the input to obtain the learnable-basis representation of the clean source predictions. The code is open-sourced at https://github.com/Speech-AI/SpeechX and is based on the asteroid project [20].
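As a sketch of how the separator output is turned into source estimates (our own illustration of the description above; the names encoder, separator, and decoder are placeholders for the learnable basis, the STRNN stack, and the waveform decoder):

```python
import torch

def separate(mixture, encoder, separator, decoder, n_src=2):
    """mixture: (batch, time). The sigmoid mask produced by the separator is
    multiplied with the encoded mixture to obtain each clean-source estimate."""
    mix_rep = encoder(mixture)                    # (batch, channels, frames)
    masks = torch.sigmoid(separator(mix_rep))     # (batch, n_src, channels, frames)
    est_reps = masks * mix_rep.unsqueeze(1)       # masked basis representations
    return torch.stack([decoder(est_reps[:, i]) for i in range(n_src)], dim=1)

# toy usage with identity-like stand-ins for the real modules
enc = lambda x: x.unsqueeze(1).repeat(1, 64, 1)
sep = lambda r: r.unsqueeze(1).repeat(1, 2, 1, 1)
dec = lambda r: r.mean(dim=1)
print(separate(torch.randn(3, 16000), enc, sep, dec).shape)   # (3, 2, 16000)
```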
Experiment results:
We compare our model with DPRNN in three aspects: SDR, model size and inference speed. As shown in Table 1, TransMask-6 achieves the best SDR at 16.3, outperforming DPRNN by a significant margin. TransMask-4 and TransMask-6 are 50% and 40% smaller than DPRNN, respectively, and the inference is at least two times faster. The original DPRNN paper claims O(√N) processing time (N stands for the sequence length) if the chunk size and the number of chunks are carefully set close to each other. However, this can only be controlled during training by fixing the length of the input audio; at inference time, the length of the input audio is not controllable. Since the chunk size is a constant, the inference time of DPRNN remains O(N) in practice. In comparison, our proposed model always uses constant time due to the parallel nature of self-attention, as long as the inference device has enough resources. The results are reported in Table 2. They were obtained on a Ksyun Virtual Senior CPU (ksyun-cpu64-senior), using the real time factor (rtf) as the metric. We augment the test dataset by simply repeating each audio clip multiple times. As the length of the audio grows to 4 times the original, DPRNN has a relatively stable rtf, while the rtf of TransMask-6 keeps dropping since the inference runs in nearly constant time. The ratio shows that, as the audio length increases, the advantage of TransMask becomes more significant: it is even 4 times faster than DPRNN when we expand the test audios to 4 times their length. This trend continues until the CPU resources reach their limit for parallel computation, when the audio lengths are expanded to 8 times.

Table 1. Model size and SDR comparison between previous work and the proposed model

Model         Size    SDR
Conv-TasNet   5.1M    13.5
DPRNN         2.6M    15.6
TransMask-4   1.24M   15.5
TransMask-6   1.62M   16.3
Table 2. The real time factor (rtf) when expanding the length of the test audio, measured on a CPU of a Ksyun cloud server

Model         original   2x      4x      8x
DPRNN         1.08       0.84    0.69    0.70
TransMask-6   0.56       0.31    0.17    0.21
Ratio         51.9%      36.9%   24.6%   30.0%
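For clarity, the real time factor is simply the wall-clock inference time divided by the audio duration, so rtf < 1 means faster-than-real-time separation. A minimal sketch of such a measurement follows (our own illustration; the model callable and the sample rate are hypothetical):

```python
import time
import torch

def real_time_factor(model, audio: torch.Tensor, sample_rate: int = 8000) -> float:
    """rtf = inference wall-clock time / audio duration."""
    duration = audio.shape[-1] / sample_rate
    start = time.perf_counter()
    with torch.no_grad():
        model(audio)
    return (time.perf_counter() - start) / duration
```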
6. CONCLUSION
In this paper, we combine the strengths of the Transformer and the RNN into one architecture for better long-term and short-term dependency handling. We demonstrate that the Transformer is effective in reducing the size of the model, improving the inference efficiency, and maintaining or even improving the quality of the output speech audio.

7. REFERENCES

[1] Yi Luo and Nima Mesgarani, "Tasnet: time-domain audio separation network for real-time, single-channel speech separation," in ICASSP. IEEE, 2018, pp. 696–700.

[2] Yi Luo and Nima Mesgarani, "Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.

[3] Yuzhou Liu and DeLiang Wang, "Divide and conquer: A deep casa approach to talker-independent monaural speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2092–2102, 2019.

[4] Yi Luo, Zhuo Chen, and Takuya Yoshioka, "Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 46–50.

[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[6] Iz Beltagy, Matthew E Peters, and Arman Cohan, "Longformer: The long-document transformer," arXiv preprint arXiv:2004.05150, 2020.

[7] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever, "Generating long sequences with sparse transformers," arXiv preprint arXiv:1904.10509, 2019.

[8] Jingjing Chen, Qirong Mao, and Dong Liu, "Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation," arXiv preprint arXiv:2007.13975, 2020.

[9] Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong, "Attention is all you need in speech separation," arXiv preprint arXiv:2010.13154, 2020.

[10] Sanyuan Chen, Yu Wu, Zhuo Chen, Jinyu Li, Chengyi Wang, Shujie Liu, and Ming Zhou, "Continuous speech separation with conformer," arXiv preprint arXiv:2008.05773, 2020.

[11] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.

[12] Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang, "R-transformer: Recurrent neural network enhanced transformer," arXiv preprint arXiv:1907.05572, 2019.

[13] Toan Q Nguyen and Julian Salazar, "Transformers without tears: Improving the normalization of self-attention," arXiv preprint arXiv:1910.05895, 2019.

[14] Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, et al., "Transformer-based acoustic modeling for hybrid speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6874–6878.

[15] Dan Hendrycks and Kevin Gimpel, "Gaussian error linear units (gelus)," arXiv preprint arXiv:1606.08415, 2016.

[16] Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, "Transformers with convolutional context for asr," arXiv preprint arXiv:1904.11660, 2019.

[17] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in ICASSP. IEEE, 2016, pp. 31–35.

[18] Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent, "Librimix: An open-source dataset for generalizable speech separation," arXiv preprint arXiv:2005.11262, 2020.

[19] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: an asr corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.

[20] Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, and Emmanuel Vincent, "Asteroid: the PyTorch-based audio source separation toolkit for researchers," in Proc. Interspeech, 2020.