CatNet: music source separation system with mix-audio augmentation
Xuchen Song*, Qiuqiang Kong*, Xingjian Du, Yuxuan Wang

ByteDance, Shanghai, China
{xuchen.song, kongqiuqiang, duxingjian.real, wangyuxuan.11}@bytedance.com

* Equal contribution.

ABSTRACT
Music source separation (MSS) is the task of separating a music piece into individual sources, such as vocals and accompaniment. Recently, neural network based methods have been applied to address the MSS problem, and can be categorized into spectrogram based and time-domain based methods. However, there is a lack of research on using the complementary information of spectrogram and time-domain inputs for MSS. In this article, we propose a CatNet framework that concatenates a UNet separation branch using the spectrogram as input and a WavUNet separation branch using the time-domain waveform as input for MSS. We propose an end-to-end and fully differentiable system that incorporates spectrogram calculation into CatNet. In addition, we propose a novel mix-audio data augmentation method that randomly mixes audio segments from the same source as augmented audio segments for training. Our proposed CatNet MSS system achieves a state-of-the-art vocals separation source-to-distortion ratio (SDR) of 7.54 dB, outperforming MMDenseNet with 6.57 dB, evaluated on the MUSDB18 dataset.
Index Terms — Music source separation, CatNet, mix-audio data augmentation.
1. INTRODUCTION
Music source separation (MSS) is the task of separating a music piece into individual sources, such as vocals and accompaniment. MSS has several applications in music information retrieval [1], including melody extraction, vocal pitch correction and music transcription. Single-channel MSS is a challenging problem because the number of sources to be separated is larger than the number of channels of the mixtures. For example, audio recordings from the MUSDB18 dataset [2] have two channels, while the separation task is to separate audio recordings into four channels including vocals, drums, bass and other sources.

Early works on MSS include unsupervised separation methods, such as harmonic structure modeling to separate sources [3], and non-negative matrix factorizations (NMFs) to factorize multi-channel or single-channel spectrograms [4, 5]. Independent component analysis (ICA) has been proposed for signal separation in [6]. Recently, neural network based methods have been applied to MSS, such as fully connected neural networks [7, 8], recurrent neural networks (RNNs) [9, 10, 11], and convolutional neural networks (CNNs) [12, 13]. Those neural network based methods usually use spectrograms of audio recordings as input features, and build neural networks to separate the spectrograms of target sources. In addition to the spectrogram based methods, several time-domain MSS systems have been proposed, such as WavUNet [14], Conv-TasNet [15] and Demucs [16]. Those time-domain MSS systems use the waveforms of audio recordings as input features, and build neural networks to separate the waveforms of target sources. However, many of those time-domain MSS systems do not outperform spectrogram based MSS systems.

In this article, we propose an end-to-end MSS system called CatNet to combine the advantages of spectrogram and time-domain systems for MSS, where
CatNet is the abbreviation of concatenation of networks. A CatNet consists of a UNet [12] branch using the spectrogram as input and a WavUNet [14] branch using the time-domain waveform as input. To build a consistent input format for the UNet branch and the WavUNet branch, we wrap the Fourier and inverse Fourier transforms into differentiable matrix multiplication operations. Then, we sum the UNet and WavUNet outputs in the time-domain to obtain the separated sources. Therefore, CatNet utilizes the complementary information of spectrogram and time-domain systems for MSS, and is fully differentiable so that it can be trained in an end-to-end way.

Data augmentation is important for improving the performance of MSS systems when the duration of the training data is limited to a few hours, as in the MUSDB18 dataset [2]. However, previous data augmentation methods [10, 17], including adding filters, remixing audio recordings, swapping left and right channels, shifting pitches, and scaling and stretching audio recordings, provide only marginal improvements to the performance of MSS systems [10, 17]. We propose a novel mix-audio data augmentation method that randomly mixes audio segments from the same source as augmented audio segments.

This paper is organized as follows. Section 2 introduces neural network based MSS systems. Section 3 introduces our proposed CatNet system. Section 4 introduces our proposed mix-audio data augmentation strategy. Section 5 shows experimental results, and Section 6 concludes this work.

2. SOURCE SEPARATION METHODS
Previous neural network based MSS systems can be categorized into spectrogram based and time-domain based methods. We introduce those systems as follows.
Recently, neural networks have been applied to MSS, including fully connected neural networks [7, 18], recurrent neural networks (RNNs) [9, 10] and convolutional neural networks (CNNs) [12, 13, 19]. A neural network based MSS system learns a mapping from the spectrogram of a mixture to the spectrogram of a target source. To begin with, we denote the input mixture as $x$, and the source to separate as $s$. We apply a short-time Fourier transform (STFT) $F$ on $x$ and $s$ to calculate their STFTs $X = F(x)$ and $S = F(s)$ respectively. Both $X$ and $S$ are complex matrices with shapes of $T \times F$, where $T$ and $F$ are the number of frames and frequency bins respectively. We call the magnitude $|X|$ the spectrogram. A MSS system takes a spectrogram $|X|$ as input, and outputs a separated spectrogram $|\hat{S}|$:

$$|\hat{S}| = f_{\mathrm{sp}}(|X|), \quad (1)$$

where $f_{\mathrm{sp}}$ is a mapping modeled by a set of learnable parameters. For example, $f_{\mathrm{sp}}$ can be a RNN, a CNN, or a UNet [12]. In this article, we model (1) with a mask based approach [20]:

$$|\hat{S}| = m \odot |X|, \quad (2)$$

where $m \in [0, 1]^{T \times F}$ is a mask to be estimated by the RNN, CNN or UNet, and $\odot$ denotes element-wise multiplication. Then, the separated STFT $\hat{S}$ can be obtained by:

$$\hat{S} = |\hat{S}| e^{i \angle X}, \quad (3)$$

where $\angle X$ is the phase of $X$. Finally, an inverse STFT (ISTFT) $F^{-1}$ is applied to $\hat{S}$ to output the separated waveform $\hat{s}$.

There are several disadvantages of spectrogram based MSS methods. First, spectrogram based MSS systems only predict the spectrograms $|\hat{S}|$ of sources, but do not estimate the phases of the separated sources. Therefore, the performance of those MSS systems is limited. Recently, several works have investigated estimating the phases of sources for MSS [21]. Secondly, the spectrogram may not be an optimal representation of a time-domain signal. For example, the STFT performs well in analysing stationary signals in a short time window, while signals are not always stationary. Time-domain MSS systems have been proposed to address those problems. Both the input and output of time-domain systems are waveforms instead of spectrograms. Those time-domain systems, such as WavUNet [14], Conv-TasNet [15], and Demucs [16], have the advantage of not needing to estimate phases for MSS. Similar to the spectrogram based MSS systems, we build a mapping $f_{\mathrm{wav}}$ from the mixture signal $x$ to the separated source $\hat{s}$:

$$\hat{s} = f_{\mathrm{wav}}(x), \quad (4)$$

where $f_{\mathrm{wav}}$ is a regression function modeled by a set of learnable parameters. Equation (4) does not need to apply the STFT and ISTFT for separating sources. In this article, we model $f_{\mathrm{wav}}$ with a WavUNet [14].
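As a minimal illustration of the mask-based formulation in (1)–(3), the sketch below applies a mask network to a mixture spectrogram and reuses the mixture phase for reconstruction. It is a simplified example, not the paper's implementation: it uses PyTorch's built-in STFT/ISTFT rather than the differentiable convolutions described in Section 3, and `mask_net` stands for any hypothetical module producing a mask in $[0, 1]$. The window and hop sizes follow the experimental settings reported in Section 5.

```python
import torch

def separate_with_mask(x, mask_net, n_fft=2048, hop=441):
    """Mask-based separation (eqs. 1-3): |S_hat| = m * |X|, mixture phase reused.

    x: (batch, samples) mono waveform.
    mask_net: any module mapping |X| to a mask of the same shape with values in [0, 1].
    """
    window = torch.hann_window(n_fft, device=x.device)
    X = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
    mag, phase = X.abs(), X.angle()          # spectrogram |X| and phase angle(X)
    m = mask_net(mag)                        # estimated mask (eq. 2)
    S_hat = torch.polar(m * mag, phase)      # S_hat = |S_hat| e^{i angle(X)}  (eq. 3)
    return torch.istft(S_hat, n_fft=n_fft, hop_length=hop,
                       window=window, length=x.shape[-1])
```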
3. CATNET FOR MUSIC SOURCE SEPARATION

3.1. CatNet
The motivation for proposing CatNet is to combine the advantages of spectrogram based and time-domain based methods. UNet architectures have been shown to perform well in capturing the time-frequency patterns of spectrograms for MSS. However, the spectrogram may not be the optimal representation for audio recordings. On the other hand, WavUNet has the advantage of learning adaptive kernels to represent audio recordings in the time-domain waveform. Our proposed CatNet is designed to combine the advantages of UNet and WavUNet by concatenating their time-domain outputs. CatNet consists of two branches. One branch is a UNet that uses differentiable operations to calculate the spectrogram, and outputs time-domain waveforms as described in Section 3.2. The UNet branch is designed to learn robust frequency patterns of sounds. The other branch is a WavUNet that takes the waveform of a mixture as input and outputs the waveforms of the separated sources. WavUNet provides flexible and learnable transforms by using one-dimensional convolutions. The UNet branch and the WavUNet branch provide complementary information for MSS. We denote the outputs of the UNet and WavUNet branches as $\hat{s}_{\mathrm{U}}$ and $\hat{s}_{\mathrm{WU}}$ respectively. The output of CatNet can be written as:

$$\hat{s} = \hat{s}_{\mathrm{U}} + \hat{s}_{\mathrm{WU}}. \quad (5)$$

Equation (5) shows that the output of a CatNet is the addition of the outputs of UNet and WavUNet in the time-domain. In training, the parameters of UNet and WavUNet are optimized jointly. The previous UNet system [12] uses the spectrogram as both input and output, which can not be used in our end-to-end CatNet system. Therefore, we propose to incorporate UNet into our end-to-end time-domain MSS system as follows.

The STFT can be calculated by applying the discrete Fourier transform (DFT) on the frames of audio clips, where each frame contains several audio samples in a short time window. An audio recording $x$ is split into $T$ frames, where each frame contains $N$ samples. We denote the $t$-th frame as $x_t$, and its DFT as $X_t$. The calculation of the DFT can be written as:

$$X_t = D x_t, \quad (6)$$

where $D$ is an $N \times N$ complex matrix with elements $D_{kn} = e^{-i \frac{2\pi}{N} kn}$. We decompose the DFT matrix $D$ into a real part $D^{R}$ and an imaginary part $D^{I}$. Then, equation (6) can be written as:

$$X_t = D^{R} x_t + i D^{I} x_t. \quad (7)$$

Equation (7) shows that the STFT can be calculated by matrix multiplication. We apply one-dimensional convolutional operations to $x$ to parallelize the calculation of the STFT over several frames: $X^{R} = \mathrm{conv}^{R}(x)$ and $X^{I} = \mathrm{conv}^{I}(x)$, where $\mathrm{conv}^{R}$ and $\mathrm{conv}^{I}$ are one-dimensional convolutions with parameters $D^{R}$ and $D^{I}$ respectively. The strides of $\mathrm{conv}^{R}$ and $\mathrm{conv}^{I}$ are set to the hop size in samples between adjacent frames. Then, the STFT of $x$ is calculated by $X = X^{R} + i X^{I}$. In separation, we calculate the separated spectrogram $|\hat{S}|$ using the system described in Section 2.1. Then, the estimated STFT $\hat{S}$ can be obtained by:

$$\hat{S} = |\hat{S}| \cos \angle X + i |\hat{S}| \sin \angle X, \quad (8)$$

where $\angle X$ is the angle of $X$. Similar to the calculation of the STFT, the ISTFT can be calculated by applying the inverse DFT (IDFT) on the frames of $\hat{S}$ [22]. We denote the $t$-th frame of $\hat{S}$ as $\hat{S}_t$, and the $t$-th frame of the time-domain IDFT output as $\hat{s}_t$. The calculation of $\hat{s}_t$ can be written as:

$$\hat{s}_t = D^{-1} \hat{S}_t, \quad (9)$$

where $D^{-1}$ is an $N \times N$ complex IDFT matrix with elements $D^{-1}_{kn} = \frac{1}{N} e^{i \frac{2\pi}{N} kn}$, and $N$ is the number of samples in a frame. We decompose the IDFT matrix $D^{-1}$ and the estimated $\hat{S}$ into real and imaginary parts: $D^{-1} = D^{-1}_{R} + i D^{-1}_{I}$ and $\hat{S}_t = \hat{S}_{R,t} + i \hat{S}_{I,t}$.
Considering that the reconstructed signal is a real signal, the signal $\hat{s}_t$ can be reconstructed by:

$$\hat{s}_t = D^{-1}_{R} \hat{S}_{R,t} - D^{-1}_{I} \hat{S}_{I,t}. \quad (10)$$

Similar to the calculation of the STFT, we apply one-dimensional convolutional operations to $\hat{S}$ to parallelize the reconstruction of the time-domain signals $\hat{s}_{1,...,T} = \{\hat{s}_1, ..., \hat{s}_T\}$:

$$\hat{s}_{1,...,T} = \mathrm{conv}^{-1}_{R}(\hat{S}_{R}) - \mathrm{conv}^{-1}_{I}(\hat{S}_{I}), \quad (11)$$

where $\mathrm{conv}^{-1}_{R}$ and $\mathrm{conv}^{-1}_{I}$ are one-dimensional convolutions with parameters $D^{-1}_{R}$ and $D^{-1}_{I}$ respectively. Finally, a transposed convolution with a stride equal to the hop size in samples between adjacent frames is used to reconstruct the time-domain waveform with overlap-add:

$$\hat{s} = \mathrm{tconv}(\hat{s}_{1,...,T}). \quad (12)$$

All STFT, ISTFT and overlap-add operations are built into the end-to-end CatNet, and are fully differentiable.
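The sketch below illustrates the core idea of (6)–(7): expressing the framed DFT as fixed one-dimensional convolutions whose kernels are the real and imaginary parts of $D$. It is a simplified, assumption-laden example (no analysis window, full two-sided DFT, and function names of our own choosing), not the exact CatNet implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def dft_conv_kernels(n_fft):
    """Real and imaginary DFT matrices D^R, D^I as conv1d kernels of shape (n_fft, 1, n_fft)."""
    n = np.arange(n_fft)
    D = np.exp(-1j * 2 * np.pi * n[:, None] * n[None, :] / n_fft)  # D_{kn} = e^{-i 2*pi*k*n / N}
    DR = torch.tensor(D.real, dtype=torch.float32).unsqueeze(1)
    DI = torch.tensor(D.imag, dtype=torch.float32).unsqueeze(1)
    return DR, DI

def stft_by_conv(x, n_fft=2048, hop=441):
    """Compute X = X^R + i X^I by sliding the DFT matrices over the waveform with stride = hop."""
    DR, DI = dft_conv_kernels(n_fft)
    x = x.unsqueeze(1)                    # (batch, 1, samples)
    XR = F.conv1d(x, DR, stride=hop)      # real part, shape (batch, n_fft, frames)
    XI = F.conv1d(x, DI, stride=hop)      # imaginary part
    return XR, XI

# Example usage: spectrogram magnitude from the two convolution outputs.
x = torch.randn(1, 44100)
XR, XI = stft_by_conv(x)
mag = torch.sqrt(XR ** 2 + XI ** 2)
```

The ISTFT side of (10)–(12) follows the same pattern with the IDFT matrices as kernels and a transposed convolution for overlap-add; in CatNet the waveform produced by this UNet branch is then simply added to the WavUNet branch output, as in (5).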
4. MIX-AUDIO DATA AUGMENTATION
The amount of training data is important for training MSS systems. The duration of publicly available MSS datasets such as MUSDB18 [23, 2] is limited to a few hours. Data augmentation is a technique to increase the variety of the training data. Previous works have applied several data augmentation methods for MSS [10, 17], including adding filters to songs, remixing sources from different songs, swapping left and right channels, shifting pitches, and scaling and stretching audio recordings with random amplitudes [10, 17]. However, those data augmentation methods have marginal influence on training MSS systems. In this work, we propose a novel mix-audio data augmentation technique for MSS. Different from the previous instrument remixing methods [10, 17] that remix instruments from different songs, our proposed mix-audio data augmentation method randomly mixes several audio segments from the same source as an augmented segment for that source:

$$s_{\mathrm{mix}} = \sum_{j=1}^{J} s_j, \quad (13)$$

where $s_j$, $j = 1, ..., J$ are audio segments from the same source, and $J$ is the number of segments to be mixed. We denote the mixed source as $s_{\mathrm{mix}}$. If the $s_j$ are vocals, then their addition $s_{\mathrm{mix}}$ is also vocals. The mix-audio data augmentation provides a large number of combinations for one source. We denote $s^{(i)}_{\mathrm{mix}}$ as the $i$-th source, and $I$ as the number of sources to separate. The input mixture to a CatNet is calculated by:

$$x_{\mathrm{mix}} = \sum_{i=1}^{I} s^{(i)}_{\mathrm{mix}}. \quad (14)$$

Then, MSS becomes the problem of constructing a mapping $f$ from $x_{\mathrm{mix}}$ to the separated $\hat{s}_{\mathrm{mix}}$ by using the CatNet described in (5). With mix-audio data augmentation, separating $\hat{s}_{\mathrm{mix}}$ from $x_{\mathrm{mix}}$ is a more challenging problem than that without mix-audio data augmentation. The effectiveness of mix-audio data augmentation comes from the fact that the addition of several audio segments from the same source also belongs to that source. Intuitively, a system that is able to separate multiple vocals from a mixture is also able to separate a single vocal from the mixture.
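A minimal sketch of the mix-audio augmentation in (13)–(14) is given below, assuming each source is available as a list of equal-length waveform segments; the function name, the `segments_per_source` layout and the choice of sampling with replacement are our assumptions rather than details specified in the paper.

```python
import numpy as np

def mix_audio_augment(segments_per_source, J=2, rng=None):
    """Build one augmented training pair (x_mix, targets).

    segments_per_source: dict mapping a source name (e.g. 'vocals') to a list of
    numpy waveform segments of equal length. For each source, J randomly chosen
    segments are summed (eq. 13); the augmented sources are then summed over all
    sources to form the input mixture (eq. 14).
    """
    rng = rng or np.random.default_rng()
    targets = {}
    for name, segments in segments_per_source.items():
        picked = rng.choice(len(segments), size=J, replace=True)
        targets[name] = sum(segments[i] for i in picked)   # s_mix for this source
    x_mix = sum(targets.values())                          # input mixture x_mix
    return x_mix, targets
```

The separation targets for training are the per-source sums `targets`, so the network learns to separate the (possibly multi-segment) vocals, drums, bass and other sources from the augmented mixture.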
5. EXPERIMENTS
We evaluate our proposed CatNet system on the MUSDB18 dataset [2, 23]. The MUSDB18 dataset consists of 150 full-length music tracks of different genres, including isolated vocals, drums, bass and other sources. The training and testing sets contain 100 and 50 songs respectively. All music recordings are stereophonic with a sampling rate of 44.1 kHz. In training, music pieces are split into 3-second audio segments.

Table 1. SDRs of previous MSS systems.

                    Vocals   Drums   Bass    Other
  WavUNet [14]      3.05     4.16    3.17    2.24
  Demucs [16]       6.21     6.50    6.21    3.80
  MMDenseNet [27]   6.57     6.40    5.14    4.13
  DAE [28]          5.74     4.66    3.67    3.40
  CatNet + aug      7.54     5.85    5.01    4.67

Random track mixing is used in our systems as a default setting. For mix-audio data augmentation, we randomly select audio segments from a source to obtain $s_{\mathrm{mix}}$ as described in (13). Then, we randomly mix the $s_{\mathrm{mix}}$ from different sources to constitute the input mixture $x_{\mathrm{mix}}$ as described in (14).

A CatNet consists of a UNet branch and a WavUNet branch. For the UNet branch, the spectrogram is extracted using the one-dimensional convolutions described in Section 3.2 with a Hann window size of 2048 and a hop size of 441. The UNet branch consists of six encoder and six decoder blocks. Each encoder block consists of two convolutional layers and an average pooling layer. Each convolutional layer consists of a linear convolution, batch normalization [24] and a ReLU [25] nonlinearity. Each decoder block consists of two convolutional layers that are symmetric to the encoder. In UNet, the output of each decoder is concatenated with the output of the encoder from the same hierarchy, and is input to a transposed convolutional layer to upsample the feature maps. The numbers of channels of the encoder blocks are 32, 64, 128, 256, 512 and 1024, and the numbers of channels of the decoder blocks are symmetric to the encoder blocks. The WavUNet branch consists of six encoder blocks and six decoder blocks. The kernel size of all one-dimensional convolutional layers is 3. The average pooling layers have sizes of 4, and the transposed convolutional layers have strides of 4. The numbers of channels in the encoder blocks are 32, 64, 128, 256, 512 and 1024, and the decoder blocks are symmetric to the encoder blocks. We use an Adam optimizer [26] with a learning rate of 0.001 and a batch size of 12 for training. We apply a mean absolute error (MAE) loss $\| s - \hat{s} \|_1$ between the estimated and target time-domain waveforms for training.

We evaluate the MSS performance with the source-to-distortion ratio (SDR) [29] using the BSSEval V4 toolkit [23]. The median SDR is used for evaluating different MSS systems [23]. Table 1 shows the results of previous MSS systems. The time-domain system WavUNet [14] achieves an SDR of 3.05 dB in vocals separation, which is improved by Demucs [16] to a median SDR of 6.21 dB. The spectrogram based method MMDenseNet [27] outperforms the time-domain methods, with a median SDR of 6.57 dB.
Table 2. SDRs of our proposed MSS systems with mix-audio data augmentation.
                     Vocals   Acc.    Drums   Bass    Other
  UNet (sp)          7.04     15.06   5.72    4.28    4.38
  WavUNet            5.57     12.68   5.03    5.42    3.38
  WavUNet + aug      6.08     13.43   5.15    4.65    3.49
  UNet (wav)         7.13     15.47   5.96    4.37    4.50
  UNet (wav) + aug   7.17     15.20   5.86    3.79    4.18
  CatNet             6.96     15.13   5.91    6.10    4.99
  CatNet + aug       7.54     15.18   5.85    5.01    4.67

A denoising autoencoder (DAE) system [28] achieves an SDR of 5.74 dB. The bottom row shows that our proposed CatNet with mix-audio augmentation achieves a state-of-the-art vocals separation SDR of 7.54 dB, outperforming the other systems.

All systems in Table 2 are our re-implemented systems. The first row of Table 2 shows that the UNet system [12] trained with a spectrogram loss achieves a vocals SDR of 7.04 dB and an accompaniment SDR of 15.06 dB. The second row shows that WavUNet [14] achieves a vocals separation SDR of 5.57 dB, underperforming the spectrogram based method. The text "+ aug" indicates training with mix-audio data augmentation. The third row shows that with mix-audio data augmentation, the vocals SDR is improved to 6.08 dB. The fourth row shows that our proposed spectrogram based UNet trained with the time-domain MAE loss improves the vocals separation SDR to 7.13 dB, and further improves the SDR to 7.17 dB with mix-audio data augmentation. The seventh row shows that our proposed CatNet with mix-audio data augmentation achieves a state-of-the-art vocals separation SDR of 7.54 dB, largely outperforming the other systems, indicating the effectiveness of mix-audio data augmentation for vocals separation. Our system also achieves accompaniment, drums, bass and other separation SDRs of 15.18 dB, 5.85 dB, 5.01 dB and 4.67 dB respectively.
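For intuition about the metric, the sketch below computes a simplified, purely time-domain SDR of the form $10 \log_{10}(\|s\|^2 / \|s - \hat{s}\|^2)$. The numbers in Tables 1 and 2 come from the full BSSEval V4 procedure (the museval toolkit) with median aggregation over windows and tracks, so this is only a rough sanity check, not the evaluation protocol used above.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-10):
    """Simplified SDR in dB between a reference source and its estimate (same shape)."""
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))
```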
6. CONCLUSION
We propose an end-to-end CatNet for music source separation. The CatNet consists of a UNet branch and a WavUNet branch to combine the advantages of both spectrogram and time-domain MSS systems. CatNet is fully differentiable and can be trained in an end-to-end way. We propose a novel mix-audio data augmentation method that randomly mixes audio segments from the same source as augmented audio segments. Our proposed CatNet with mix-audio data augmentation achieves a state-of-the-art vocals separation SDR of 7.54 dB, and an accompaniment separation SDR of 15.18 dB on the MUSDB18 dataset. In the future, we will investigate using CatNets for general source separation.

7. REFERENCES

[1] M. A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney, "Content-based music information retrieval: Current directions and future challenges," Proceedings of the IEEE, vol. 96, no. 4, pp. 668–696, 2008.
[2] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, and R. Bittner, "The MUSDB18 corpus for music separation," 2017.
[3] Z. Duan, Y. Zhang, C. Zhang, and Z. Shi, "Unsupervised single-channel music source separation by average harmonic structure modeling," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 766–778, 2008.
[4] A. Ozerov and C. Févotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, 2009.
[5] F. Weninger, J. L. Roux, J. R. Hershey, and S. Watanabe, "Discriminative NMF and its application to single-channel source separation," in INTERSPEECH, 2014.
[6] M. E. Davies and C. J. James, "Source separation using single channel ICA," Signal Processing, vol. 87, no. 8, pp. 1819–1832, 2007.
[7] E. M. Grais, M. U. Sen, and H. Erdogan, "Deep neural networks for single channel source separation," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3734–3738.
[8] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, "An overview of lead and accompaniment separation in music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1307–1335, 2018.
[9] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
[10] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 261–265.
[11] F. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - a reference implementation for music source separation," 2019.
[12] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," in International Society for Music Information Retrieval (ISMIR), 2017.
[13] R. Hennequin, A. Khlif, F. Voituret, and M. Moussalam, "Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models," in Late-Breaking/Demo of International Society for Music Information Retrieval Conference (ISMIR), 2019.
[14] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in International Society for Music Information Retrieval (ISMIR), 2018.
[15] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[16] A. Défossez, N. Usunier, L. Bottou, and F. Bach, "Demucs: Deep extractor for music sources with extra unlabeled data remixed," arXiv preprint arXiv:1909.01174, 2019.
[17] L. Prétet, R. Hennequin, J. Royo-Letelier, and A. Vaglio, "Singing voice separation: A study on training data," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 506–510.
[18] A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652–1664, 2016.
[19] Y. Liu, B. Thoshkahna, A. Milani, and T. Kristjansson, "Voice and accompaniment separation in music using self-attention convolutional neural network," arXiv preprint arXiv:2003.08954, 2020.
[20] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
[21] N. Takahashi, P. Agrawal, N. Goswami, and Y. Mitsufuji, "PhaseNet: Discretized phase modeling with deep neural networks for audio source separation," in INTERSPEECH, 2018, pp. 2713–2717.
[22] Y. Wang and D. Wang, "A deep neural network for time-domain signal reconstruction," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4390–4394.
[23] F. Stöter, A. Liutkus, and N. Ito, "The 2018 signal separation evaluation campaign," in International Conference on Latent Variable Analysis and Signal Separation, Springer, 2018, pp. 293–305.
[24] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (ICML), 2015.
[25] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in International Conference on Machine Learning (ICML), 2010.
[26] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[27] N. Takahashi and Y. Mitsufuji, "Multi-scale multi-band DenseNets for audio source separation," in Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 21–25.
[28] J. Liu and Y. Yang, "Denoising auto-encoder with recurrent skip connections and residual regression for music source separation," in International Conference on Machine Learning and Applications (ICMLA), 2018, pp. 773–778.
[29] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation,"