Channel-wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music
Haohe Liu, Lei Xie∗, Jian Wu, Geng Yang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
[email protected], {lxie,jianwu,gengyang}@nwpu-aslp.org

Abstract
This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural network (CNN) based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high-resolution MSS models: high computational cost and weight sharing between distinctly different bands. Specifically, we decompose the input mixture spectra into several bands and concatenate them channel-wise as the model input. The proposed approach enables effective weight sharing within each subband and introduces more flexibility between channels. For comparison purposes, we perform voice and accompaniment separation (VAS) with models of different scales, architectures, and CWS settings. Experiments show that the CWS input is beneficial in many aspects. We evaluate our method on the musdb18hq test set, focusing on the SDR, SIR and SAR metrics. Across all our experiments, CWS enables models to obtain a 6.9% performance gain on the average metrics. With even fewer parameters, less training data, and shorter training time, our MDenseNet with 8-band CWS input still surpasses the original MMDenseNet by a large margin. Moreover, CWS also reduces the computational cost and training time to a large extent.
Index Terms: voice and accompaniment separation, deep learning, subband, music source separation
1. Introduction
Music source separation (MSS) has raised much interest in recent years. The goal of the task is to blindly separate sources, for example vocals, drums, bass and accompaniment, from a mixed track. In this paper, we particularly focus on voice and accompaniment separation (VAS) from a mixture. As a practical tool, separating these two components allows us to remix, suppress or up-mix the sources [1]. VAS can also facilitate automatic transcription, karaoke track generation, as well as music information retrieval [2].

High-resolution music usually sounds better but suffers from high computational cost in the VAS task. For example, 44.1 kHz is a commonly used sample rate for music, while many high-quality formats may be up to 48 kHz or even higher. However, due to the high computational cost, many of the current VAS studies perform downsampling in advance. For instance, the approach using M-U-Net [3] downsamples the audio to 10.88 kHz before processing, and Dense-UNet only works on 16 kHz music in [4]. The downsampling process seriously affects the auditory quality of the separated vocal and accompaniment in practical applications.

*Corresponding Author
Convolutional neural networks (CNN) have shown tremendous success in multiple fields, especially image-related tasks. The input data for these tasks, such as image classification, usually have the property that the position of a certain object is not fixed. Mechanisms like local receptive fields and shared weights [5] make CNNs position invariant, which means that once a feature has been detected, its exact location becomes less important [5]. In audio processing, most of the state-of-the-art (SOTA) MSS models are also based on convolutional networks, like Deep-UNet [6], which have shown considerable improvements over the traditional methods.

Although CNN-based architectures have demonstrated effectiveness on MSS tasks in the frequency domain, they still have apparent limitations. Frequency-spectrogram-based SOTA models trained on high-resolution audio, e.g. TFC-TIF [7], usually take the whole spectrogram as the input feature. In this case, they assume that each frequency band has filter parameters to share and is equally important. However, local patterns usually differ between bands [8], as can be seen in Fig. 1. This means different bands do not necessarily need the same set of filters (in a CNN) for parameter sharing. Hence, treating different frequency bands differently might better facilitate the separation process.
Figure 1: Comparison between different bands. The lower frequency band contains more energy, long sustained sounds, fundamental frequencies and harmonic series, while the higher band, mostly percussive signals and low-energy resonance, contains less energy and less complex information.
Some prior efforts have already emphasized the difference between bands. In [9], Taghia et al. first performed subband decomposition, and then used a hybrid system of empirical mode decomposition [10] and principal component analysis to construct artificial observations from the single mixture. Finally, a synthesis process was used to reconstruct the full-band signal. Takahashi et al. [11] also noticed the problem of global kernel sharing. They pointed out that global weight sharing works well on natural photos, in which a local pattern can appear at any position of the input [8]. But this is not the case for audio. To handle this problem, they designed dedicated MDenseNets for each frequency band and a full-band MDenseNet for the full-band rough structure [8]. This model achieved state-of-the-art performance in the SiSEC 2016 competition [12].

In this paper, we propose a new input format for MSS models in the frequency domain, namely channel-wise subband (CWS) input. Different from the band-dedicated approach in [8, 11], our method can handle both the subbands and the full band in a single model, which makes the CWS-based model highly efficient, less complex, and easier to use. We extensively evaluate our method on MDenseNet and UNet [6] with different scales and three kinds of subband settings on musdb18hq [13]. We also test our approach on a larger internal dataset. Results show that the model with CWS input not only outperforms the model without CWS by a large margin, but also speeds up model training as well as separation.
2. Methodology
Given the raw music signal y(n), our goal is to separate a set of source signals x(n) = {x_j(n)}, j ∈ {1, · · · , J}. In this paper, we focus on the VAS task. Thus J = 2, and x_1(n), x_2(n) are the vocal and accompaniment tracks, respectively. In the time domain, the observed mixture signal is modeled as

y(n) = x_1(n) + x_2(n),   (1)

and in the frequency domain it equals

Y(t, f) = X_1(t, f) + X_2(t, f),   (2)

where t, f index the time and frequency axes, and Y, X_1, X_2 are the short-time Fourier transforms (STFT) of the mixture signal y(n) and the source signals x_1(n), x_2(n). We should also clarify that Y is not a magnitude spectrogram as used in conventional frequency-domain models [14, 6, 4, 15], but the complex-valued STFT matrix. This input format was explored in [7] with considerable improvements. In this section, we will briefly introduce the UNet, MMDenseNet, the proposed analysis-synthesis scheme and the CWS feature format.

Figure 2: UNet structure used in this paper

Fig. 2 depicts the structure of the UNet [16] we use in this paper. The number of different scales is s = 5. It takes the mixture spectra Y(t, f) as input and outputs the time-frequency mask [17] M_j(t, f) for source j, which has the same size as the input. The in-conv block first expands the input to 64 channels. After that, we go through a series of convolution blocks and down/upsampling layers with skip connections. Each convolution block consists of two sequences of a 2D convolution layer, batch normalization and rectified linear units [18]. We use a 3 × 3 kernel size in the convolution layers, with a padding value of 1 to make sure that the frequency and time dimensions are not changed by the convolution operations. We use max-pooling and linear interpolation to scale the feature map down and up by a factor of 2. Skip connections are added between the down and up paths: the input feature of each convolution layer in the up path is concatenated with the same-scale output of the down path. To constrain the mask values to [0, 1], a sigmoid function is applied to the model's output. Finally, the source estimation is obtained by multiplying the mask with the mixture STFT:

X̂_j(t, f) = M_j(t, f) · Y(t, f).   (3)

The final separated music signal x̂_j(n) is obtained through the inverse short-time Fourier transform (iSTFT) using the source estimation in Eq. (3).
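To make the masking step concrete, the following minimal Python sketch applies a given time-frequency mask to the complex mixture STFT and resynthesizes the waveform, as in Eq. (3). The function name and the use of scipy for the STFT/iSTFT are our own illustrative choices rather than the authors' implementation; in the actual model the mask comes from the network's sigmoid output.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(y, mask, sr=44100, fft_ms=32, hop_ms=8):
    """Apply a [0, 1] time-frequency mask to the complex mixture STFT (Eq. (3)) and resynthesize."""
    n_fft = int(sr * fft_ms / 1000)          # frame length in samples (32 ms, see Sec. 2)
    hop = int(sr * hop_ms / 1000)            # hop size in samples (8 ms)
    _, _, Y = stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    X_hat = mask * Y                         # element-wise masking of the complex STFT
    _, x_hat = istft(X_hat, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return x_hat

# Dummy example: an all-ones mask simply reconstructs the mixture.
y = np.random.randn(3 * 44100).astype(np.float32)
_, _, Y = stft(y, fs=44100, nperseg=1411, noverlap=1411 - 352)
x_hat = apply_mask(y, np.ones_like(np.abs(Y)))
```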
Table 1: Details of MMDenseNet and MDenseNet

With only about 0.3M parameters and SOTA performance, MMDenseNet [8] is currently one of the most effective models for audio source separation. It utilizes the characteristics of each frequency band to design different MDenseNets dedicated to each band. The frequency-axis size of a band-dedicated MDenseNet is reduced to f/K, as it only processes data within that specific band. The outputs of these MDenseNets, X̂_k(t, f/K), are concatenated along the frequency axis to recover the full-band prediction X̂_sub(t, f). To capture the global rough structure, MMDenseNet also has an MDenseNet for the full band. The output of the full-band MDenseNet, X̂_full(t, f), is then concatenated with X̂_sub(t, f) and passed through a final dense block to recover the final prediction X̂(t, f). Each MDenseNet inside MMDenseNet can be designed independently according to its function and data complexity. MMDenseNet is highly parameter efficient due to the use of dense blocks [19] and skip connections between dense blocks. Inside a dense block, the input of each dense layer is the concatenation of the previous layers' outputs or the skip connection from the former dense block. This scheme enables the model to reuse features effectively, and thus it is highly parameter efficient. The MMDenseNet we use in this paper has a scale of 4, with three MDenseNets, as shown in Table 1. The details of MMDenseNet are described in [8].

In this paper, we also conduct experiments on MDenseNet, both with and without CWS. The structure of the MDenseNet we use is shown in Table 1. For a fair comparison with MMDenseNet, we add two additional dense blocks to the lower part of MMDenseNet to form the MDenseNet we use. In this way, the scale of our MDenseNet is 5 and the total parameter number is 0.27M.

We follow the method in [20] for subband decomposition and signal reconstruction in the analysis and synthesis procedures. Both analysis and synthesis include a group of finite impulse response (FIR) uniform filter banks. We design three sets of analysis filter banks H_k(e^{jω}) and corresponding synthesis filters G_k(e^{jω}), where K is the number of subbands and k ∈ 1, ..., K indexes them. The design of these filters follows the procedure in [21]. We use y_k(n) to denote the output of H_k(e^{jω}). After downsampling, the sample rate of y_k(n) is 1/K that of y(n).
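As a rough illustration of the analysis stage, the sketch below builds a uniform FIR filter bank by cosine-modulating a lowpass prototype, then filters the mixture and downsamples each band by K. The prototype design here is a simple stand-in for the optimization-based design of [21], and all function and variable names are our own assumptions rather than the authors' code.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def analysis(y, K=4, taps=64):
    """Split y(n) into K uniform subband signals y_k(n), each downsampled by K."""
    proto = firwin(taps, cutoff=1.0 / K)     # lowpass prototype with cutoff pi/K
    n = np.arange(taps)
    subbands = []
    for k in range(K):
        # Cosine-modulate the prototype so band k covers its portion of the spectrum (illustrative).
        h_k = proto * np.cos((2 * k + 1) * np.pi / (2 * K) * (n - (taps - 1) / 2))
        y_k = lfilter(h_k, 1.0, y)[::K]      # filter with H_k, then keep every K-th sample
        subbands.append(y_k)
    return subbands                           # each subband has sample rate fs / K

# Example: split one second of 44.1 kHz audio into K = 4 subbands at 11.025 kHz each.
y = np.random.randn(44100)
bands = analysis(y, K=4)
```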
Figure 3: Channel-wise subband input

Though the total volume of the input feature does not change, the channel-wise concatenation of subbands is a better input format for the frequency-domain model. Here we give a simple explanation. We use β_μ^(l) to denote the output feature map in the l-th layer and μ-th channel, and S_{λ,μ}^(l−1) to stand for the λ-th convolution filter in layer l−1 whose output is the μ-th channel in the l-th layer. A 2D convolution layer can be described as

o_μ^(l)(i, j) = Σ_λ β_λ^(l−1) ∗ S_{λ,μ}^(l−1) = Σ_λ Σ_m Σ_n β_λ^(l−1)(i − m, j − n) S_{λ,μ}^(l−1)(m, n),   (4)

β_μ^(l)(i, j) = φ( o_μ^(l)(i, j) ).   (5)

From Eq. (4), we can observe that the internal variable o_μ^(l) is the linear product and sum of β_λ^(l−1) and S_{λ,μ}^(l−1). Thus we can view the convolutional kernel not only as a feature extractor, but also as a weight between different channels. For example, if some filters in S_{λ,μ}^(l−1) are set to all zero, then the corresponding channels in β^(l−1) will not be able to pass their values to β_μ^(l). In this way, the μ-th channel in feature map β^(l) can select the exact channels to be used from the previous layer. The channel-wise concatenation enables the model to assign different capability along the channel dimension, which helps to make the model highly efficient.

After the analysis process, we perform STFT on each y_k(n) and denote the result as Y_k(f/K, t). Here the sample rate of y_k(n) is reduced by a factor of K, K ∈ {2, 4, 8}, so the size of the frequency axis is also reduced K-fold. Then we concatenate the Y_k(f/K, t) along the subband dimension,

Y = [ Y_1^(1), Y_2^(1), ..., Y_K^(1), Y_1^(2), Y_2^(2), ..., Y_K^(C) ],   (6)

to form the input feature of the network. Since the data we use in this paper are all stereo, C equals 2 here. The subscript k indexes the subband, the superscript indexes the audio channel, and we omit the (f/K, t) part in this equation for simplicity. We treat different bands as different channels so that the model can both learn different bands independently and incorporate the bands' features in deeper layers.

The synthesis procedure is the reverse of the analysis. We split the network output X̂_j(f/K, t) channel-wise as the prediction of each subband. After iSTFT, we pass the results through the set of synthesis filters to reconstruct the source signal x̂_j(n). The synthesis procedure is not performed during training; the loss function is defined as the sum of two components,

L = L_1 + L_c,   (7)

where L_1 is the L1-norm loss and L_c denotes the conservation loss. The conservation loss helps when two dedicated models are trained jointly because it follows the basic model in Eq. (2) and unites the two independent dedicated models. The L1 loss measures the mean absolute error between the network outputs X̂_j(f/K, t) and the corresponding references, while the conservation loss ties the summed estimates to the mixture:

L_1 = Σ_{j=1}^{2} Σ_{t,f} | X̂_j(f/K, t) − X_j(f/K, t) |,
L_c = Σ_{t,f} | Σ_{j=1}^{2} X̂_j(f/K, t) − Y(f/K, t) |.   (8)
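The sketch below spells out this objective in plain numpy, assuming magnitude-type spectrogram arrays; the conservation term is implemented as the distance between the summed source estimates and the mixture, following Eq. (2). Function names and the array layout are illustrative only, not the authors' code.

```python
import numpy as np

def cws_loss(x_hat, x_ref, y_mix):
    """x_hat, x_ref: (2, F, T) estimates/references for vocal and accompaniment; y_mix: (F, T) mixture."""
    l1 = np.abs(x_hat - x_ref).sum()               # L1 term of Eq. (8)
    lc = np.abs(x_hat.sum(axis=0) - y_mix).sum()   # conservation term: estimates should add up to the mixture
    return l1 + lc                                 # Eq. (7)

# Toy usage with random spectrogram-shaped arrays.
x_hat = np.random.rand(2, 128, 100)
x_ref = np.random.rand(2, 128, 100)
loss = cws_loss(x_hat, x_ref, x_ref.sum(axis=0))
```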
We perform validation after every two hours of training data and stop training if there is no validation improvement in 20 consecutive epochs. All the models are trained using the Adam optimizer [22] with an initial learning rate of 0.001 and a dropout rate of 0.1. The learning rate decays after every thirty hours of training data with a decay rate of 0.87. STFT matrices with an FFT size of 32 ms and a hop size of 8 ms are used as the model input. The actual frame length and shift size (in samples) are automatically calculated from the sample rate of the input audio.
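Putting the pieces together, the following sketch builds the CWS input of Eq. (6): each downsampled subband of each stereo channel is transformed with a 32 ms / 8 ms STFT, with frame sizes derived from the subband sample rate, and the resulting complex spectrograms are stacked along the channel axis. Function names and the exact tensor layout are our own assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft

def cws_input(subbands_per_channel, sr_sub, fft_ms=32, hop_ms=8):
    """subbands_per_channel: list over the C stereo channels, each a list of K subband signals at rate sr_sub."""
    n_fft = int(sr_sub * fft_ms / 1000)        # frame length follows the (reduced) subband sample rate
    hop = int(sr_sub * hop_ms / 1000)
    channels = []
    for channel in subbands_per_channel:       # C = 2 for stereo
        for y_k in channel:                    # K subbands per stereo channel
            _, _, Y_k = stft(y_k, fs=sr_sub, nperseg=n_fft, noverlap=n_fft - hop)
            channels.append(Y_k)               # complex spectrogram with roughly f/K frequency bins
    return np.stack(channels, axis=0)          # shape: (C*K, f/K, T)

# Example: K = 4 subbands of one second of 44.1 kHz stereo audio, each band at 11.025 kHz.
K, sr_sub = 4, 44100 // 4
stereo = [[np.random.randn(sr_sub) for _ in range(K)] for _ in range(2)]
Y = cws_input(stereo, sr_sub)                  # 8 input channels for the network
```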
3. Experiments
In this section, we first describe the dataset and evaluation metrics used in this paper. The experimental comparison and an analysis of the advantages of CWS are then discussed.
We mainly conduct experiments on the publicly available musdb18hq dataset [13]. It has a training set of 100 songs and a test set of 50 songs. We choose 14 songs from the training set as the validation set, following the definition in the python package musdb. To explore the limits of the data, we also train our model on an internal training set, aslp. It contains an additional 617 songs of pure vocals and 1496 songs of pure instruments, collected from the internet. Although some of them may not be absolutely clean, experiments show that using the additional data improves the separation performance.

https://github.com/sigsep/sigsep-mus-db
Table 2: Model comparison in terms of various metrics on the musdb18hq test set

Model         GFLOPs   Params (M)   Train (h)   SAR (A)   SAR (V)   SDR (A)   SDR (V)   SIR (A)   SIR (V)   Average
UNET-5        182.81   13.3         61          14.20     4.32      14.62     3.16      20.89     12.61     11.63
  + CWS K=2
  + CWS K=4
  + CWS K=8
MMDN          27.63    0.33         59          13.22     3.73      14.50     3.12      21.18     11.73     11.25
MDN           37.42    0.27         32          13.94     3.35      13.90     2.59      19.40     10.56     10.62
  + CWS K=2
  + CWS K=4
  + CWS K=8
UNET-6        220.73   53           73          13.34     4.45      14.42     3.28      23.14     9.52      11.36
  + CWS K=2
  + CWS K=4
  + CWS K=8
BD-UNET-6     220.73   53           149         13.87     4.79      15.20     3.94      22.73     11.33     11.98
  + CWS K=2
  + CWS K=4
  + CWS K=8
Table 3: Comparison with the state-of-the-art results
Model             Params (M)   Extra Data   SDR (A) (dB)   SDR (V) (dB)
MMDenseNet [8]    0.33         ✓
MDN K=8                        ×
UNET-5 K=8                     ×
K=4                            ✓

We follow the steps in [15] for data augmentation. During the training stage, we randomly select, chunk, and mix vocal and instrument excerpts, and multiply the two streams with scaling factors randomly sampled between 0.6 and 1.0. All the songs in musdb18hq and aslp are stereo and the sample rate is 44.1 kHz.

We use the museval [23] toolkit to compute the SDR, SIR, and SAR [24] metrics for evaluation. In detail, we calculate the metrics for all segments of each song in the test set with a window size of 1 s and a hop length of 1 s, as commonly used in SiSEC 2018 [23]. We average the frame-wise SDR, SIR, and SAR as the score of a song, and pick the median value over songs as the final score of the test set. All our experiments are performed on a single GTX 1080Ti GPU. For fair comparison, we also report other metrics, e.g., parameter number and training time, as shown in Table 2. We also use Giga Floating Point Operations (GFLOPs) to measure the computational cost. The floating point operations here are measured on a three-second long stereo input.
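As a hedged sketch of this evaluation protocol, the snippet below calls the museval toolkit with a 1 s window and hop, then averages the frame-wise scores per song; the final test-set score would be the median of these per-song averages. The aggregation details and the helper names are our reading of the text rather than museval defaults.

```python
import numpy as np
import museval

def song_scores(references, estimates, sr=44100):
    """references, estimates: arrays of shape (n_sources, n_samples, n_channels)."""
    sdr, isr, sir, sar = museval.evaluate(references, estimates, win=sr, hop=sr)
    # Average the frame-wise scores of this song, skipping NaN frames (e.g. silent segments).
    return {name: np.nanmean(val, axis=1)
            for name, val in zip(("SDR", "ISR", "SIR", "SAR"), (sdr, isr, sir, sar))}

# Final test-set score per metric: median of the per-song averages, e.g.
# sdr_per_song = [song_scores(ref, est)["SDR"] for ref, est in test_pairs]  # test_pairs is hypothetical
# final_sdr = np.median(np.stack(sdr_per_song), axis=0)
```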
The results are shown in Table 2. Here we use MMDN and MDN as abbreviations of MMDenseNet and MDenseNet, UNET-N denotes a UNet of scale N, and the prefix BD means the model is trained with the extra internal aslp dataset. A stands for accompaniment and V stands for vocal. The value K is the total number of subbands in CWS, as shown in Fig. 3.

In general, the results show that the CWS input can considerably improve the performance. All the models with CWS surpass the models without CWS on the average SAR, SDR, and SIR by a large margin. Since the computational cost drops drastically as K increases, a model with a higher K value converges more quickly. This is beneficial when the dataset is huge. Besides, a higher K leads to a smaller feature map, which saves a lot of memory during training and evaluation, making the model and the training process more flexible and easier to deploy.

As can be seen from Table 2, splitting into 4 bands usually gives the best average score on all the evaluation metrics. The average performances of MDN, UNET-5, UNET-6 and BD-UNET-6 increase by 5.7%, 10.1%, 10.2%, and 7.8%, respectively, when using the CWS K=4 input. Although CWS K=4 outperforms CWS K=8 by 1.5%, it takes more time for model training, i.e., 38.5%. Compared with the model without CWS, CWS K=8 increases the performance by 6.8% and costs only 31.8% of the original training time. Moreover, UNET-5/6 and MDN with CWS K=8 achieve the best average SDR, which is valid as a global performance measurement [24]. Thus in practice, CWS K=8 may be the most effective setting because it can yield comparable results with a shorter training time. The CWS K=2 scenario might be the least preferred setting, but it still contributes to the final score.

It is also worth mentioning that MDN K=8 surpasses the performance of the MMDenseNet in [11] even with fewer parameters, far less training data and shorter training time, as shown in Table 3. The training set only has 84 songs, but MDN K=8 is still able to exceed the performance of the model trained with a larger dataset. Moreover, our training time is much shorter. In [8], a single MDenseNet trained on the DSD100 [12] dataset, which comprises 100 songs, takes 37 hours to train for each instrument, and MMDenseNet trained with extra data costs even more. By contrast, our model only takes 9.7 hours to train. All this evidence strongly demonstrates the advantage of using CWS as the model input. Audio samples and code are available online: https://haoheliu.github.io/Channel-wise-Subband-Input/.
4. Conclusions
We present an alternative input feature structure, namely channel-wise subband (CWS) input, for VAS models in the frequency domain, in order to handle the high computational cost and the limitations of conventional CNNs in high-resolution MSS tasks. It overcomes the limitation of the widely used full-band approach and enables the model to learn weights independently in each subband. Experimental results show that the proposed CWS improves the separation performance and reduces the computational cost significantly. On the public musdb18hq dataset, the MDenseNet with 8-band CWS input exceeds the original MDenseNet by 1.67 dB on the average SDR of the voice and accompaniment.

5. References
[1] E. Cano, D. FitzGerald, A. Liutkus, M. D. Plumbley, and F.-R. Stöter, "Musical source separation: An introduction," IEEE Signal Processing Magazine, pp. 31–40, 2018.
[2] J. Perez-Lapillo, O. Galkin, and T. Weyde, "Improving singing voice separation with the wave-u-net using minimum hyperspherical energy," arXiv:1910.10071, 2019.
[3] V. S. Kadandale, J. F. Montesinos, G. Haro, and E. Gómez, "Multi-task u-net for music source separation," arXiv:2003.10414, 2020.
[4] Y. Liu, B. Thoshkahna, A. Milani, and T. Kristjansson, "Voice and accompaniment separation in music using self-attention convolutional neural network," arXiv:2003.08954, 2020.
[5] Y. LeCun and Y. Bengio, Convolutional Networks for Images, Speech, and Time Series, 1998, pp. 255–258.
[6] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep u-net convolutional networks," Proceedings of the International Society for Music Information Retrieval Conference, pp. 323–332, 2017.
[7] W. Choi, M. Kim, J. Chung, D. Lee, and S. Jung, "Investigating deep neural transformations for spectrogram-based musical source separation," arXiv:1912.02591, 2019.
[8] N. Takahashi and Y. Mitsufuji, "Multi-scale multi-band densenets for audio source separation," pp. 21–25, 2017.
[9] J. Taghia and M. A. Doostari, "Subband-based single-channel source separation of instantaneous audio mixtures," World Applied Sciences Journal, pp. 784–792, 2009.
[10] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N. Yen, C. C. Tung, and H. H. Liu, "The empirical mode decomposition and the hilbert spectrum for nonlinear and non-stationary time series analysis," Proceedings of The Royal Society A: Mathematical, Physical and Engineering Sciences, pp. 903–995, 1998.
[11] N. Takahashi, N. Goswami, and Y. Mitsufuji, "MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation," pp. 106–110, 2018.
[12] A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, "The 2016 signal separation evaluation campaign," International Conference on Latent Variable Analysis and Signal Separation, pp. 323–332, 2017.
[13] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "MUSDB18-HQ - an uncompressed version of MUSDB18," 2019.
[14] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "Content based singing voice extraction from a musical mixture," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 781–785, 2020.
[15] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," pp. 261–265, 2017.
[16] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, 2015.
[17] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," International Conference on Acoustics, Speech and Signal Processing, pp. 1562–1566, 2014.
[18] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.
[19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," Conference on Computer Vision and Pattern Recognition, pp. 2261–2269, 2017.
[20] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., "DurIAN: Duration informed attention network for multimodal synthesis," arXiv:1909.01700, 2019.
[21] I. Moazzen and P. Agathoklis, "A general approach for filter bank design using optimization," IET Signal Processing, 2014.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
[23] F.-R. Stöter, A. Liutkus, and N. Ito, "The 2018 signal separation evaluation campaign," pp. 293–305, 2018.
[24] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "Performance measurement in blind audio source separation."