Semi-Supervised Singing Voice Separation with Noisy Self-Training
Zhepei Wang ♯∗, Ritwik Giri †, Umut Isik †, Jean-Marc Valin †, Arvindh Krishnaswamy †
♯ University of Illinois at Urbana-Champaign
† Amazon Web Services
∗ Work performed while at Amazon Web Services.
ABSTRACT
Recent progress in singing voice separation has primarily focused on supervised deep learning methods. However, the scarcity of ground-truth data with clean musical sources has long been a problem. Given a limited set of labeled data, we present a method to leverage a large volume of unlabeled data to improve the model's performance. Following the noisy self-training framework, we first train a teacher network on the small labeled dataset and infer pseudo-labels from the large corpus of unlabeled mixtures. Then, a larger student network is trained on the combined ground-truth and self-labeled datasets. Empirical results show that the proposed self-training scheme, along with data augmentation methods, effectively leverages the large unlabeled corpus and obtains superior performance compared to supervised methods.
Index Terms — Singing voice separation, self-training, self-attention, data augmentation
1. INTRODUCTION
The task of singing voice separation is to separate an input mixture into its components: singing voice and accompaniment. It is a crucial problem in music information retrieval and has commercial uses such as music remixing and karaoke applications. It also has the potential to provide useful information for downstream tasks such as song identification, lyric transcription, singing voice synthesis, and voice cloning without access to clean sources.

Deep learning models have recently shown promising results in singing voice separation. Popular methods are mostly supervised, where a deep neural network is trained on a multi-track corpus with paired vocal and accompaniment ground truths. [1, 2] apply dense connections between convolutional or long short-term memory (LSTM) blocks to estimate separate masks, and [3] uses a bidirectional LSTM (BLSTM) network in the separator. Models with multi-scale processing further improve separation performance. With the concatenation of features at different scales along with skip connections, U-Net [4] can maintain long-term temporal correlation while processing local information at higher resolution. Such architectures have been effective in both time-frequency-domain [5, 6, 7, 8] and end-to-end time-domain methods [9, 10]. Models that simultaneously process features at different resolutions along multiple paths have also shown effectiveness in singing voice separation systems [11, 12].

The primary challenge for supervised deep learning methods is the lack of training data with ground truth. The issue is more significant for larger networks, which are more prone to overfitting. There are several multi-track datasets publicly available for singing voice separation, including MIR-1K [13], ccMixter [14], and MUSDB [15].
However, these datasets are relatively small (all of them combined amount to around 15 hours) and not diverse. To artificially increase the size of the dataset, [6, 16, 17] apply data augmentation to the signal, including random channel swapping, amplitude scaling, remixing sources from different songs, time stretching, pitch shifting, and filtering. These methods, individually or combined, have been shown empirically to enhance separation performance only by a limited margin [6].

On the other hand, semi-supervised and unsupervised methods do not require a large corpus with a one-to-one correspondence between the mixtures and ground-truth sources. [9] leverages mixture data by first training a silent-source detector on a small labeled dataset, then mixing recordings with only one source and mixture recordings with that source being silent, and finally optimizing a weakly supervised loss. [18, 19] propose generative adversarial frameworks that require isolated sources only; the distance between the distributions of the separator's output and of the isolated sources is minimized with adversarial training. [20, 21] use unpaired vocal and accompaniment data to learn non-negative, smooth representations with a denoising auto-encoder using an unsupervised objective. [22] proposes a stage-wise algorithm in which a clustering-based labeler assigns time-frequency bin labels with a confidence measure, and a student separator network is trained on these labels afterward.

Self-training is a semi-supervised framework in which a pre-trained teacher model assigns pseudo-labels to unlabeled data. A student model is then trained on the self-labeled dataset. It has been applied in several applications such as image recognition [23] and automatic speech recognition [24]. Our approach follows the noisy self-training method, in which we investigate data augmentation methods for musical signals and evaluate how they affect separation performance.

Within the framework of noisy self-training, we aim to improve the performance of a deep separator network when only a limited amount of data with ground truth is available. The contributions of this work are as follows:

• We use a large unlabeled corpus to improve separation results under the noisy self-training framework.
• We show how data augmentation can improve the model's ability to generalize, with a focus on random remixing between sources.
• We propose to use a voice activity detector to evaluate the quality of self-labeled data in the student training and to perform data filtering.

2. SYSTEM DESCRIPTION

2.1. Noisy Self-Training for Singing Voice Separation

Our proposed self-training framework for singing voice separation consists of the following steps:

1. Train a teacher separator network M_0 on a small labeled dataset D_l.
2. Assign pseudo-labels to the large unlabeled dataset D_u with M_0 to obtain the self-labeled dataset D_1.
3. Filter data samples from D_1 to obtain D_f.
4. Train a student network M_1 with D_l ∪ D_f.

This framework can be made iterative by repeating steps 2 to 4, using the student network M_i as the new teacher to obtain a self-labeled dataset D_{i+1} and training a new student model M_{i+1}. The process stops when there is no further performance gain. We illustrate the framework pipeline in Figure 1; a minimal code sketch of the loop follows the figure caption below.

Fig. 1. The pipeline of noisy self-training for singing voice separation.
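The overall loop is compact enough to sketch directly. In the illustrative Python below, train_fn, label_fn, and filter_fn are hypothetical stand-ins for the model training, pseudo-labeling, and VAD-based filtering procedures described in the remainder of this section; they are not part of our actual implementation.

```python
# Minimal sketch of the self-training loop (steps 1-4 above).
# Datasets are plain lists of training examples; the callables stand in
# for the actual training, inference, and filtering code.

def noisy_self_training(train_fn, label_fn, filter_fn,
                        labeled_set, unlabeled_set, n_iter=1):
    teacher = train_fn(labeled_set)                      # step 1: teacher on D_l
    for _ in range(n_iter):
        self_labeled = label_fn(teacher, unlabeled_set)  # step 2: D_{i+1}
        filtered = filter_fn(self_labeled)               # step 3: D_f
        teacher = train_fn(labeled_set + filtered)       # step 4: student on D_l ∪ D_f
        # the student becomes the teacher for the next iteration
    return teacher                                       # final student model
```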
2.1.1. Data Filtering

Poor-quality self-labeled samples may contain leakage of the singing voice in the accompaniment tracks or leakage of the musical background in the vocal tracks. To filter out these samples, we evaluate the quality of the data with a voice activity detector (VAD).

The VAD takes the STFT magnitude spectrogram of the mixture as input and predicts the frame-level energy ratio between the source and the mixture. We use the 2D-CRNN architecture with the same configuration as in [25]. We train two separate VADs to estimate the energy ratio of vocal over mixture and of accompaniment over mixture, respectively. The ground truth is defined as 0 when both the vocal and the accompaniment are silent. The VADs are trained on the same labeled dataset as the teacher separator model, using a binary cross-entropy loss.

To measure the leakage of accompaniment in the vocal, we pass the self-labeled vocal track through the accompaniment activity detector. Similarly, we feed the self-labeled background track into the vocal activity detector to detect leakage of the singing voice. A frame is defined as a "poor-quality frame" if either its accompaniment energy in the vocal track or its vocal energy in the background track is higher than some threshold. We count the total number of "poor-quality frames" for each song; songs with a smaller percentage of such frames are considered to have higher quality.
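As a concrete illustration of this scoring, the following NumPy sketch counts poor-quality frames from the two VAD outputs and keeps the highest-quality songs. The function names and the 0.5 threshold are assumptions for illustration, not the exact values used in our experiments.

```python
import numpy as np

def poor_frame_fraction(acc_ratio_in_vocal, voc_ratio_in_background,
                        threshold=0.5):
    """Fraction of "poor-quality frames" in one self-labeled song.

    acc_ratio_in_vocal: frame-level accompaniment/mixture energy ratios
        predicted by the accompaniment VAD on the self-labeled vocal track.
    voc_ratio_in_background: frame-level vocal/mixture energy ratios
        predicted by the vocal VAD on the self-labeled background track.
    A frame is "poor" if either ratio exceeds the threshold.
    """
    poor = (np.asarray(acc_ratio_in_vocal) > threshold) | \
           (np.asarray(voc_ratio_in_background) > threshold)
    return float(poor.mean())

def keep_highest_quality(scored_songs, fraction=0.25):
    """Keep the given fraction of songs with the fewest poor-quality frames.
    scored_songs: list of (song_id, poor_frame_fraction) pairs."""
    ranked = sorted(scored_songs, key=lambda pair: pair[1])
    return ranked[: max(1, int(len(ranked) * fraction))]
```

Section 4.3 reports results when only the top quarter of the self-labeled songs (fraction=0.25) is kept.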
2.2. Data Augmentation

Data noise is a key component of the noisy self-training framework. We apply data augmentation methods in the training of both the teacher and the student models. Each training sample contains both vocal and accompaniment tracks of duration 30 seconds. To augment the training set, we randomly select a window of duration T seconds (with T < 30) from the sample. We also perform "random mixing" by combining the vocal and background sources from two randomly selected songs with probability p. In addition, we apply a dynamic mixing ratio, pitch shifting, lowpass filtering, and EQ filtering to the data.

2.3. Network Architecture

We use PoCoNet [26] for both the teacher and student models. The neural network takes the concatenation of the real and imaginary parts of the mixture's STFT spectrogram as input. The separator estimates a complex ratio mask for each source, and the waveform signal is obtained by applying the inverse STFT to the estimated spectrograms.

The separator is a fully-convolutional 2D U-Net architecture with DenseNet and attention blocks. Each DenseNet block contains three convolutional layers, each followed by batch normalization and a rectified linear unit (ReLU). Convolutional operations are causal in the time direction but not in the frequency direction. We choose a kernel size of 3 × 3 and a stride of 1, and the number of channels increases from 32, 64, 128 to 256. We control the size of the network by varying the number of U-Net levels and the maximum number of channels. In the attention module, the number of channels is set to 5 and the encoding dimension for key and query is 20. The connections of layers in the DenseNet and attention blocks follow [26]. Frequency-positional embeddings are applied to each time-frequency bin of the input spectrogram. For time frame t and frequency bin f, the embedding vector is defined as

ρ(t, f) = (cos(π f/F), cos(2π f/F), ..., cos(2^{k−1} π f/F)),   (1)

where F is the frequency bandwidth and k = 10 is the dimension of the embedding.

2.4. Loss Functions

For each output source, the loss function is a weighted sum of a waveform loss and a spectral loss:

L_s(y, ŷ) = λ_audio L_audio(y, ŷ) + λ_spec L_spec(Y, Ŷ),   (2)

where s is the output source, y and ŷ are the time-domain output and reference signals, and Y = |STFT(y)|, Ŷ = |STFT(ŷ)| are the corresponding STFT magnitude spectrograms. We choose both L_audio(·) and L_spec(·) to be the ℓ1 loss. The total loss is a weighted sum over the sources:

L(y, ŷ) = λ_voc L_voc(y, ŷ) + λ_acc L_acc(y, ŷ).   (3)
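For concreteness, the embedding of Equation 1 and the per-source loss of Equation 2 can be rendered in a few lines of NumPy. This is a reference sketch only: the actual models operate on deep-learning-framework tensors, and stft_fn is a stand-in for whichever STFT implementation is used.

```python
import numpy as np

def frequency_positional_embedding(n_bins, k=10):
    """Embedding of Eq. (1): for frequency bin f, the k-dimensional vector
    (cos(pi f/F), cos(2 pi f/F), ..., cos(2^(k-1) pi f/F)). It depends only
    on f, so the same vector is attached to every time frame t."""
    f = np.arange(n_bins, dtype=np.float64)            # frequency bin indices
    scales = (2.0 ** np.arange(k)) * np.pi / n_bins    # pi/F, 2pi/F, ..., 2^(k-1) pi/F
    return np.cos(f[:, None] * scales[None, :])        # shape (n_bins, k)

def source_loss(y, y_hat, stft_fn, lam_audio=1.0, lam_spec=1.0):
    """Per-source loss of Eq. (2) with l1 audio and spectral terms;
    stft_fn is assumed to return a complex spectrogram."""
    audio = np.mean(np.abs(y - y_hat))
    spec = np.mean(np.abs(np.abs(stft_fn(y)) - np.abs(stft_fn(y_hat))))
    return lam_audio * audio + lam_spec * spec
```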
3. EXPERIMENTAL SETUP

3.1. Dataset
We use MIR-1K [13], ccMixter [14], and the training partition of MUSDB [15] as the labeled dataset for supervised training. The training set contains approximately 11 hours of recordings. We use DAMP [27] as the unlabeled dataset for training the student model. The DAMP dataset contains more than 300 hours of vocal and background recordings from karaoke app users. Since these recordings are not professionally produced, there is bleeding of music in the vocal tracks and bleeding of singing voice in the accompaniment tracks; hence, the dataset is not suitable for supervised source separation.
3.2. Preprocessing

To reduce dimensionality and speed up processing, we downsample each track to 16 kHz and convert it to mono. We further segment the recordings into non-overlapping 30-second segments; if a segment is shorter than 30 seconds, we zero-pad it at the end. The spectrograms are computed with a 1024-point STFT with a hop size of 256.
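A possible implementation of this preprocessing is sketched below; the use of librosa is an illustrative choice, and any resampling/STFT library would serve equally well.

```python
import numpy as np
import librosa  # illustrative choice of resampling/STFT library

SR, SEG_SEC, N_FFT, HOP = 16000, 30, 1024, 256

def preprocess(path):
    """Downmix to 16 kHz mono, cut into non-overlapping 30 s segments
    (zero-padding the last one), and compute 1024-point STFTs with hop 256."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    seg_len = SR * SEG_SEC
    n_seg = int(np.ceil(len(y) / seg_len))
    y = np.pad(y, (0, n_seg * seg_len - len(y)))       # zero-pad the tail
    segments = y.reshape(n_seg, seg_len)
    return [librosa.stft(s, n_fft=N_FFT, hop_length=HOP) for s in segments]
```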
3.3. Training Details

For both teacher and student training, we minimize Equation 3 with the Adam optimizer with an initial learning rate of 1e-4, and we halve the learning rate every 100k iterations until it is no greater than 1e-6. We set λ_audio = λ_spec = 1 in Equation 2 and λ_voc = λ_acc = 1 in Equation 3.

To augment the training set, we randomly select a window of size T = 2.5, 5, or 10 seconds as the input to the model to study the effect of input length, with batch sizes of 4, 2, and 1, respectively. The maximal batch size is chosen under the memory limit. We experiment with different probabilities of applying random mixing, p = 0, 0.25, 0.5, 0.75, 1.0.

The teacher model is trained on the labeled datasets. It then assigns pseudo-labels to the unlabeled dataset. We infer vocal labels using the DAMP vocal tracks as input to the teacher model and infer accompaniment labels from the DAMP accompaniment tracks. Due to the leakage in these vocal and background tracks, they can be viewed as mixtures in which one source is more likely to dominate the other, compared to normal mixtures.
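The random mixing augmentation amounts to swapping the accompaniment of one song for that of another before summing. The sketch below is a simplified illustration under an assumed data layout (a list of equal-length vocal/accompaniment NumPy array pairs), not our exact training code; the dynamic mixing ratio and filtering augmentations are omitted.

```python
import random

def sample_training_pair(dataset, p=1.0, seconds=10, sr=16000):
    """With probability p, pair the vocal window of one randomly chosen song
    with the accompaniment window of another ("random mixing"); otherwise
    take a time-aligned window from a single song."""
    n = seconds * sr
    vocal, accomp = random.choice(dataset)
    if random.random() < p:
        _, accomp = random.choice(dataset)   # accompaniment from another song
        v0 = random.randrange(max(1, len(vocal) - n))
        a0 = random.randrange(max(1, len(accomp) - n))
        vocal, accomp = vocal[v0:v0 + n], accomp[a0:a0 + n]
    else:
        start = random.randrange(max(1, len(vocal) - n))
        vocal, accomp = vocal[start:start + n], accomp[start:start + n]
    return vocal + accomp, vocal, accomp     # mixture and its ground truths
```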
4. EVALUATION RESULTS AND DISCUSSIONS

4.1. Evaluation Framework
As in previous studies on singing voice separation [1, 2, 4, 7, 19], we measure the signal-to-distortion ratio (SDR) to evaluate separation performance. Following the SiSEC separation campaign [28], we use the 50 songs from the test partition of MUSDB [15] as the test set. We partition each audio track into non-overlapping one-second segments, take the median of the segment-wise SDR for each song, and report the median over all 50 songs. We use the Python package museval (https://sigsep.github.io/sigsep-mus-eval/) to compute SDR.
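This aggregation can be sketched as follows, assuming museval's evaluate() with window and hop sizes given in samples (one second at 16 kHz); the helper name and array layout are illustrative.

```python
import numpy as np
import museval  # https://sigsep.github.io/sigsep-mus-eval/

def song_sdr(references, estimates, sr=16000):
    """Median SDR over non-overlapping one-second segments of one song.
    references/estimates: arrays of shape (n_sources, n_samples, n_channels)."""
    sdr, _isr, _sir, _sar = museval.evaluate(
        references, estimates, win=sr, hop=sr)   # 1 s frames, no overlap
    return np.nanmedian(sdr, axis=1)             # per-source song-level SDR

# Reported metric: the median of the song-level SDRs over all 50 test songs,
# e.g. np.median(np.stack([song_sdr(r, e) for r, e in test_songs]), axis=0)
```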
4.2. Teacher Model

We select the configuration for the teacher model by experimenting with different input window sizes for the training samples and different numbers of model parameters. Table 1 shows the test SDR for the combinations of input and model size. We first observe that using a longer input improves both vocal and accompaniment SDR. The improvement can be attributed to the attention blocks, where a longer input context provides more information for separation. Another observation is that larger models do not guarantee a performance gain: the largest model (15.4M parameters) performs significantly better than the smallest one (1.6M) but slightly worse than the 8.3M version for random mixing probabilities p = 0 and 0.5.

| Len (s) | Size (1e6) | Prob RM | Use DAMP | SDR(V) | SDR(A) | Mean |
|---------|------------|---------|----------|--------|--------|-------|
| 2.5     | 8.3        | 0       | No       | 1.84   | 10.31  | 6.08  |
| 2.5     | 8.3        | 0.5     | No       | 1.72   | 9.51   | 5.62  |
| 5       | 8.3        | 0       | No       | 3.55   | 10.91  | 7.23  |
| 5       | 8.3        | 0.5     | No       | 4.08   | 11.34  | 7.71  |
| 10      | 8.3        | 0       | Yes      | 3.93   | 11.46  | 7.70  |
| 10      | 8.3        | 0       | No       | 5.88   | 12.52  | 9.20  |
| 10      | 8.3        | 0.25    | No       | 6.35   | 12.56  | 9.46  |
| 10      | 8.3        | 0.5     | No       | 7.06   | 13.35  | 10.21 |
| 10      | 8.3        | 0.75    | No       | 6.98   | 13.36  | 10.17 |
| 10      | 8.3        | 1.0     | No       | **6.91** | **13.66** | **10.29** |

Table 1. Test performance metrics (SDR in dB) for teacher model candidates. We experiment with various input sizes, numbers of model parameters, and probabilities of random mixing to pick the best configuration for the teacher model. The best performance is highlighted in bold.

Using the best combination of input length (10 seconds) and model size (8.3M), we experiment with different probabilities of applying random mixing. [6] reports that random mixing does not have a positive effect on test SDR; one possible explanation is that it creates mixtures with somewhat independent sources. Our experiments, however, indicate that random mixing alone significantly improves the results, and the best performance is obtained when random mixing is always applied. Our observations are consistent with the argument in [29] that "one-versus-all" separation benefits from mixing independent tracks. Intuitively, mixtures with dependent sources are more difficult to separate: random mixing makes it easier for the model to learn and to converge faster on the training set. Meanwhile, by mixing sources from different songs, the training set becomes more diverse and the model generalizes better at inference time.

In addition, we verify that the DAMP dataset should not be used directly in supervised source separation tasks by including it along with the other labeled datasets. We experiment with two model sizes (1.6M and 8.3M) using 10-second input without random mixing, and the SDR values degrade sharply in both cases.

4.3. Student Model

Table 2 summarizes the test SDR for the student models. As opposed to the teacher model, the 15.4M student model has a 0.75 dB SDR gain over the 8.3M model. The observation that a larger-capacity student model improves performance is consistent with the findings in [23].

| Size (1e6) | Top % | SDR(V) | SDR(A) | Mean |
|------------|-------|--------|--------|-------|
| 8.3        | 1     | 6.57   | 12.92  | 9.75  |
| 15.4       | 1     | 7.27   | 13.73  | 10.50 |
| 15.4       | 0.5   | 7.52   | 13.91  | 10.72 |
| 15.4       | 0.25  | **7.80** | **13.92** | **10.86** |

Table 2. Test performance metrics (SDR in dB) for student models. We experiment with different model sizes and different proportions of quality-controlled self-labeled samples. The best performance is shown in bold.

To verify the quality-control approach with VADs, we first count for each song the number of "poor-quality frames", as defined in Section 2.1.1, in three different datasets: DAMP, self-labeled DAMP, and MUSDB. From the visualization in Figure 2, the unprocessed DAMP dataset contains the highest percentage of data with a large number of poor-quality frames, the distribution of MUSDB is concentrated in the low-count region, and the self-labeled dataset lies in between. This implies that the count of "poor-quality frames" based on the VAD outputs is a reasonable indicator of the quality of data samples. The experimental results demonstrate that the proposed data filtering method with VADs further improves performance: the highest SDR is obtained when only the top quarter of the self-labeled data is included in training. Incorporating a higher percentage of self-labeled data may provide more diversity but is more likely to include poor-quality samples, thus hurting the model's performance.
4.4. Comparison with Baseline Models

To compare our singing voice separation with the state of the art, we also include models that separate the mixture into four sources. It has been shown in [6] that these four-source models have vocal separation performance similar to that of two-source models, even though the four-source separation task is more challenging than the two-source counterpart, possibly because of the additional supervision provided by the different instrumental sources in the multi-task learning setup. Hence, we include the vocal SDR values of state-of-the-art four-source models [10, 11] in our comparison. Our proposed approach, the student model using quality control with VADs, obtains the highest vocal and average SDR among all models, and its vocal separation outperforms the others by a significant margin. The accompaniment SDR is higher than that of the baseline model with mono input [7] but worse than the stereo ones [1, 2]. Stereo input carries more spatial information for accompaniment than for vocals, since the left/right channel differences of background tracks are at a much larger scale than those of vocal tracks. Such information may improve the separation of accompaniment.

Table 3. Comparison of the proposed method and other baseline models. The best performance is shown in bold.
Fig. 2. Violin plots of the count of "poor-quality frames" for the DAMP, self-labeled DAMP, and MUSDB datasets.
5. CONCLUSION
We present a semi-supervised method for singing voice separation to deal with the scarcity of data with ground truth. Using the noisy self-training framework, we can effectively make use of a large unlabeled dataset to train a deep separation network. Experimental results show that random mixing as data augmentation improves model training, and that the data filtering method with pre-trained voice activity detectors improves the quality of the self-labeled training samples. Our study serves as a foundation for more complicated systems, such as using stereo input, working with unlabeled datasets containing mixtures only (as opposed to noisy source tracks), and extending the teacher-student loop with additional iterations.
6. REFERENCES

[1] Naoya Takahashi and Yuki Mitsufuji, "Multi-scale multi-band DenseNets for audio source separation," CoRR, vol. abs/1706.09588, 2017.

[2] N. Takahashi, N. Goswami, and Y. Mitsufuji, "MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation," in IWAENC, 2018, pp. 106–110.

[3] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - a reference implementation for music source separation," Journal of Open Source Software, 2019.

[4] Daniel Stoller, Sebastian Ewert, and Simon Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in ISMIR, 2018.

[5] Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam, "Spleeter: A fast and efficient music source separation tool with pre-trained models," Journal of Open Source Software, vol. 5, no. 50, pp. 2154, 2020.

[6] Laure Prétet, Romain Hennequin, Jimena Royo-Letelier, and Andrea Vaglio, "Singing voice separation: A study on training data," in ICASSP, 2019, pp. 506–510.

[7] Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro, and Emilia Gómez, "Multi-channel U-Net for music source separation," 2020.

[8] Yuzhou Liu, Balaji Thoshkahna, Ali A. Milani, and Trausti Kristjansson, "Voice and accompaniment separation in music using self-attention convolutional neural network," ArXiv, vol. abs/2003.08954, 2020.

[9] Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach, "Demucs: Deep extractor for music sources with extra unlabeled data remixed," ArXiv, vol. abs/1909.01174, 2019.

[10] Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis R. Bach, "Music source separation in the waveform domain," ArXiv, vol. abs/1911.13254, 2019.

[11] Eliya Nachmani, Yossi Adi, and Lior Wolf, "Voice separation with an unknown number of multiple speakers," ArXiv, vol. abs/2003.01531, 2020.

[12] David Samuel, Aditya Ganeshan, and Jason Naradowsky, "Meta-learning extractors for music source separation," in ICASSP, 2020, pp. 816–820.

[13] C. Hsu and J. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310–319, 2010.

[14] A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, and L. Daudet, "Kernel additive models for source separation," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4298–4310, 2014.

[15] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner, "The MUSDB18 corpus for music separation," Dec. 2017.

[16] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," in ICASSP, 2017, pp. 261–265.

[17] Alice Cohen-Hadria, Axel Röbel, and Geoffroy Peeters, "Improving singing voice separation using deep U-Net and Wave-U-Net with data augmentation," in EUSIPCO, 2019, pp. 1–5.

[18] D. Stoller, S. Ewert, and S. Dixon, "Adversarial semi-supervised audio source separation applied to singing voice extraction," in ICASSP, 2018, pp. 2391–2395.

[19] Michael Michelashvili, Sagie Benaim, and Lior Wolf, "Semi-supervised monaural singing voice separation with a masking network trained on synthetic mixtures," in ICASSP, 2019, pp. 291–295.

[20] Stylianos Ioannis Mimilakis, Konstantinos Drossos, and Gerald Schuller, "Unsupervised interpretable representation learning for singing voice separation," ArXiv, vol. abs/2003.01567, 2020.

[21] Stylianos Ioannis Mimilakis, Konstantinos Drossos, and Gerald Schuller, "Revisiting representation learning for singing voice separation with Sinkhorn distances," ArXiv, vol. abs/2007.02780, 2020.

[22] Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, and Bryan Pardo, "Bootstrapping deep music separation from primitive auditory grouping principles," ArXiv, vol. abs/1910.11133, 2019.

[23] Qizhe Xie, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le, "Self-training with noisy student improves ImageNet classification," ArXiv, vol. abs/1911.04252, 2019.

[24] Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V. Le, "Improved noisy student training for automatic speech recognition," ArXiv, vol. abs/2005.09629, 2020.

[25] Fatemeh Pishdadian, Gordon Wichern, and Jonathan Le Roux, "Finding strength in weakness: Learning to separate sounds with weak supervision," ArXiv, vol. abs/1911.02182, 2019.

[26] Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, and Arvindh Krishnaswamy, "PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss," in INTERSPEECH, 2020.

[27] Smule, Inc., "DAMP-VSEP: Smule Digital Archive of Mobile Performances - Vocal Separation," Oct. 2019.

[28] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito, "The 2018 Signal Separation Evaluation Campaign," pp. 293–305, 2018.

[29] Ethan Manilow, Prem Seetharman, and Justin Salamon.