Mixing-Specific Data Augmentation Techniques for Improved Blind Violin/Piano Source Separation
Ching-Yu Chiu, Wen-Yi Hsiao, Yin-Cheng Yeh, Yi-Hsuan Yang, Alvin Wen-Yu Su
Ching-Yu Chiu
Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, [email protected]
Wen-Yi Hsiao
Yating Music Team, Taiwan AI Labs, [email protected]
Yin-Cheng Yeh
Yating Music Team, Taiwan AI Labs, [email protected]
Yi-Hsuan Yang
Research Center for IT Innovation, Academia Sinica, [email protected]
Alvin Wen-Yu Su
Dept. of CSIE, National Cheng Kung University, [email protected]
Abstract—Blind music source separation has been a popular and active subject of research in both the music information retrieval and signal processing communities. To counter the lack of available multi-track data for supervised model training, a data augmentation method that creates artificial mixtures by combining tracks from different songs has been shown useful in recent works. Following this line, we examine in this paper extended data augmentation methods that consider more sophisticated mixing settings employed in the modern music production routine, the relationship between the tracks to be combined, and factors of silence. As a case study, we consider the separation of violin and piano tracks in a violin piano ensemble, evaluating the performance in terms of common metrics, namely SDR, SIR, and SAR. In addition to examining the effectiveness of these new data augmentation methods, we also study the influence of the amount of training data. Our evaluation shows that the proposed mixing-specific data augmentation methods can help improve the performance of a deep learning-based model for source separation, especially in the case of small training data.
Index Terms—Music source separation, data augmentation
I. INTRODUCTION
Music source separation, i.e., separating the sources (instruments) involved in an audio recording, has been a major research topic in the signal processing community, partly due to its wide downstream applications in music upmixing and remixing, karaoke, DJ-related applications, and as a pre-processing tool for other problems [1]–[16]. Its technical difficulty has also been well acknowledged. For example, the estimation of the number of sources involved in a song remains challenging [17]. Even when the number of sources is known or given beforehand, supervised training requires multi-track recordings that provide the ground-truth single-instrument tracks (a.k.a. 'stems') that compose a song. Such multi-track recordings are rarely publicly available due to copyright issues [18], [19]. What are typically available instead are the mixed versions of the songs, where the multiple tracks have been mixed and combined into a monaural (i.e., one-channel) or stereo (two-channel) recording. The lack of available data with ground-truth stems not only limits the development of data-driven methods, but also hinders systematic evaluation of new methods proposed for the task.
Fig. 1. Flowchart of the baseline random mixing (bottom-left) and the proposed four augmentation methods (on the right). Given the collected violin solos and piano solos (in gray block), three types (Silence, Mixing, and Pairing) of augmentations (in blue blocks) can be applied to produce four groups of augmented violin/piano chunks (in green buckets). If none of them is applied, we use the violin/piano chunks (in orange bucket) for the baseline random mixing. Best viewed in color.
Over the past decades, the main methods for tackling this task can be roughly classified into two categories: model-based methods and data-centered methods. As discussed in a recent review paper [8], the performance of model-based methods can change dramatically when their core assumptions are not met. On the other hand, data-centered methods rely heavily on the availability of professionally produced or recorded multi-track data, which is hard to come by due to copyright issues. As some medium-scale multi-track datasets have been released in the past few years, the development of data-driven models has grown fast. A data-driven source separation model is typically a supervised model that takes an audio mixture (a monaural or stereo recording) as the input and aims to recover the tracks (i.e., multiple monaural or stereo recordings) that compose the mixture. In doing so, there have been two main approaches in the literature. In the first approach, which we refer to as the no data augmentation approach, the tracks that compose an input mixture are originally from the same song. In other words, when we have N multi-track songs, we would have exactly N input/output pairs for model training. In the second approach, or random-mixing data augmentation [5], [13], tracks from different songs are randomly combined to create audio mixtures, leading to artificial input/output pairs. In such a case, we can have a much larger number of input/output pairs, i.e., N′ ≫ N. The downside of this approach is that these input mixtures are not realistic sound mixtures in terms of the tonic, harmonic, and rhythmic relations among the tracks that compose them. However, for the purpose of training data-driven models for source separation, the benefit of the resulting great increase in the amount of training data seems to outweigh this potential concern, as demonstrated in the literature [5], [13].

While there are some other data augmentation methods such as adding noise or random dropout [20], [21], the aforementioned random-mixing data augmentation, albeit simple, has been shown particularly successful [13]. However, when mixing two tracks, there are actually multiple aspects to consider [22], suggesting room for the development of more advanced mixing-specific data augmentation methods for source separation. To our best knowledge, this has not yet been investigated in the literature. It is therefore our goal to develop, and to empirically evaluate the effectiveness of, new data augmentation methods that stem from the random-mixing approach. We consider in total three types of mixing-specific data augmentation methods for source separation, as conceptually visualized in Figure 1 and detailed in Section IV.

Besides, currently available data are still not enough for many common musical instruments, such as the violin. To investigate the benefit of data augmentation for music source separation in general, we consider in this paper the separation of violin and piano tracks in a violin piano ensemble, a task that has rarely been considered in the literature. Moreover, we consider the case when no multi-track recordings of violin piano ensembles are available for model training, but instead only a collection of piano solos and violin solos.
In such a case, mixing-based data augmentation becomes a major viable approach.

In sum, the main contribution of this study is to propose a series of data augmentation/selection approaches that enable non-paired violin/piano solo stems to approximate features of realistic paired stems, which in turn facilitate the training of deep learning-based source separation models. For reproducibility, we share the code, pre-trained models, and the audio files of the ground truth and separated stems (by the 'Wet' model; see Section IV) of the test data publicly at https://github.com/SunnyCYC/aug4mss and https://sunnycyc.github.io/aug4mss_demo/.

Below, we review related work and the adopted network architecture for source separation in Section II. Section III describes the training and test data employed in our implementation. Section IV presents the proposed data augmentation methods, while Section V discusses the evaluation results. Finally, we conclude the paper in Section VI.

Fig. 2. The network architecture of Open-Unmix [13]. Taking an input mix spectrogram, the model produces the target spectrogram by multiplying the input with a full-band mask. The model is based on a three-layer bidirectional LSTM capable of taking input of arbitrary lengths. The number on the left side of each component indicates the frequency bin size of the output. The model crops and only processes the spectrogram under 16 kHz, and reconstructs the intermediate product to full band by the last fully connected layer 'fc3.'

II. BACKGROUND
A. Related works
Data augmentation for improving the performance of deep neural networks has been an important topic in the field of music information retrieval (MIR) in recent years. Schlüter and Grill [21] were among the earliest to systematically explore the utility of music data augmentation for singing voice detection with neural networks. They found pitch shifting combined with time stretching and random frequency filtering to be quite helpful in reducing the classification error. Uhlich et al. [5] proposed two neural network architectures capable of yielding state-of-the-art results for music source separation at that time, and further boosted the performance through data augmentation and network blending. Hawthorne et al. [23] experimented with the use of mixing techniques such as equalization, contrast, and reverberation in an attempt to make their automatic piano transcription model more robust to different recording environments and piano qualities. To our knowledge, such mixing techniques have not been employed in existing work on source separation.

New neural network architectures for blind music source separation have been continuously proposed as well. For example, Manilow et al. [15] utilized the inherent synergy between transcription and source separation to improve both tasks using a multi-task learning architecture. Liu and Yang [10] combined dilated convolution with modified gated recurrent units (GRU) to extend the receptive field of each dilated GRU unit, enabling their model to perform better and faster than state-of-the-art models for separating vocals and accompaniment.
B. Model Architecture
As our focus is on the data augmentation techniques, we adopt an existing blind source separation framework called Open-Unmix [13] as the backbone architecture in our work. Open-Unmix is a hybrid convolutional-recurrent architecture (see Figure 2) that is open source. It takes a fixed-length chunk of the short-time Fourier transform (STFT) spectrogram of the mixture as the input and aims to produce the corresponding separated spectrogram of one of the sources at the output. Although the model is trained on fixed-length chunks, at testing time it can be applied to spectrograms of arbitrary length. The parameters of the network are learned by minimizing the difference between the spectrograms of the ground truth and the separation result, calculated in terms of mean square error (a minimal sketch of this formulation is given at the end of this subsection).

The original Open-Unmix model does not deal with the separation of piano and violin tracks [13], but it is easy to use their code base to train the model on our piano and violin data. In comparison, the other famous open-source separation model, called Spleeter [14], has a built-in function to isolate the piano track from an input mixture, but it does not provide the script for retraining the model to isolate the violin track. In consequence, we consider the pre-trained Spleeter model released by the authors as a baseline method for performance comparison, instead of as the backbone architecture of our model.
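To make the above formulation concrete, the following is a minimal PyTorch sketch of a mask-based separator trained with an MSE loss. It is not the actual Open-Unmix implementation; the layer sizes (n_bins, hidden) and the sigmoid mask are illustrative assumptions.

# Minimal sketch (not the actual Open-Unmix code) of the mask-based
# separation idea: a recurrent network predicts a full-band mask that is
# multiplied with the input mix spectrogram, and training minimizes the
# MSE against the ground-truth target spectrogram.
import torch
import torch.nn as nn

class MaskSeparator(nn.Module):
    def __init__(self, n_bins=2049, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(n_bins, hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_bins)

    def forward(self, mix_mag):            # (batch, frames, bins)
        h, _ = self.blstm(mix_mag)
        mask = torch.sigmoid(self.fc(h))   # full-band mask in [0, 1]
        return mask * mix_mag              # estimated target spectrogram

model = MaskSeparator()
criterion = nn.MSELoss()
mix = torch.rand(4, 256, 2049)             # dummy magnitude spectrograms
target = torch.rand(4, 256, 2049)
loss = criterion(model(mix), target)       # minimized during training
loss.backward()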
III. DATASETS
A. Violin/piano Solo as Training Stems
We collected six hours of classical violin solo recordings and six hours of pop piano solo recordings from the Internet as our training and validation data. For each instrument, five hours of data are allocated for training and the remaining one hour for validation. All songs are divided into 10-second chunks. During training and validation, one chunk is randomly selected from each instrument for mixing and serving as the input data.
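The following is a minimal sketch of this chunking-and-mixing procedure, assuming 44.1 kHz mono audio loaded with librosa; the file lists (violin_files, piano_files) are hypothetical placeholders.

# Minimal sketch of cutting solo recordings into 10-second chunks and
# summing one violin chunk with one piano chunk, drawn from (in general)
# different songs, to form an artificial training mixture.
import random
import numpy as np
import librosa

SR = 44100
CHUNK = 10 * SR

def to_chunks(path):
    y, _ = librosa.load(path, sr=SR, mono=True)
    return [y[i:i + CHUNK] for i in range(0, len(y) - CHUNK + 1, CHUNK)]

violin_files = ["violin_solo_01.wav"]   # hypothetical paths to violin solos
piano_files = ["piano_solo_01.wav"]     # hypothetical paths to piano solos

violin_chunks = [c for f in violin_files for c in to_chunks(f)]
piano_chunks = [c for f in piano_files for c in to_chunks(f)]

def random_mix():
    v = random.choice(violin_chunks)
    p = random.choice(piano_chunks)
    mix = v + p                          # mixture input; v and p are the targets
    return mix, v, p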
B. MedleyDB as Evaluation Data
To evaluate the performance of our method on real data, we select multi-track songs that contain realistic violin and piano stems from MedleyDB [18], [19] for evaluation. There are only 16 such songs, highlighting the difficulty of collecting multi-track recordings for model training. Moreover, after listening to these songs one by one, we discard 10 of them, as there are severe leakage issues and accordingly the piano/violin stems are not purely piano/violin. We also note that most of the remaining six songs contain more than one violin stem. For such songs, we consider the combination of the piano track and one of the violin tracks as the input mixture, thereby creating multiple partially realistic mixtures from the same song. All the other instruments from these songs (e.g., cello) are also excluded. As a result, we have 16 violin piano ensembles in total for evaluation.

We note that, due to copyright restrictions, we are unfortunately not able to share the audio files of the training data. Researchers interested in violin piano separation would have to collect the training data on their own. However, as mentioned by the end of Section I, we make public the aforementioned 16 violin piano ensembles so that people can evaluate their models on the same test set.

TABLE I
THREE TYPES OF MIXING-RELATED DATA AUGMENTATION METHODS WE PROPOSE HERE, AND THE CORRESPONDING PARAMETER SETTINGS

Mixing
  Description        Scale Range
  Contrast amount    1–...

Pairing
  Description        Threshold   Notes
  Chroma distance    0.48        Mean within songs: 0.45; mean across songs: 0.51
  Correlation        20          Within songs: 2–30

Silence
  Description        Threshold   Notes
  Reduce silence     20          top_db

IV. PROPOSED AUGMENTATION/SELECTION METHODS
As shown in Figure 1 and Table I, we consider three types of augmentation methods here. The mixing type augmentation includes factors related to the common mixing process, such as the use of equalization, contrast, reverb, and the addition of pink noise. The pairing type methods are designed based on consideration of the tonic, harmonic, and rhythmic relations between real paired stems. Lastly, the silence type method is designed to eliminate the difference in silence duration between violin/piano stems, to avoid potential imbalance of the training data. In what follows, we refer to the unprocessed original stems as the original stems.

A. Wet Stems
Following [23], we apply common approaches in the mixing process and also the addition of pink noise as our first augmentation method. The pink noise is employed to simulate the background noise seen in real recordings. The augmentation parameters are shown in Table I. For each original stem, we set a 30% probability to apply each specific process (e.g., equalization). The pink noise is applied using colorednoise, and the others are applied using a Python package called pysox [24]. In what follows, we also refer to a model trained with this data augmentation method (i.e., the 'Wet' method) as the Wet model.
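As an illustration, the following sketch shows one way to implement the 'Wet' processing with pysox and colorednoise. The effect parameters (EQ band, gain and reverberance ranges, noise level) are placeholders rather than the exact settings of Table I.

# Minimal sketch of the 'Wet' augmentation: each effect is applied to a
# stem with 30% probability, and pink noise is added to simulate
# background noise. Parameter ranges below are illustrative only.
import random
import numpy as np
import colorednoise
import sox   # the pysox package imports as 'sox'

def wet_augment(in_wav, out_wav):
    tfm = sox.Transformer()
    if random.random() < 0.3:
        tfm.equalizer(frequency=1000, width_q=1.0,
                      gain_db=random.uniform(-6, 6))   # placeholder EQ band
    if random.random() < 0.3:
        tfm.contrast(amount=random.uniform(1, 100))    # SoX 'contrast' effect
    if random.random() < 0.3:
        tfm.reverb(reverberance=random.uniform(10, 50))
    tfm.build(in_wav, out_wav)

def add_pink_noise(y, snr_db=40.0):
    noise = colorednoise.powerlaw_psd_gaussian(1, len(y))  # exponent 1 = pink
    # Scale the noise so the signal-to-noise ratio equals snr_db.
    noise *= np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return y + noise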
B. Chroma Distance-based Pairing

Based on general principles of music theory and psychoacoustics, the paired stems in music are usually highly coherent in pitch, or are in similar keys. Therefore, instead of randomly selecting stems for combination, this augmentation method only picks the stems that have a short chroma distance to one another, in the hope that the resulting mixture would be more similar to real ensembles. Specifically, we average the chromagram, a representation of the time-varying intensities of the twelve different pitch classes, to derive a 12-dimensional chroma feature for each stem, and then calculate the Euclidean distance between all violin/piano stems. We then need a threshold for selecting qualified training violin/piano stems. In doing so, we rely on statistics calculated from the MedleyDB test songs. As shown in Table I, the mean chroma distance between the violin/piano stems from the same song of MedleyDB is 0.45, shorter than the mean distance of violin/piano stems from different songs (0.51). We therefore set the threshold to 0.48. Only stem pairs with a chroma distance lower than this threshold are selected and mixed as our training data.
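A minimal sketch of this pairing criterion is given below, assuming librosa's default chroma computation; the exact feature extraction and normalization used to obtain the statistics in Table I may differ.

# Minimal sketch of the chroma distance-based pairing criterion: each stem
# is summarized by its 12-dimensional average chroma vector, and a pair is
# kept only if the Euclidean distance falls below the threshold.
import numpy as np
import librosa

THRESHOLD = 0.48  # threshold derived from MedleyDB statistics (Table I)

def chroma_vector(path, sr=44100):
    y, _ = librosa.load(path, sr=sr, mono=True)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # (12, frames)
    return chroma.mean(axis=1)                         # 12-dim average chroma

def is_pairable(violin_path, piano_path):
    d = np.linalg.norm(chroma_vector(violin_path) - chroma_vector(piano_path))
    return d < THRESHOLD   # keep the pair only if the chroma distance is small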
C. Correlation-based Pairing
Another pairing method is to consider whether the piano stem and violin stem are active (i.e., non-silent) at the same time; namely, whether they co-occur. To implement this, we calculate the absolute value of the 2-dimensional cross correlation between the waveform magnitudes (using scipy.signal.correlate2d) of all the violin/piano stems, and set a threshold for selecting the stems to be combined for training. As shown in Table I, the cross-correlation values for true paired stems in the MedleyDB test songs range from 2–30. We therefore empirically set the threshold to 20.
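The following sketch shows one possible interpretation of this co-occurrence check; the envelope downsampling and the normalization behind the 2–30 range are assumptions here, not the authors' exact computation.

# Minimal sketch of a correlation-based pairing criterion: the waveform
# magnitudes of two stems are cross-correlated with scipy.signal.correlate2d,
# and a pair is kept only if the resulting co-activity score exceeds a
# threshold.
import numpy as np
from scipy.signal import correlate2d

THRESHOLD = 20  # chosen from the 2-30 range observed for true MedleyDB pairs

def co_activity_score(violin, piano, hop=4410):
    # Crudely downsample the sample-wise magnitudes (hop of 0.1 s at
    # 44.1 kHz is an assumption) so the 2-D correlation stays tractable.
    v = np.abs(violin)[::hop].reshape(1, -1)
    p = np.abs(piano)[::hop].reshape(1, -1)
    n = min(v.shape[1], p.shape[1])
    corr = correlate2d(v[:, :n], p[:, :n], mode="valid")  # shape (1, 1)
    return np.abs(corr).max()

def is_pairable(violin, piano):
    return co_activity_score(violin, piano) > THRESHOLD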
D. Silence Removal before Mixing
For the case of the violin piano ensemble, we empirically observe that sometimes the activity of the violin part is too sparse, making the two sounds unbalanced in the mixture. We consider it worth investigating whether this would influence the performance of the resulting separation model. Therefore, we apply librosa.effects.split [25] with top_db = 20 to remove the silent parts in the training/validation data, before dividing the data into 10-second chunks for mixing. We have also experimented with other values of top_db and found that 20 works better.
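A minimal sketch of this silence-removal step, assuming 44.1 kHz mono stems and a hypothetical file name, is given below.

# Minimal sketch of the silence-removal step: non-silent intervals detected
# with librosa.effects.split (top_db=20) are concatenated before chunking.
import numpy as np
import librosa

def remove_silence(y, top_db=20):
    intervals = librosa.effects.split(y, top_db=top_db)      # (start, end) pairs
    return np.concatenate([y[s:e] for s, e in intervals])    # keep only active audio

y, sr = librosa.load("violin_solo_01.wav", sr=44100, mono=True)  # hypothetical file
y_active = remove_silence(y)   # then divide y_active into 10-second chunks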
V. EXPERIMENTS
A. Experiment Settings & Evaluation Metrics
As discussed in Section I, there are many common instruments that suffer from the lack of multi-track data. For less common instruments, such as erhu and suona, even the number of solo recordings is limited. Therefore, we study the effectiveness of the proposed augmentation methods in two scenarios: a data-limited one and a data-rich one.

Specifically, for each instrument and each augmentation method, we train (from scratch) a separation model using the Open-Unmix architecture and the corresponding processed stems. In each training epoch, the model randomly selects N pairs of stems from the pool of processed stems to mix as the training data. To simulate the data-limited case, we set N = 250 and adopt only 16 minutes of the training data for each instrument. For the data-rich case, we use the full data for training and set N to a larger value. All audio is processed as monaural tracks with a 44,100 Hz sampling rate. The window size and hop length of the STFT are 4,096 and 1,024, respectively. The predicted spectrograms are converted back to time-domain waveforms using the Griffin-Lim algorithm for phase estimation before the inverse STFT (a minimal sketch of this reconstruction step is given at the end of this subsection).

For evaluation metrics, we adopt the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR), implemented in the BSS_eval toolkit [1]. Following the convention in SiSEC (https://sisec.inria.fr/), we report the median values over the testing songs.

We consider the following two methods as the baseline methods. First, we use the Open-Unmix model to train the separation model using the existing random-mixing data augmentation method; we refer to this method as the Random method. Second, we use the official pretrained 5-stem model of Spleeter [14], the current state-of-the-art for singing voice separation. Specifically, we use the default 'mask'-based implementation. Given an input mixture, it generates as the output separated stems of piano, vocal, drum, bass, and others. From the result of Spleeter, we sum all but the piano stem as the separated violin stem.

Training an Open-Unmix based network using any of the proposed data augmentation methods takes around 15 hours on an NVIDIA GeForce GTX 1080 GPU. At testing time, separating a test song takes much less time.
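The following is a minimal sketch of this reconstruction step using librosa; the authors' exact implementation may differ.

# Minimal sketch of converting a predicted magnitude spectrogram back to a
# waveform with Griffin-Lim phase estimation, using the same STFT
# parameters as in training (window 4096, hop 1024).
import numpy as np
import librosa

N_FFT, HOP = 4096, 1024

def spectrogram(y):
    return np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))

def reconstruct(predicted_mag):
    # Griffin-Lim iteratively estimates a phase consistent with the magnitude,
    # then applies the inverse STFT internally.
    return librosa.griffinlim(predicted_mag, hop_length=HOP)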
B. Results & Discussion

The results for the data-limited case are shown in Table II. The following observations can be made. First, for piano, all the proposed augmented models outperform the two baselines in SDR and SAR. In particular, the improvement made by the 'Correlation' method is more than 2 dB for both metrics. Second, for violin, except for the 'Correlation' method that still gains improvement in SDR and SIR, the help of the other augmentation methods becomes less obvious. The performance drop of the 'Wet' method for violin indicates a potential risk that the randomly applied mixing type method distorts the data under the data-limited scenario. Finally, all the Open-Unmix based models we train greatly outperform the pre-trained, violin-agnostic Spleeter model in almost all the metrics for both instruments, except for the piano in terms of SIR. This is not surprising, as the Spleeter model has not been specifically trained on violin data. The failure to recognize the violin also hurts its performance on the piano.
TABLE II
THE MEDIAN VALUES OF THE EVALUATION METRICS (IN DB) OVER THE MEDLEYDB SONGS FOR THE DATA-LIMITED SCENARIO

Piano
  Method                      SDR    SIR    SAR
  Spleeter (pretrained) [14]  5.20   –      –
  Random-mixing               –      –      –
  Wet                         –      –      –
  Chroma                      –      –      –
  Correlation                 –      –      –
  NonSilence                  8.76   13.30  11.51

Violin
  Method                      SDR    SIR    SAR
  Spleeter (pretrained) [14]  0.28   2.05   0.22
  Random-mixing               1.08   6.22   2.76
  Wet                         0.73   3.46   –
  Chroma                      1.54   5.48   2.70
  Correlation                 –      –      –
  NonSilence                  –      –      –
The results for the data-rich case are shown in Table III. From all the metrics we can see that the help of the augmentations for both instruments becomes less obvious. For the piano, the baseline 'Random' method seems strong enough. For the violin, the baseline 'Random' method is only inferior to the proposed methods by a small margin. This implies that, when the training data is big enough, the diversity of the data also increases and overshadows the help of sophisticated augmentation methods. From Tables II and III, we also see that the separation performance generally improves along with the increase in the amount of training data, which is not surprising.

TABLE III
EVALUATION RESULT (IN DB) FOR THE DATA-RICH SCENARIO

Piano
  Method           SDR     SIR     SAR
  Random-mixing    –       –       –
  Wet              11.76   –       –
  Chroma           –       –       –
  Correlation      –       –       –
  NonSilence       –       –       –

Violin
  Method           SDR     SIR     SAR
  Random-mixing    3.84    17.16   4.38
  Wet              –       –       –
  Chroma           –       –       –
  Correlation      4.19    15.86   4.23
  NonSilence       3.03    –       –

The spectrograms of the original and predicted audio of a test song are shown in Figure 3. It can be seen that the pre-trained Spleeter model cannot separate the piano from the violin well. In contrast, despite some visible residuals, the models trained with the proposed methods work well in separating the piano and violin. We also see from the result of the violin that some of the proposed methods (e.g., the 'Wet' model) can get rid of the leakage of the piano part and noises in the low frequency bands, while the baseline methods suffer. Figure 4 provides another example result. As there is a note offset in the latter half of the violin, the performance difference among the methods can be seen more clearly. We can see again that the proposed methods suffer less from the leakage of the piano in the low frequency bands. Besides, the high-frequency piano residuals are a bit more prominent in the result of the baseline 'Random' model than in the 'Wet' model. We also note an extra low-frequency component (highlighted by the red arrow) in the ground truth violin in this example. Since it is below the lowest frequency of the violin sound (i.e., 196 Hz), we listened to it and found that it seems to be some background noise or reverb. The proposed methods can also exclude this low-frequency part from the separation result.

Fig. 3. The spectrograms of a mixture from MedleyDB (titled 'MatthewEntwistle_ImpressionsOfSaturn_p0_v1.wav' in our GitHub repo), the ground truth violin/piano stems, and the predicted stems by different methods, trained under the data-rich scenario.

Fig. 4. The result of another mixture from MedleyDB ('MatthewEntwistle_AnEveningWithOliver_p0_v6.wav').

VI. CONCLUSIONS AND FUTURE WORKS

In this paper, we have proposed and investigated a number of mixing-related data augmentation methods to facilitate the training of deep learning models for music source separation.
The results demonstrate the effectiveness of the proposed augmentation methods in the case of small data. When the training data is big enough, the existing random-mixing based approach [5] is strong enough. We have also shown that the implemented models outperform the currently available best model, Spleeter [14], in violin/piano separation. We believe such training and augmentation methods have the potential to benefit other source separation tasks with a limited amount of training data available. In the future, we are interested in extending our experiments to other popular yet less investigated instruments (e.g., guitar and saxophone), and in experimenting with more advanced mixing methods (e.g., applying mixing after the stems are selected, rather than applying mixing to the stems individually beforehand as done here).
REFERENCES

[1] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, July 2006.
[2] Y.-H. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," in Proc. International Society for Music Information Retrieval Conference, 2013.
[3] P.-K. Jao, L. Su, Y.-H. Yang, and B. Wohlberg, "Monaural music source separation using convolutional sparse coding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
[4] H. Tachibana, Y. Mizuno, N. Ono, and S. Sagayama, "A real-time audio-to-audio karaoke generation system for monaural recordings based on singing voice suppression and key conversion techniques," Journal of Information Processing, vol. 24, no. 3, pp. 470–482, 2016.
[5] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 261–265, 2017.
[6] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," in Proc. International Society for Music Information Retrieval Conference, 2017.
[7] D. Stoller, S. Ewert, and S. Dixon, "Adversarial semi-supervised audio source separation applied to singing voice extraction," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 2391–2395.
[8] Z. Rafii, A. Liutkus, F. R. Stoter, S. I. Mimilakis, D. Fitzgerald, and B. Pardo, "An overview of lead and accompaniment separation in music," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 26, no. 8, pp. 1307–1335, 2018.
[9] J.-Y. Liu and Y.-H. Yang, "Denoising auto-encoder with recurrent skip connections and residual regression for music source separation," in Proc. IEEE Int. Conf. Machine Learning and Applications, 2018.
[10] ——, "Dilated convolution with dilated GRU for music source separation," in Proc. Int. Joint Conference on Artificial Intelligence, 2019.
[11] G. Meseguer-Brocal and G. Peeters, "Conditioned-U-Net: Introducing a control mechanism in the U-Net for multiple source separations," in Proc. International Society for Music Information Retrieval Conference, 2019, pp. 159–165.
[12] J. H. Lee, H. Choi, and K. Lee, "Audio query-based music source separation," in Proc. International Society for Music Information Retrieval Conference, 2019, pp. 878–885.
[13] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - A reference implementation for music source separation," Journal of Open Source Software, vol. 4, no. 41, p. 1667, 2019.
[14] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, "Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models," in International Society for Music Information Retrieval Conference, Late-breaking paper, 2019.
[15] E. Manilow, P. Seetharaman, and B. Pardo, "Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 771–775.
[16] T.-H. Hsieh, K.-H. Cheng, Z.-C. Fan, Y.-C. Yang, and Y.-H. Yang, "Addressing the confounds of accompaniments in singer identification," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
[17] Y. Luo, Z. Chen, J. R. Hershey, J. L. Roux, and N. Mesgarani, "Deep clustering and conventional networks for music separation: Stronger together," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 61–65.
[18] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. Bello, "MedleyDB: A multitrack dataset for annotation-intensive MIR research," Proc. International Society for Music Information Retrieval Conference, pp. 155–160, 2014.
[19] R. M. Bittner, J. Wilkins, H. Yip, and J. P. Bello, "MedleyDB 2.0: New data and a system for sustainable data collection," Proc. International Society for Music Information Retrieval Conference, pp. 2–4, 2016.
[20] H. Erdogan and T. Yoshioka, "Investigations on data augmentation and loss functions for deep learning based speech-background separation," Proc. INTERSPEECH, pp. 3499–3503, 2018.
[21] J. Schlüter and T. Grill, "Exploring data augmentation for improved singing voice detection with neural networks," Proc. International Society for Music Information Retrieval Conference, pp. 121–126, 2015.
[22] B. D. Man, "Towards a better understanding of mix engineering," Ph.D. dissertation, Queen Mary University of London, 2017.
[23] C. Hawthorne et al., "Enabling factorized piano music modeling and generation with the MAESTRO dataset," Proc. International Conference on Learning Representations, pp. 1–12, 2019.
[24] R. M. Bittner, E. Humphrey, and J. P. Bello, "Pysox: Leveraging the audio signal processing power of SoX in Python," Proc. International Society for Music Information Retrieval Conference, pp. 4–6, 2016.
[25] B. McFee et al., "librosa: Audio and music signal analysis in Python," in Proc. Python in Science Conference, 2015.