Multi-Task Self-Supervised Pre-Training for Music Classification
Ho-Hsiang Wu, Chieh-Chi Kao, Qingming Tang, Ming Sun, Brian McFee, Juan Pablo Bello, Chao Wang
Music and Audio Research Laboratory, New York University, USA; Alexa Speech, Amazon
⋆ Work done at Amazon. This work is partially supported by a National Science Foundation award.
ABSTRACT
Deep learning is data hungry, and supervised learning in particular requires large amounts of labeled data to work well. Machine listening research often suffers from limited labeled data, as human annotations are costly to acquire and audio annotation is time consuming and less intuitive. Moreover, models learned from a labeled dataset often embed biases specific to that dataset. Unsupervised learning techniques have therefore become popular for machine listening problems. In particular, a self-supervised learning technique that reconstructs multiple hand-crafted audio features has shown promising results when applied to speech tasks such as emotion recognition and automatic speech recognition (ASR). In this paper, we apply self-supervised and multi-task learning to pre-train music encoders, and explore several design choices: encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections for pretext tasks. We investigate how these design choices interact with various downstream music classification tasks. We find that combining several music-specific workers with a weighting mechanism that balances their losses during pre-training improves performance and generalization on the downstream tasks.
Index Terms — Self-supervised learning, multi-task learning, music classification
1. INTRODUCTION
Deep learning has shown great success, with end-to-end learned representations replacing hand-crafted features in many machine perception fields, including computer vision, natural language processing, and machine listening, especially in the supervised learning paradigm. However, unlike ImageNet for computer vision, which contains millions of labeled images, human-annotated datasets for machine listening are usually small [1]. Learning from limited labeled data [2] is therefore especially important. Existing approaches include transfer learning [3] and domain adaptation, where models learned from other tasks with larger datasets are transferred and fine-tuned to a new task or domain, and unsupervised learning [4, 5, 6], such as generative models [7, 8], where the data distribution is often learned through reconstruction of the signal.

Self-supervised learning [9, 10, 11, 12], a sub-field of unsupervised learning, exploits the structure of the input data to provide supervision signals. It has become more popular in recent years, showing good improvements in multiple fields. In self-supervised learning, raw signals are transformed, and models are
optimized with reconstruction or contrastive losses against the original signals, under the assumption that preserving temporal or spatial consistency leads to meaningful representations. These representations have proven useful for generalizing to and solving downstream tasks. Multi-task learning [13], on the other hand, improves generality by solving multiple tasks together during training, and the weighting mechanism among the per-task losses is crucial [14, 15]. Self-supervised and multi-task learning have been combined and applied to the speech domain with success in [16, 17], where reconstructions of various hand-crafted features are used for pre-training and the learned representations are evaluated on downstream emotion recognition and automatic speech recognition (ASR) tasks.

Similar to speech, music is a highly structured audio signal, and many hand-crafted features have been designed specifically for music to solve various music information retrieval (MIR) tasks. In this paper, we are interested in applying self-supervised and multi-task learning methods to pre-train music encoders. We explore various design choices, including encoder architectures, weighting mechanisms to combine losses from pretext tasks, and worker selections to reconstruct music-specific hand-crafted features such as Mel-frequency cepstral coefficients (MFCCs) for timbre [18], Chroma for harmonic [19], and Tempogram [20] for rhythmic attributes. Our main contributions are: 1. providing suggestions on the best design choices among all the variations in our experiments, and 2. investigating how different selections of pretext tasks interact with the performance of downstream music classification tasks covering instrument, rhythm, and genre recognition.
2. METHOD
Fig. 1. Diagram of multi-task self-supervised encoder pre-training and downstream music classification evaluation.

A two-stage approach, involving unsupervised or self-supervised pre-training followed by supervised training for evaluation on downstream tasks, is commonly adopted in recent literature [9, 10, 16, 17], especially in the context of limited labeled data, where representation learning is key. To evaluate the effectiveness of the pre-training, simple linear or multi-layer perceptron (MLP) classifiers are typically used, so the pre-trained encoders must capture meaningful representations to perform well on such linear separation evaluations.

2.1. Pre-training

As shown in Figure 1, we combine self-supervised and multi-task learning ideas for pre-training. Raw audio inputs are passed through multiple encoding layers, and the outputs are two-dimensional representations that preserve temporal information. These encoded representations are then used to solve pretext tasks via workers, including waveform reconstruction and prediction of several popular hand-crafted MIR features, which jointly guide the learning.
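To make the setup concrete, the following is a minimal PyTorch sketch of the idea, assuming a toy convolutional encoder and simple per-frame regression workers. The layer sizes, worker names, and target dimensions in FEATURE_DIMS are illustrative placeholders, not the actual PASE/PASE+ configuration, and the feature targets are assumed to be pre-computed and frame-aligned with the encoder output.

```python
import torch
import torch.nn as nn

# Placeholder per-worker target dimensionalities (illustrative, not the paper's exact values).
FEATURE_DIMS = {"mfcc": 20, "chroma": 12, "tempogram": 384, "lps": 1025, "prosody": 4}

class ToyEncoder(nn.Module):
    """Simplified stand-in for the PASE/PASE+ encoder: waveform -> (batch, 512, frames)."""
    def __init__(self, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=251, stride=160, padding=125),  # ~10 ms hop at 16 kHz
            nn.BatchNorm1d(64), nn.PReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=3, padding=1),
            nn.BatchNorm1d(emb_dim), nn.PReLU(),
        )

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.net(wav)           # (batch, emb_dim, frames)

class Worker(nn.Module):
    """Small regression head predicting one hand-crafted feature per frame."""
    def __init__(self, emb_dim, out_dim):
        super().__init__()
        self.head = nn.Conv1d(emb_dim, out_dim, kernel_size=1)

    def forward(self, z):
        return self.head(z)

encoder = ToyEncoder()
workers = nn.ModuleDict({name: Worker(512, dim) for name, dim in FEATURE_DIMS.items()})
params = list(encoder.parameters()) + list(workers.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

def pretrain_step(wav, targets, loss_weights=None):
    """One multi-task step. `targets` maps worker name -> frame-aligned target tensor.
    Equal weighting is used unless per-worker `loss_weights` are supplied."""
    z = encoder(wav)
    total = 0.0
    for name, worker in workers.items():
        pred = worker(z)
        loss = nn.functional.mse_loss(pred, targets[name])
        w = 1.0 if loss_weights is None else loss_weights[name]
        total = total + w * loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```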
2.2. Downstream evaluation

After pre-training, we remove the workers and feed the encoder outputs to MLP classifiers for the downstream tasks. We adopt the three training scenarios proposed in [16]: 1. Supervised: initialize the encoder weights randomly and train from scratch on the downstream datasets directly. 2. Frozen: treat the pre-trained encoder as a feature extractor with frozen weights, attach trainable MLP classifiers, and optimize only the classifier weights. 3. Fine-tuned: initialize the encoder with the pre-trained weights and fine-tune the encoder together with the downstream task.
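As a rough illustration, the three scenarios mainly differ in whether pre-trained weights are loaded and whether the encoder parameters are updated. The helper below is a hypothetical PyTorch sketch (function and parameter names are our own), not the authors' training code; it assumes the checkpoint stores an encoder state dict.

```python
import torch
import torch.nn as nn

def build_downstream_model(encoder, n_classes, scenario, checkpoint_path=None):
    """Configure the encoder for one of the three training scenarios.
    scenario: 'supervised' (random init), 'frozen', or 'fine-tuned'."""
    if scenario in ("frozen", "fine-tuned") and checkpoint_path is not None:
        encoder.load_state_dict(torch.load(checkpoint_path))   # start from pre-trained weights
    if scenario == "frozen":
        for p in encoder.parameters():
            p.requires_grad = False                             # encoder acts as a fixed feature extractor
    classifier = nn.Linear(512, n_classes)                      # simple probe on 512-d embeddings
    trainable = [p for p in list(encoder.parameters()) + list(classifier.parameters())
                 if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-3)
    return classifier, optimizer
```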
3. EXPERIMENTAL DESIGN
We experiment with various design choices during pre-training, including: 1. encoder architectures, 2. worker selections for pretext tasks, and 3. weighting mechanisms for the losses from the pretext tasks. We provide more details on the downstream evaluations and the data used for both pre-training and downstream tasks in Sections 3.4 and 3.5.
3.1. Encoder architectures

We compare two encoder architectures proposed in two relevant speech-domain studies that inspire our work, referred to as PASE [16] and PASE+ [17], respectively.
1. PASE: We use the same encoder architecture as the original PASE work [16] and its source-code implementation (https://github.com/santi-pdp/pase). The first layer is based on SincNet [21], where the raw input waveform is convolved with a set of parameterized sinc functions implementing rectangular band-pass filters; the authors claim that SincNet has fewer parameters and provides better interpretability. The SincNet layer is followed by seven one-dimensional convolutional blocks with batch normalization [22] and multi-parametric rectified linear unit activations [23]. We use the same model parameters as the original work, including kernel widths, numbers of filters, and strides; this set of convolutional parameters emulates a 10 ms sliding window.

2. PASE+: PASE+ [17] improves upon PASE [16] by adding skip connections and Quasi-Recurrent Neural Network (QRNN) [24] layers to capture longer-term contextual information. QRNN layers interleave convolutional layers with RNN layers to speed up training through parallel optimization while maintaining comparable performance.

3.2. Worker selections for pretext tasks

Inspired by the original PASE work [16], we select waveform reconstruction, log power spectrum (LPS), and prosody features as baseline workers. We then choose three popular hand-crafted MIR features, MFCC, Chroma, and Tempogram, as mixed-in workers. For waveform reconstruction, the encoder layers are applied in reverse order to decode the embeddings, optimized with a mean absolute error (MAE) loss. For all other workers, we use MLPs with convolutional layers and a mean squared error (MSE) loss.

Waveform, LPS, and MFCC are commonly used in machine listening. Chroma is inspired by Western 12-tone theory, folding frequencies into 12 bins spanning one octave. Tempogram [20] takes the local auto-correlation of the onset strength envelope. As in [16], the prosody features include zero crossing rate (ZCR), energy, voiced/unvoiced probability, and fundamental frequency (F0) estimation, resulting in 4 features concatenated along the temporal dimension. For LPS, MFCC, Chroma, Tempogram, and prosody, we use the librosa implementations (https://github.com/librosa/librosa) with hop length = 160, n_fft = 2048, and sr = 16000, so that each hop corresponds to 10 ms and matches the encoder parameters, with other parameters left at their defaults.
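The pretext targets described above can be approximated with librosa roughly as follows. The hop length, FFT size, and sample rate match the values stated in the text, while the remaining settings (e.g., the F0 range and the simplified three-feature prosody stack) are illustrative assumptions, and clip.wav is a placeholder path.

```python
import librosa
import numpy as np

SR, HOP, N_FFT = 16000, 160, 2048   # 10 ms hop to match the encoder stride

def worker_targets(y):
    """Compute hand-crafted features used as regression targets (one column per ~10 ms frame)."""
    stft = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)
    lps = np.log(np.abs(stft) ** 2 + 1e-10)                                        # log power spectrum
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_fft=N_FFT, hop_length=HOP)           # timbre
    chroma = librosa.feature.chroma_stft(y=y, sr=SR, n_fft=N_FFT, hop_length=HOP)  # harmony (12 bins)
    tempogram = librosa.feature.tempogram(y=y, sr=SR, hop_length=HOP)              # rhythm
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=N_FFT, hop_length=HOP)
    rms = librosa.feature.rms(y=y, frame_length=N_FFT, hop_length=HOP)             # energy
    f0 = librosa.yin(y, fmin=65, fmax=2093, sr=SR, frame_length=N_FFT, hop_length=HOP)
    # Simplified prosody stack (the paper also uses a voiced/unvoiced probability).
    prosody = np.vstack([zcr, rms, f0[np.newaxis, :]])
    return {"lps": lps, "mfcc": mfcc, "chroma": chroma,
            "tempogram": tempogram, "prosody": prosody}

y, _ = librosa.load("clip.wav", sr=SR)   # placeholder audio file
targets = worker_targets(y)
```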
3.3. Weighting mechanisms

We explore two weighting mechanisms to combine the losses from the workers during pre-training. 1. Equal weighted: simply sum the losses from the different workers for backpropagation. 2. Re-weighted: take the validation losses per worker over the first 10 epochs of equal-weighted training, average the loss per worker, take the reciprocal as the new weight, and retrain from scratch with these weights. The intuition is that the losses from each worker will then contribute more equally during backpropagation.
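A minimal sketch of the re-weighting computation, assuming per-worker validation losses have been logged during equal-weighted pre-training; the function and variable names are hypothetical, and the ten-epoch window follows the text.

```python
import numpy as np

def reweight_from_validation(val_losses_per_worker):
    """val_losses_per_worker: dict mapping worker name -> list of validation losses
    from the first 10 epochs of equal-weighted pre-training.
    Returns reciprocal-of-mean weights so each worker contributes comparably."""
    weights = {}
    for name, losses in val_losses_per_worker.items():
        mean_loss = float(np.mean(losses[:10]))   # average over the first 10 epochs
        weights[name] = 1.0 / mean_loss           # large-loss workers are down-weighted, small-loss workers up-weighted
    return weights

# Example with made-up magnitudes: LPS dominates, Tempogram is tiny.
val_losses = {"lps": [9.0] * 10, "mfcc": [1.2] * 10, "tempogram": [0.05] * 10}
print(reweight_from_validation(val_losses))  # {'lps': ~0.11, 'mfcc': ~0.83, 'tempogram': ~20.0}
```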
3.4. Downstream evaluation

After pre-training, we remove the workers for the pretext tasks and attach a simple MLP classifier to the encoder output. The input layer of the MLP mean-pools across the temporal dimension, resulting in a single 512-dimensional embedding, followed by one fully connected layer whose output dimension matches the number of classes of each downstream dataset. We train the three scenarios discussed in Section 2.2 (supervised, frozen, and fine-tuned) with the same hyper-parameters: the Adam optimizer [25] with an initial learning rate of 0.001 and early stopping on the validation loss with a patience of 10. We run 10 trials for each experiment in this paper to obtain statistically meaningful results.
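A sketch of the downstream probe, assuming the encoder emits (batch, 512, frames) tensors as above; the mean pooling and single fully connected layer follow the description in the text, while the early-stopping helper is a generic implementation rather than the authors' exact criterion.

```python
import torch
import torch.nn as nn

class ProbeClassifier(nn.Module):
    """Mean-pool the encoder output over time, then apply one fully connected layer."""
    def __init__(self, emb_dim=512, n_classes=8):
        super().__init__()
        self.fc = nn.Linear(emb_dim, n_classes)

    def forward(self, z):              # z: (batch, emb_dim, frames)
        pooled = z.mean(dim=-1)        # (batch, emb_dim)
        return self.fc(pooled)         # logits, (batch, n_classes)

class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad_epochs = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True -> stop training
```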
3.5. Data

3.5.1. Data for pre-training

We use clips in AudioSet [26] with the "Music" label for pre-training. We were able to acquire ~2M clips (97% of the original AudioSet data), of which ~980k are labeled with "Music". We randomly select 100k of these for pre-training, resulting in ~83 hours of data.

3.5.2. Datasets for downstream evaluation

Three publicly available classification datasets, OpenMIC [27], Extended Ballroom [28], and FMA Small (FMA) [29], are used for downstream evaluation as representative samples of well-known MIR tasks. These datasets differ in number of clips, clip duration, and number of classes. For all three datasets, we report macro F1 scores in the figures.

1. OpenMIC [27]: a multi-label instrument classification dataset containing 15k samples in total, with provided train/valid/test splits as well as masks indicating strong positive and negative examples for each class. We follow a setup similar to the official baseline (https://github.com/cosmir/openmic-2018) by training 20 binary classifiers.
2. Extended Ballroom [28]: a multi-class dance genre classification dataset (4k samples). We follow the same setup as [30], removing 4 categories due to dataset imbalance and using the remaining 9 categories.
3. FMA Small [29]: a multi-class music genre classification dataset (8k samples) with 8 genre categories.
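Macro F1 for these datasets could be computed along the following lines with scikit-learn; the OpenMIC handling (one binary F1 per instrument, restricted to clips with strong positive or negative labels, then averaged) is a plausible reading of the baseline setup rather than its exact code.

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_multiclass(y_true, y_pred):
    """Extended Ballroom / FMA: one label per clip, F1 averaged over classes."""
    return f1_score(y_true, y_pred, average="macro")

def macro_f1_openmic(y_true, y_pred, mask):
    """OpenMIC-style multi-label: binary F1 per instrument, evaluated only on clips
    with a confident (strong positive/negative) label for that instrument, then averaged.
    y_true, y_pred, mask: boolean/0-1 arrays of shape (n_clips, n_instruments)."""
    scores = []
    for k in range(y_true.shape[1]):
        keep = mask[:, k].astype(bool)          # only evaluate labeled examples for this instrument
        scores.append(f1_score(y_true[keep, k], y_pred[keep, k]))
    return float(np.mean(scores))
```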
4. RESULTS AND DISCUSSIONS
We first show results for the encoder choices and whether pre-training helps; all workers (waveform (W), LPS (L), prosody (P), MFCC (M), Chroma (C), and Tempogram (T), where WLP is also referred to as the baseline) and the frozen scenario are used. We then dive deeper into the effects of the different weighting mechanisms and an ablation study of worker selections, also reported for the frozen scenario. Finally, we investigate whether fine-tuning further improves performance.

Fig. 2. Comparisons of encoder architectures (PASE vs PASE+). Left, center, and right panels show macro F1 on the different downstream tasks in the frozen scenario. Red and green dotted lines represent the PASE and PASE+ encoders with supervised training (scenario 1) directly on the downstream dataset from scratch.

From Figure 2, we observe that PASE+ outperforms PASE on all three downstream tasks. This is not surprising, as PASE+ is a more powerful encoder with ~8M parameters, skip connections, and a QRNN layer, whereas PASE has only ~6M parameters and basic convolutional layers. This is consistent with the findings of the original PASE+ work [17] on speech data. The dotted lines are trained supervisedly (scenario 1) from scratch directly on the downstream tasks with random weight initialization. They show that pre-training generally provides a better initialization of the encoder weights, resulting in better downstream performance. One exception is PASE on OpenMIC; we hypothesize that OpenMIC already contains enough data to train the lower-capacity PASE encoder well from scratch, which is not the case for PASE+. This suggests that pre-training is especially helpful for encoders with larger capacities when evaluating on downstream tasks with limited labeled data. We use PASE+ throughout the remainder of the paper, as it is the better encoder for our tasks.

Fig. 3. Comparisons of equal weighted vs re-weighted pre-training for different worker selections on all downstream tasks. The PASE+ encoder is used in the frozen scenario. The y-axis is the macro F1 classification metric. The x-axis is labeled with WLP (waveform, LPS, and prosody), M (MFCC), C (Chroma), and T (Tempogram). Unfilled and filled colors represent the equal weighted and re-weighted mechanisms, respectively. Across all trials, circles represent the mean and bar lengths the standard deviation.

In Figure 3, we compare the equal weighted and re-weighted mechanisms for different worker selections during pre-training. The re-weighted mechanism (filled color) generally helps to boost the influence of the various workers on downstream performance. For Extended Ballroom (right) in particular, results with worker combinations containing Tempogram improve by a large margin.
Fig. 4. Log loss per worker for the first 20 epochs. The x-axis is the number of epochs. Left: equal weighted. Right: re-weighted, where the loss weights are balanced using the reciprocal of the mean per-worker losses from equal-weighted pre-training.

We further examine the losses per worker during pre-training, as shown in Figure 4. With equal weighting (left), LPS (L) almost dominates the total loss while the Tempogram (T) worker contributes the least, two orders of magnitude smaller; with re-weighting (right), each worker contributes more equally.
Figure 5 shows the relative difference in accuracy when including different workers over the WLP baseline. We observe that different worker selections affect the downstream tasks differently. Tempogram helps the most across all combinations, especially for Extended Ballroom. MFCC is important for most of the downstream tasks, as it captures low-level attributes that differentiate instruments and genres. Chroma, however, is at a disadvantage, especially for OpenMIC, since Chroma is designed to normalize away timbre, which is important for instrumentation. MFCC hurts only slightly on Extended Ballroom, as it brings together different dance genres with similar timbre and separates music of the same dance genre when the timbre changes.
Fig. 5. Relative improvement (%) from including different additional music-specific workers during pre-training, compared to WLP, on the different downstream tasks.

These variations can be further compensated, yielding improvement across all tasks, by using all workers together, as shown in the right-most bars of each subplot in Figure 5. Adding all workers yields relative improvements over the WLP baseline of 1.9%, 4.5%, and 14% on the OpenMIC, FMA, and Extended Ballroom datasets, respectively. This indicates that the workers complement each other, and that the encoders are able to use signals from diversified workers to generalize better to various downstream tasks.
Fig. 6. Confusion matrices for Extended Ballroom. Left: the WLP baseline. Right: the differences between WLP+T and WLP, and between WLP+MCT and WLP+T. Red and blue indicate positive and negative changes, respectively.
Fig. 7. Confusion matrices for FMA. Left: the WLP baseline. Right: the differences between WLP+M and WLP, between WLP+T and WLP+M, and between WLP+MT and WLP+T. Red and blue indicate positive and negative changes, respectively.

We then show the confusion matrices of Extended Ballroom and FMA in Figures 6 and 7. In Figure 6, looking at the difference between WLP+T and WLP, we observe that adding Tempogram helps differentiate Chacha from Jive and Samba, which differ in rhythm and tempo, as well as Foxtrot from Quickstep and Viennese Waltz from Waltz, two pairs of dance genres that originate from similar music played at different speeds. Adding MFCC and Chroma further helps differentiate Foxtrot from Rumba and Viennese Waltz, as additional timbre cues are provided.

In Figure 7, we observe that adding MFCC (WLP+M - WLP) helps in general, as hypothesized; however, it confuses Electronic with Hip-Hop and International, and Pop with Hip-Hop and Rock, possibly because similar instruments are used in these genres, resulting in similar timbre. Adding Tempogram (WLP+T - WLP+M) corrects the mistakes on the Electronic and Pop genres, but confuses International with Folk and Instrumental. Finally, adding both workers (WLP+MT - WLP+T) provides further improvements over MFCC or Tempogram alone. In general, we observe improvements, with positive values (red) on the diagonal and negative values (blue) off the diagonal.

Fig. 8. Comparisons of frozen and fine-tuned on ...
5. CONCLUSION
In this paper, we explore different design choices for pre-training music encoders with multi-task and self-supervised learning techniques, and show that this method, combined with different encoder architectures, generally benefits downstream tasks. The improvement is clearer and more stable when ...

6. REFERENCES

[1] Keunwoo Choi, George Fazekas, and Mark Sandler, "Automatic tagging using deep convolutional neural networks," ISMIR 2016, 2016.
[2] Jaehun Kim, Julián Urbano, Cynthia C. S. Liem, and Alan Hanjalic, "One deep music representation to rule them all? A comparative analysis of different representation learning strategies," Neural Computing and Applications, vol. 32, no. 4, pp. 1067–1093, 2020.
[3] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho, "Transfer learning for music classification and regression tasks," in ISMIR 2017. International Society for Music Information Retrieval, 2017, pp. 141–149.
[4] Jan Wülfing and Martin A. Riedmiller, "Unsupervised learning of local features for music classification," in ISMIR, 2012, pp. 139–144.
[5] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862, 2019.
[6] Alexei Baevski, Steffen Schneider, and Michael Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," arXiv preprint arXiv:1910.05453, 2019.
[7] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[8] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems, 2019, pp. 14910–14921.
[9] Jason Cramer, Ho-Hsiang Wu, Justin Salamon, and Juan Pablo Bello, "Look, listen, and learn more: Design choices for deep audio embeddings," in ICASSP 2019. IEEE, 2019, pp. 3852–3856.
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, "A simple framework for contrastive learning of visual representations," arXiv preprint arXiv:2002.05709, 2020.
[11] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton, "Big self-supervised models are strong semi-supervised learners," arXiv preprint arXiv:2006.10029, 2020.
[12] Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi, and Mihajlo Velimirović, "SPICE: Self-supervised pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118–1128, 2020.
[13] Yun-Ning Hung, Yi-An Chen, and Yi-Hsuan Yang, "Multitask learning for frame-level instrument recognition," in ICASSP 2019. IEEE, 2019, pp. 381–385.
[14] Alex Kendall, Yarin Gal, and Roberto Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
[15] Ting Gong, Tyler Lee, Cory Stephenson, Venkata Renduchintala, Suchismita Padhy, Anthony Ndirango, Gokce Keskin, and Oguz H. Elibol, "A comparison of loss weighting strategies for multi task learning in deep neural networks," IEEE Access, vol. 7, pp. 141627–141632, 2019.
[16] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," INTERSPEECH, 2019.
[17] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio, "Multi-task self-supervised learning for robust speech recognition," in ICASSP 2020. IEEE, 2020, pp. 6989–6993.
[18] Franz De Leon and Kirk Martinez, "Enhancing timbre model using MFCC and its time derivatives for music similarity estimation," IEEE, 2012, pp. 2005–2009.
[19] Daniel P. W. Ellis, "Classifying music audio with timbral and chroma features," 2007.
[20] Peter Grosche, Meinard Müller, and Frank Kurth, "Cyclic tempogram—a mid-level tempo representation for music signals," IEEE, 2010, pp. 5522–5525.
[21] Mirco Ravanelli and Yoshua Bengio, "Speaker recognition from raw waveform with SincNet," IEEE, 2018, pp. 1021–1028.
[22] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," Proc. of ICML, 2015.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[24] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher, "Quasi-recurrent neural networks," Proc. of ICLR, 2017.
[25] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[26] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in ICASSP 2017. IEEE, 2017, pp. 776–780.
[27] Eric Humphrey, Simon Durand, and Brian McFee, "OpenMIC-2018: An open data-set for multiple instrument recognition," in ISMIR, 2018.
[28] Ugo Marchand and Geoffroy Peeters, "The extended ballroom dataset," ISMIR Late-breaking Session, 2016.
[29] Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson, "FMA: A dataset for music analysis," 2017.
[30] Yeonwoo Jeong, Keunwoo Choi, and Hosan Jeong, "DLR: Toward a deep learned rhythmic representation for music content analysis," arXiv preprint arXiv:1712.05119.