MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers
Yilun Zhao, Xinda Wu, Yuqing Ye, Jia Guo, Kejun Zhang*
Zhejiang University, Hangzhou, PR China
{zhaoyilun, wuxinda, jiuzhou, zhangkejun}@zju.edu.cn

Abstract.
Music annotation has always been one of the critical topics in the field of Music Information Retrieval (MIR). Traditional models use supervised learning for music annotation tasks. However, as supervised machine learning approaches increase in complexity, the growing need for annotated training data can often not be matched with available data. Moreover, over-reliance on labeled data when training supervised learning models can lead to unexpected results and open vulnerabilities for adversarial attacks. In this paper, a new self-supervised music acoustic representation learning approach named MusiCoder is proposed. Inspired by the success of BERT, MusiCoder builds upon the architecture of self-attention bidirectional transformers. Two pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to adapt BERT-like masked reconstruction pre-training to the continuous acoustic frame domain. The performance of MusiCoder is evaluated on two downstream music annotation tasks. The results show that MusiCoder outperforms the state-of-the-art models in both music genre classification and auto-tagging. The effectiveness of MusiCoder indicates the great potential of a new self-supervised approach to understanding music: first apply masked reconstruction tasks to pre-train a transformer-based model on massive unlabeled music acoustic data, then fine-tune the model on specific downstream tasks with labeled data.
Keywords:
Music Information Retrieval · Self-supervised Representation Learning · Masked Reconstruction · Transformer
The amount of music has been growing rapidly over the past decades. As an effective means of utilizing massive music data, automatically assigning a music clip a set of relevant tags that provide high-level descriptions (such as genre, emotion, or theme) is of great significance to the MIR community, e.g., for music recommendation [5, 28], music emotion recognition [25, 42, 43], and music composition [32]. Some researchers have applied supervised learning models [14, 20, 22, 30] trained on human-annotated music data. However, the performance of supervised learning methods is likely to be limited by the size of the labeled dataset, which is expensive and time-consuming to collect.
Moreover, compared with self-supervised learning, a supervised learning approach is more vulnerable to attacks because of its over-reliance on labeled data [6, 17]. Recently, self-supervised pre-training models [23, 24, 34, 40], especially BERT, have come to dominate the Natural Language Processing (NLP) community. BERT proposes a Masked Language Model (MLM) pre-training objective, which learns a powerful language representation by reconstructing masked input sequences in the pre-training stage. The intuition behind this design is that a model able to recover missing content should have learned a good contextual representation. In particular, BERT and its variants [26, 41, 44] have achieved significant improvements on various NLP benchmark tasks [39]. Compared with the text domain, whose inputs are discrete word tokens, the inputs in the acoustic domain are usually multi-dimensional feature vectors (e.g., energy in multiple frequency bands) for each frame, which are continuous and change smoothly over time. Therefore, particular designs need to be introduced to bridge the gap between discrete text and contiguous acoustic frames. We are the first to apply the idea of masked reconstruction pre-training to the continuous music acoustic domain. In this paper, a new self-supervised pre-training scheme called MusiCoder is proposed, which learns a powerful acoustic music representation by reconstructing masked acoustic frame sequences in the pre-training stage.

Our contributions can be summarized as follows:

1. We present a new self-supervised pre-training model named MusiCoder. MusiCoder builds upon the structure of multi-layer bidirectional self-attention transformers. Rather than relying on massive human-labeled data, MusiCoder can learn a powerful music representation from unlabeled music acoustic data, which is much easier to collect.
2. The reconstruction procedure of BERT-like models is adapted from a classification task to a regression task. In other words, MusiCoder reconstructs continuous acoustic frames directly, which avoids an extra transformation from continuous frames to discrete word tokens before pre-training.
3. Two pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are proposed to pre-train MusiCoder. The ablation study shows that both CFM and CCM objectives effectively improve the performance of MusiCoder pre-training.
4. MusiCoder is evaluated on two downstream tasks: GTZAN music genre classification and MTG-Jamendo music auto-tagging. MusiCoder outperforms the SOTA models in both tasks. The success of MusiCoder indicates the great potential of applying transformer-based masked reconstruction pre-training in the Music Information Retrieval (MIR) field.
In the past few years, pre-training models and self-supervised representation learning have achieved great success in the NLP community. A large number of self-supervised pre-training models based on multi-layer self-attention transformers [37], such as BERT [12], GPT [33], XLNet [41], and Electra [9], have been proposed.
Fig. 1. System overview of the MusiCoder model

Among them, BERT is perhaps the most classic and popular one due to its simplicity and outstanding performance. Specifically, BERT is designed to reconstruct masked input sequences in the pre-training stage. Through reconstructing the missing content from a given masked sequence, the model can learn a powerful contextual representation. More recently, the success of BERT in the NLP community has drawn the attention of researchers in the acoustic signal processing field. Some pioneering works [2, 23, 24, 34, 40] have shown the effectiveness of adapting BERT to Automatic Speech Recognition (ASR) research. Specifically, they design specific pre-training objectives to bridge the gap between discrete text and contiguous
acoustic frames. In vq-wav2vec [2], input speech audio is first discretized into a K-way quantized embedding space by learning discrete representations from audio samples. However, the quantization process requires massive computing resources and works against the continuous nature of acoustic frames. Some works [7, 23, 24, 34, 40] design a modified version of BERT to utilize continuous speech directly. In [7, 23, 24], continuous frame-level masked reconstruction is adopted in a BERT-like pre-training stage. In [40], SpecAugment [29] is applied to mask input frames. And [34] learns by reconstructing from shuffled acoustic frame orders rather than masked frames.

As for the MIR community, representation learning has been popular for many years. Several convolutional neural network (CNN) based supervised methods [8, 14, 20, 22, 30] have been proposed for music understanding tasks. They usually employ convolutional layers of varying depth on Mel-spectrogram based representations or raw waveform signals to learn an effective music representation, and append fully connected layers to predict relevant annotations such as music genres and tags. However, training such CNN-based models usually requires massive human-annotated data. In [6, 17], researchers show that, compared with supervised learning methods, using self-supervision on unlabeled data can significantly improve the robustness of the model. Recently, the self-attention transformer has shown promising results in the symbolic music generation area. For example, Music Transformer [18] and Pop Music Transformer [19] employ relative attention to capture long-term structure from music MIDI data, which can be used directly as discrete word tokens. However, compared with raw music audio, the size of existing MIDI datasets is limited. Moreover, transcription from raw audio to MIDI files is time-consuming and not accurate. In this paper, we propose MusiCoder, a universal music-acoustic encoder based on transformers. Specifically, MusiCoder is first pre-trained on massive unlabeled music acoustic data, and then fine-tuned on specific downstream music annotation tasks using labeled data.
A universal transformer-based encoder named MusiCoder is presented for music acoustic representation learning. The system overview of the proposed MusiCoder is shown in Fig. 1.
For each input frame t_i, its vector representation x_i is obtained by first projecting t_i linearly to the hidden dimension H_dim, and then adding a sinusoidal positional encoding [37] defined as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / H_dim)),  PE(pos, 2i+1) = cos(pos / 10000^(2i / H_dim))   (1)

The positional encoding injects information about the relative position of acoustic frames, making the transformer encoder aware of the music sequence order.

A multi-layer bidirectional self-attention transformer encoder [37] is used to encode the input music acoustic frames. Specifically, an L-layer transformer encodes the input vectors X = {x_i}_{i=1}^{N} as:

H^l = Transformer_l(H^(l-1))   (2)

where l ∈ [1, L], H^0 = X, and H^L = [h^L_1, ..., h^L_N]. We use the hidden vector h^L_i as the contextualized representation of the input token t_i. The architecture of the transformer encoder is shown in Fig. 1.

The main idea of masked reconstruction pre-training is to perturb the inputs by randomly masking tokens with some probability and to reconstruct the masked tokens at the output. In the pre-training process, a reconstruction module, which consists of two layers of feed-forward network with GeLU activation [16] and layer normalization [1], is appended to predict the masked inputs. The module takes the output of the last MusiCoder encoder layer as its input. Moreover, two new pre-training objectives are presented to help MusiCoder learn an acoustic music representation.
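To make these components concrete, the following is a minimal PyTorch sketch of the sinusoidal positional encoding of Eq. (1) and of the reconstruction head described above. The use of PyTorch, the layer sizes, the exact ordering of GeLU and layer normalization, and the name ReconstructionHead are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


def sinusoidal_positional_encoding(max_len: int, h_dim: int) -> torch.Tensor:
    """Positional encoding of Eq. (1): sin on even dimensions, cos on odd (h_dim assumed even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    exponent = torch.arange(0, h_dim, 2, dtype=torch.float32) / h_dim    # 2i / H_dim
    div_term = 10000.0 ** exponent                                       # (h_dim / 2,)
    pe = torch.zeros(max_len, h_dim)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe


class ReconstructionHead(nn.Module):
    """Two feed-forward layers with GeLU and layer normalization mapping the last
    encoder layer's output back to the acoustic feature space (sizes are assumed)."""

    def __init__(self, h_dim: int = 768, feature_dim: int = 324):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(h_dim, h_dim),
            nn.GELU(),
            nn.LayerNorm(h_dim),
            nn.Linear(h_dim, feature_dim),
        )

    def forward(self, encoder_output: torch.Tensor) -> torch.Tensor:
        return self.net(encoder_output)
```

In the model, the positional encoding is added to the linearly projected frames before the first transformer layer, and the reconstruction head is applied to the final hidden states H^L during pre-training.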
Objective 1: Contiguous Frames Masking (CFM).
To prevent the model from exploiting the local smoothness of acoustic frames, rather than masking a single span with a fixed number of consecutive frames [24], we dynamically mask several spans of consecutive frames. Given a sequence of input frames X = (x_1, x_2, ..., x_n), we select a subset Y ⊂ X by iteratively sampling contiguous spans of input frames until the masking budget (e.g., 15% of X) has been spent. At each iteration, the span length is first sampled from a geometric distribution ℓ ~ Geo(p), and the starting point of the masked span is then selected at random. We set p = 0.2, ℓ_min = 2, and ℓ_max = 7. The corresponding mean span length is around 3.87 frames (≈ 90 ms). Within each masked span, the frames are masked according to the following policy: 1) replace all frames with zeros in 70% of cases; since each dimension of the input frames is normalized to zero mean, setting the masked values to zero is equivalent to setting them to the mean value; 2) replace all frames with a random masking frame in 20% of cases; 3) keep the original frames unchanged in the remaining 10% of cases. Since MusiCoder only receives unmasked acoustic frames at inference time, policy 3) allows the model to see real inputs during pre-training and resolves the pretrain-finetune inconsistency problem [12].
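The CFM procedure can be summarized with the NumPy sketch below. Only the clipped geometric span lengths and the 70/20/10 replacement policy come from the description above; the helper names and the exact way spans are accumulated against the 15% budget are assumptions made for illustration.

```python
import numpy as np


def sample_cfm_spans(num_frames, mask_budget=0.15, p=0.2, span_min=2, span_max=7, rng=None):
    """Iteratively sample (start, length) spans until roughly mask_budget of the
    frames are covered; span lengths follow Geo(p) clipped to [span_min, span_max]."""
    rng = rng or np.random.default_rng()
    spans, covered = [], np.zeros(num_frames, dtype=bool)
    while covered.sum() < mask_budget * num_frames:
        length = int(np.clip(rng.geometric(p), span_min, span_max))
        start = int(rng.integers(0, num_frames - length + 1))
        spans.append((start, length))
        covered[start:start + length] = True
    return spans


def apply_cfm(frames, spans, rng=None):
    """Apply the masking policy span by span: 70% zeros (the per-dimension mean after
    normalization), 20% a random masking frame, 10% left unchanged."""
    rng = rng or np.random.default_rng()
    out = frames.copy()
    for start, length in spans:
        roll = rng.random()
        if roll < 0.7:
            out[start:start + length] = 0.0
        elif roll < 0.9:
            out[start:start + length] = frames[rng.integers(0, len(frames))]
        # else: keep the original frames (mitigates the pretrain-finetune mismatch)
    return out
```

A typical call would be `apply_cfm(frames, sample_cfm_spans(len(frames)))` on a (num_frames, 324) feature matrix.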
Objective 2: Contiguous Channels Masking (CCM).

The intuition of channel masking is that a model able to predict a partial loss of channel information should have learned a high-level understanding along the channel axis. For the log-mel spectrogram and log-CQT features, a block of consecutive channels is randomly masked to zero for all time steps across the input sequence of frames. Specifically, the number of masked channels, n, is first sampled uniformly from {0, 1, ..., H}. Then a starting channel index is sampled from {0, 1, ..., H − n}, where H is the total number of channels.
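A matching NumPy sketch of CCM is given below. The cap on the block width (max_width) is left as an explicit, assumed parameter, since the sampling range is only partially specified above.

```python
import numpy as np


def apply_ccm(frames, max_width=None, rng=None):
    """Zero out a block of consecutive feature channels for all time steps.
    frames: (num_frames, num_channels); max_width caps the block width (assumed)."""
    rng = rng or np.random.default_rng()
    num_channels = frames.shape[1]
    max_width = num_channels if max_width is None else max_width
    width = int(rng.integers(0, max_width + 1))                  # number of masked channels n
    out = frames.copy()
    if width > 0:
        start = int(rng.integers(0, num_channels - width + 1))   # starting channel index
        out[:, start:start + width] = 0.0
    return out
```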
Pre-training Objective Function.
The Huber loss [15] is used to minimize the reconstruction error between the masked input features and the corresponding encoder output:

Loss(x) = 0.5 · x^2 if |x| < 1,  |x| − 0.5 otherwise   (3)

Huber loss is a robust L1 loss that is less sensitive to outliers. In our preliminary experiments, we found that, compared with the L1 loss used in [24], Huber loss makes the training process converge more easily.

We primarily report experimental results on two models: MusiCoderBase and MusiCoderLarge. The model settings are listed in Table 1. The number of transformer block layers, the size of the hidden vectors, and the number of self-attention heads are denoted L_num, H_dim, and A_num, respectively.

Table 1. The proposed model settings (L_num, H_dim, A_num)

Table 2. Statistics on the datasets used for pre-training and downstream tasks
For the MTG-Jamendo dataset, music clips used in the auto-tagging task were removed from the pre-training data.
As shown in Table 2, the pre-training data were aggregated from three datasets: Music4All [13], FMA-Large [11], and MTG-Jamendo [4]. Both the Music4All and FMA-Large datasets provide 30-second audio clips in .mp3 format for each song. The MTG-Jamendo dataset contains 55.7K music tracks, each with a duration of more than 30 s. Since the maximum number of time steps of MusiCoder is set to 1600, music tracks exceeding 35 s were cropped into several music clips whose durations were randomly picked from 10 s to 35 s.

The GTZAN music genre classification [35] and MTG-Jamendo music auto-tagging [4] tasks were used to evaluate the performance of the fine-tuned MusiCoder. GTZAN consists of 1000 music clips divided into ten genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock). Each genre consists of 100 music clips in .wav format with a duration of 30 s. To avoid seeing any test data in the downstream tasks, music clips appearing in the downstream tasks were filtered out of the pre-training data.
Audio Preprocessing.
The acoustic music analysis library Librosa [27] provides flexible ways to extract features related to the timbre, harmony, and rhythm aspects of music. In our work, Librosa was used to extract the following features from a given music clip: Mel-scaled spectrogram, Constant-Q Transform (CQT), Mel-frequency cepstral coefficients (MFCCs), MFCCs delta, and chromagram, as detailed in Table 3. Each feature was extracted at a sampling rate of 44,100 Hz, with a Hamming window of 2048 samples (≈ 46 ms) and a hop size of 1024 samples (≈ 23 ms). The Mel spectrogram and CQT features were transformed to log amplitude with S' = ln(10 · S + ε), where S and ε denote the feature and an extremely small number, respectively. Then Cepstral Mean and Variance Normalization (CMVN) [31, 38] was applied to the extracted features to minimize the distortion caused by noise contamination. Finally, these normalized features were concatenated into a 324-dimensional feature vector, which was later used as the input to MusiCoder.

Table 3. Acoustic features of music extracted by Librosa

Feature                  Characteristic     Dimension
Chromagram               Melody, Harmony    12
MFCCs                    Pitch              20
MFCCs delta              Pitch              20
Mel-scaled Spectrogram   Raw Waveform       128
Constant-Q Transform     Raw Waveform       144
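A sketch of this feature extraction pipeline with Librosa is shown below. The CQT bins_per_octave value, the ε constant, and the utterance-level CMVN variant are assumptions chosen so that the stacked features match the 324 dimensions of Table 3.

```python
import numpy as np
import librosa

SR, N_FFT, HOP = 44100, 2048, 1024
EPS = 1e-10   # assumed value of the "extremely small number" epsilon


def extract_features(path):
    """Extract and stack the frame-level features of Table 3 into a (num_frames, 324) matrix."""
    y, sr = librosa.load(path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP,
                                         window="hamming", n_mels=128)
    cqt = np.abs(librosa.cqt(y=y, sr=sr, hop_length=HOP,
                             n_bins=144, bins_per_octave=24))      # bins_per_octave assumed
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=N_FFT, hop_length=HOP)
    mfcc_delta = librosa.feature.delta(mfcc)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP)

    # Log-amplitude compression for the spectrogram-like features: S' = ln(10 * S + eps)
    mel, cqt = np.log(10 * mel + EPS), np.log(10 * cqt + EPS)

    # Stack to (num_frames, 324) and apply per-dimension mean/variance normalization (CMVN)
    n = min(f.shape[1] for f in (chroma, mfcc, mfcc_delta, mel, cqt))
    feats = np.concatenate([f[:, :n] for f in (chroma, mfcc, mfcc_delta, mel, cqt)], axis=0).T
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
    return feats
```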
All our experiments were conducted on 5 GTX 2080Ti GPUs and can be reproduced on any machine with more than 48 GB of GPU memory. In the pre-training stage, MusiCoderBase and MusiCoderLarge were trained with a batch size of 64 for 200k and 500k steps, respectively. We applied the Adam optimizer [21] with β_1 = 0.9, β_2 = 0.999, and a small ε. The learning rate was varied with a warmup schedule [37] according to the formula:

lrate = H_dim^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))   (4)

where warmup_steps was set to 8000; a code sketch of this schedule is given after Table 4. Moreover, the Apex library was used to accelerate the training process and save GPU memory.

For the downstream tasks, we performed an exhaustive search over the following sets of parameters. The model that performed best on the validation set was selected. All other training parameters remained the same as in the pre-training stage:
Table 4. Parameter settings for downstream tasks

Parameter       Candidate Values
Batch size      16, 24, 32
Learning rate   2e-5, 3e-5, 5e-5
Epochs          2, 3, 4
Dropout rate    0.05, 0.1
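As a reference for the warmup schedule of Eq. (4), the following is a minimal sketch. The hidden size used here and the LambdaLR wrapping are illustrative assumptions (with the multiplier form, the optimizer's base learning rate is assumed to be 1.0).

```python
def transformer_lr(step: int, h_dim: int = 768, warmup_steps: int = 8000) -> float:
    """Eq. (4): linear warmup followed by inverse square-root decay."""
    step = max(step, 1)   # avoid 0 ** -0.5 on the first call
    return (h_dim ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)


# Example (sketch): with base lr = 1.0, the multiplier equals the schedule itself.
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)
```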
Table 5. Results of the GTZAN music genre classification task

Model                             Accuracy
Hand-crafted features + SVM [3]   87.9%
CNN + SVM [8]                     89.8%
CNN + MLP based ensemble [14]     94.2%
MusiCoderBase                     94.2%
MusiCoderLarge                    94.3%
Theoretical maximum score [35]    94.5%
Since the GTZAN dataset contains only 1000 music clips, the experiments were conducted in a ten-fold cross-validation setup. For each fold, 80 and 20 songs of each genre were randomly selected and placed into the training and validation splits, respectively. The ten-fold average accuracy is shown in Table 5. In previous work, [3] applied low-level music features and rich statistics to predict music genres. In [8], researchers first used a CNN-based model trained on music auto-tagging tasks to extract features; these features were then fed to an SVM [36] for genre classification. In [14], the authors trained two models, a CNN trained on a variety of spectral and rhythmic features and an MLP trained on features extracted from a model for music auto-tagging, and combined them in a majority-voting ensemble, reporting an accuracy of 94.2%. Although some other works report accuracy higher than 94.5%, we take 94.5% as the state-of-the-art accuracy following the analysis in [35], which demonstrates that the inherent noise (e.g., repetitions, mis-labelings, and distortions of the songs) in the GTZAN dataset prevents a perfect accuracy score from exceeding 94.5%. In the experiments, MusiCoderBase and MusiCoderLarge achieve accuracies of 94.2% and 94.3%, respectively. The proposed models thus match or outperform the previous state-of-the-art models and approach this ideal value.
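For clarity, the per-genre 80/20 fold construction described above can be sketched as follows; the data layout (a mapping from genre name to its 100 clip paths) is an assumption.

```python
import random


def gtzan_fold(clips_by_genre, seed=0):
    """Build one fold: 80 training and 20 validation clips per genre, chosen at random."""
    rng = random.Random(seed)
    train, valid = [], []
    for genre, clips in clips_by_genre.items():
        shuffled = list(clips)
        rng.shuffle(shuffled)
        train += [(clip, genre) for clip in shuffled[:80]]
        valid += [(clip, genre) for clip in shuffled[80:]]
    return train, valid


# Ten folds with different seeds; the reported accuracy is the average over folds.
# folds = [gtzan_fold(clips_by_genre, seed=k) for k in range(10)]
```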
For the music auto-tagging task, two performance measures, macro ROC-AUC and macro PR-AUC, were applied. ROC-AUC can lead to over-optimistic scores when the data are unbalanced [10]. Since the music tags in the MTG-Jamendo dataset are highly unbalanced [4], the PR-AUC metric was also introduced for evaluation. The MusiCoder model was compared with other state-of-the-art models competing in the MediaEval 2019 challenge "Emotion and Theme Recognition in Music Using Jamendo" [4]. We used the same train-valid-test data splits as the challenge. The results are shown in Table 6. For the VQ-VAE+CNN [20], VGGish [4], CRNN [22], FA-ResNet [22], and Shake-FA-ResNet [22] models, we directly used the evaluation results posted on the competition leaderboard (https://multimediaeval.github.io/2019-Emotion-and-Theme-Recognition-in-Music-Task/results). For SampleCNN [30], we reproduced the work according to the official implementation (https://github.com/tae-jun/sample-cnn). As the results suggest, the proposed MusiCoder model achieves new state-of-the-art results in the music auto-tagging task.

Table 6. Results of the MTG-Jamendo music auto-tagging task

Model                                   ROC-AUC macro   PR-AUC macro
VQ-VAE + CNN [20]                       72.07%          10.76%
VGGish [4]                              72.58%          10.77%
CRNN [22]                               73.80%          11.71%
FA-ResNet [22]                          75.75%          14.63%
SampleCNN (reproduced) [30]             76.93%          14.92%
Shake-FA-ResNet [22]                    77.17%          14.80%
MusiCoderBase w/o pre-training (ours)   77.03%          15.02%
MusiCoderBase with CCM (ours)           81.93%          19.49%
MusiCoderBase with CFM (ours)           81.38%          19.51%
MusiCoderBase with CFM+CCM (ours)       82.57%          20.87%
MusiCoderLarge with CFM+CCM (ours)      83.82%          22.01%
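For reference, the two macro-averaged measures in Table 6 can be computed with scikit-learn roughly as follows; the variable shapes and the number of tags are assumptions, and PR-AUC is approximated here by macro-averaged average precision.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: (num_clips, num_tags) binary ground-truth tag matrix (random placeholder data)
# y_score: (num_clips, num_tags) predicted tag probabilities
y_true = np.random.randint(0, 2, size=(100, 56))
y_score = np.random.rand(100, 56)

roc_auc_macro = roc_auc_score(y_true, y_score, average="macro")
pr_auc_macro = average_precision_score(y_true, y_score, average="macro")
print(f"ROC-AUC macro: {roc_auc_macro:.4f}, PR-AUC macro: {pr_auc_macro:.4f}")
```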
Ablation Study.

An ablation study was conducted to better understand the performance of MusiCoder; the results are also shown in Table 6. According to the experiment, even without pre-training, MusiCoderBase still outperforms most SOTA models, which indicates the effectiveness of the transformer-based architecture. When MusiCoderBase is pre-trained with the CCM or CFM objective alone, a significant improvement over MusiCoderBase without pre-training is observed, and MusiCoderBase with the CCM and CFM objectives combined achieves better results still. The improvement indicates the effectiveness of the pre-training stage, and shows that the designed pre-training objectives CCM and CFM are both key elements that drive pre-trained MusiCoder to learn a powerful music acoustic representation. We also explore the effect of model size on downstream task accuracy: in the experiments, MusiCoderLarge outperforms MusiCoderBase, which suggests that increasing the model size of MusiCoder leads to further improvements.

In this paper, we propose MusiCoder, a universal music-acoustic encoder based on transformers. Rather than relying on massive human-labeled data, which is expensive and time-consuming to collect, MusiCoder can learn a strong music representation from unlabeled music acoustic data. Two new pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to improve the pre-training stage in the continuous acoustic frame domain. The effectiveness of the proposed objectives is evaluated through extensive ablation studies. Moreover, MusiCoder outperforms the state-of-the-art models in music genre classification on the GTZAN dataset and music auto-tagging on the MTG-Jamendo dataset. Our work shows the great potential of adapting the transformer-based masked reconstruction pre-training scheme to the MIR community. Beyond improving the model, we plan to extend MusiCoder to other music understanding tasks (e.g., music emotion recognition, chord estimation, music segmentation). We believe the future prospects for large-scale representation learning from music acoustic data look quite promising.
References
1. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
2. Baevski, A., Schneider, S., Auli, M.: vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453 (2019)
3. Baniya, B.K., Lee, J., Li, Z.N.: Audio feature reduction and analysis for automatic music genre classification. In: 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC). pp. 457–462. IEEE (2014)
4. Bogdanov, D., Won, M., Tovstogan, P., Porter, A., Serra, X.: The MTG-Jamendo dataset for automatic music tagging. In: Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019). Long Beach, CA, United States (2019), http://hdl.handle.net/10230/42015
5. Bu, J., Tan, S., Chen, C., Wang, C., Wu, H., Zhang, L., He, X.: Music recommendation by unified hypergraph: combining social media information and music content. In: Proceedings of the 18th ACM International Conference on Multimedia. pp. 391–400 (2010)
6. Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J.C., Liang, P.S.: Unlabeled data improves adversarial robustness. In: Advances in Neural Information Processing Systems. pp. 11192–11203 (2019)
7. Chi, P.H., Chung, P.H., Wu, T.H., Hsieh, C.C., Li, S.W., Lee, H.y.: Audio ALBERT: A lite BERT for self-supervised learning of audio representation. arXiv preprint arXiv:2005.08575 (2020)
8. Choi, K., Fazekas, G., Sandler, M., Cho, K.: Transfer learning for music classification and regression tasks. arXiv preprint arXiv:1703.09179 (2017)
9. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020)
10. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 233–240 (2006)
11. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X.: FMA: A dataset for music analysis. arXiv preprint arXiv:1612.01840 (2016)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
13. Domingues, M., Pegoraro Santana, I., Pinhelli, F., Donini, J., Catharin, L., Mangolin, R., Costa, Y., Feltrim, V.D.: Music4All: A new music database and its applications (07 2020). https://doi.org/10.1109/IWSSIP48289.2020.9145170
14. Ghosal, D., Kolekar, M.H.: Music genre recognition using deep neural networks and transfer learning. In: Interspeech. pp. 2087–2091 (2018)
15. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1440–1448 (2015)
16. Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016)
17. Hendrycks, D., Mazeika, M., Kadavath, S., Song, D.: Using self-supervised learning can improve model robustness and uncertainty. In: Advances in Neural Information Processing Systems. pp. 15663–15674 (2019)
18. Huang, C.Z.A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., Dai, A.M., Hoffman, M.D., Dinculescu, M., Eck, D.: Music Transformer: Generating music with long-term structure. In: International Conference on Learning Representations (2018)
19. Huang, Y.S., Yang, Y.H.: Pop Music Transformer: Generating music with rhythm and harmony. arXiv preprint arXiv:2002.00212 (2020)
20. Hung, H.T., Chen, Y.H., Mayerl, M., Zangerle, M.V.E., Yang, Y.H.: MediaEval 2019 emotion and theme recognition task: A VQ-VAE based approach
21. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
22. Koutini, K., Chowdhury, S., Haunschmid, V., Eghbal-zadeh, H., Widmer, G.: Emotion and theme recognition in music with frequency-aware RF-regularized CNNs. arXiv preprint arXiv:1911.05833 (2019)
23. Ling, S., Liu, Y., Salazar, J., Kirchhoff, K.: Deep contextualized acoustic representations for semi-supervised speech recognition. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6429–6433. IEEE (2020)
24. Liu, A.T., Yang, S.w., Chi, P.H., Hsu, P.c., Lee, H.y.: Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6419–6423. IEEE (2020)
25. Liu, Y., Liu, Y., Zhang, X., Chen, G., Zhang, K.: Learning music emotion primitives via supervised dynamic clustering. In: Proceedings of the 24th ACM International Conference on Multimedia. pp. 222–226 (2016)
26. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
27. McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., Nieto, O.: librosa: Audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference. vol. 8, pp. 18–25 (2015)
28. Van den Oord, A., Dieleman, S., Schrauwen, B.: Deep content-based music recommendation. In: Advances in Neural Information Processing Systems. pp. 2643–2651 (2013)
29. Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019)
30. Pons, J., Nieto, O., Prockup, M., Schmidt, E., Ehmann, A., Serra, X.: End-to-end learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520 (2017)
31. Pujol, P., Macho, D., Nadeu, C.: On real-time mean-and-variance normalization of speech recognition features. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing. vol. 1, pp. I–I. IEEE (2006)
32. Qiu, Z., Ren, Y., Li, C., Liu, H., Huang, Y., Yang, Y., Wu, S., Zheng, H., Ji, J., Yu, J., et al.: Mind Band: A crossmedia AI music composing platform. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 2231–2233 (2019)
33. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training. Technical report, OpenAI (2018), https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
34. Song, X., Wang, G., Wu, Z., Huang, Y., Su, D., Yu, D., Meng, H.: Speech-XLNet: Unsupervised acoustic model pretraining for self-attention networks. arXiv preprint arXiv:1910.10387 (2019)
35. Sturm, B.L.: The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use. arXiv preprint arXiv:1306.1461 (2013)
36. Suykens, J.A., Vandewalle, J.: Least squares support vector machine classifiers. Neural Processing Letters 9(3), 293–300 (1999)
37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)
38. Viikki, O., Laurila, K.: Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Communication 25(1-3), 133–147 (1998)
39. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018)
40. Wang, W., Tang, Q., Livescu, K.: Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6889–6893. IEEE (2020)
41. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems. pp. 5753–5763 (2019)
42. Zhang, K., Sun, S.: Web music emotion recognition based on higher effective gene expression programming. Neurocomputing 105