MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers
Yilun Zhao, Xinda Wu, Yuqing Ye, Jia Guo, Kejun Zhang*
Zhejiang University, Hangzhou, PR China
{zhaoyilun, wuxinda, jiuzhou, zhangkejun}@zju.edu.cn

Abstract.
Music annotation has always been one of the critical topics in the field of Music Information Retrieval (MIR). Traditional models use supervised learning for music annotation tasks. However, as supervised machine learning approaches increase in complexity, the growing need for annotated training data can often not be matched with available data. Moreover, over-reliance on labeled data when training supervised learning models can lead to unexpected results and open vulnerabilities for adversarial attacks. In this paper, a new self-supervised music acoustic representation learning approach named MusiCoder is proposed. Inspired by the success of BERT, MusiCoder builds upon the architecture of self-attention bidirectional transformers. Two pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to adapt BERT-like masked reconstruction pre-training to the continuous acoustic frame domain. The performance of MusiCoder is evaluated on two downstream music annotation tasks. The results show that MusiCoder outperforms the state-of-the-art models in both music genre classification and auto-tagging. The effectiveness of MusiCoder indicates the great potential of a new self-supervised approach to understanding music: first apply masked reconstruction tasks to pre-train a transformer-based model on massive unlabeled music acoustic data, then fine-tune the model on specific downstream tasks with labeled data.
Keywords:
Music Information Retrieval · Self-supervised Representation Learning · Masked Reconstruction · Transformer
The amount of music has been growing rapidly over the past decades. As an effective means of utilizing massive music data, automatically assigning a music clip a set of relevant tags that provide high-level descriptions (such as genre, emotion, or theme) is of great significance to the MIR community, e.g., for music recommendation [5, 28], music emotion recognition [25, 42, 43], and music composition [32]. Some researchers have applied supervised learning models [14, 20, 22, 30] trained on human-annotated music data. However, the performance of supervised learning methods is likely to be limited by the size of the labeled dataset, which is expensive and time-consuming to collect.
Moreover, compared with self-supervised learning, a supervised learning approach is more vulnerable to attacks because of its over-reliance on labeled data [6, 17]. Recently, self-supervised pre-training models [23, 24, 34, 40], especially BERT, have come to dominate the Natural Language Processing (NLP) community. BERT proposes a Masked Language Model (MLM) pre-training objective, which learns a powerful language representation by reconstructing masked input sequences in the pre-training stage. The intuition behind this design is that a model able to recover missing content should have learned a good contextual representation. In particular, BERT and its variants [26, 41, 44] have achieved significant improvements on various NLP benchmark tasks [39]. Compared with the text domain, whose inputs are discrete word tokens, the inputs in the acoustic domain are usually multi-dimensional feature vectors (e.g., energy in multiple frequency bands) for each frame, which are continuous and change smoothly over time. Therefore, particular designs need to be introduced to bridge the gap between discrete text and contiguous acoustic frames. We are the first to apply the idea of masked reconstruction pre-training to the continuous music acoustic domain. In this paper, a new self-supervised pre-training scheme called MusiCoder is proposed, which learns a powerful acoustic music representation by reconstructing masked acoustic frame sequences in the pre-training stage.

Our contributions can be summarized as follows:

1. We present a new self-supervised pre-training model named MusiCoder. MusiCoder builds upon the structure of multi-layer bidirectional self-attention transformers. Rather than relying on massive human-labeled data, MusiCoder can learn a powerful music representation from unlabeled music acoustic data, which is much easier to collect.
2. The reconstruction procedure of BERT-like models is adapted from a classification task to a regression task. In other words, MusiCoder reconstructs continuous acoustic frames directly, which avoids an extra transformation from continuous frames to discrete word tokens before pre-training.
3. Two pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are proposed to pre-train MusiCoder. The ablation study shows that both CFM and CCM objectives effectively improve the performance of MusiCoder pre-training.
4. MusiCoder is evaluated on two downstream tasks: GTZAN music genre classification and MTG-Jamendo music auto-tagging. MusiCoder outperforms the SOTA models in both tasks. The success of MusiCoder indicates the great potential of applying transformer-based masked reconstruction pre-training in the Music Information Retrieval (MIR) field.
In the past few years, pre-training models and self-supervised representation learning have achieved great success in the NLP community. A large number of self-supervised pre-training models based on multi-layer self-attention transformers [37], such as BERT [12], GPT [33], XLNet [41], and Electra [9], have been proposed.
Fig. 1. System overview of the MusiCoder model

Among them, BERT is perhaps the most classic and popular one due to its simplicity and outstanding performance. Specifically, BERT is designed to reconstruct masked input sequences in the pre-training stage. Through reconstructing the missing content from a given masked sequence, the model can learn a powerful contextual representation. More recently, the success of BERT in the NLP community has drawn the attention of researchers in the acoustic signal processing field. Some pioneering works [2, 23, 24, 34, 40] have shown the effectiveness of adapting BERT to Automatic Speech Recognition (ASR) research. Specifically, they design specific pre-training objectives to bridge the gap between discrete text and contiguous
acoustic frames. In vq-wav2vec [2], input speech audio is first discretized into a K-way quantized embedding space by learning discrete representations from audio samples. However, the quantization process requires massive computing resources and works against the continuous nature of acoustic frames. Some works [7, 23, 24, 34, 40] design a modified version of BERT to utilize continuous speech directly. In [7, 23, 24], continuous frame-level masked reconstruction is adopted in a BERT-like pre-training stage. In [40], SpecAugment [29] is applied to mask input frames. And [34] learns by reconstructing from shuffled acoustic frame orders rather than masked frames.

As for the MIR community, representation learning has been popular for many years. Several convolutional neural network (CNN) based supervised methods [8, 14, 20, 22, 30] have been proposed for music understanding tasks. They usually employ convolutional layers of varying depth on Mel-spectrogram based representations or raw waveform signals to learn an effective music representation, and append fully connected layers to predict relevant annotations such as music genres and tags. However, training such CNN-based models usually requires massive human-annotated data. In [6, 17], researchers show that, compared with supervised learning methods, using self-supervision on unlabeled data can significantly improve the robustness of the model. Recently, the self-attention transformer has shown promising results in the symbolic music generation area. For example, Music Transformer [18] and Pop Music Transformer [19] employ relative attention to capture long-term structure from music MIDI data, which can be used directly as discrete word tokens. However, compared with raw music audio, the size of existing MIDI datasets is limited. Moreover, transcription from raw audio to MIDI files is time-consuming and not accurate. In this paper, we propose MusiCoder, a universal music-acoustic encoder based on transformers. Specifically, MusiCoder is first pre-trained on massive unlabeled music acoustic data, and then fine-tuned on specific downstream music annotation tasks using labeled data.
A universal transformer-based encoder named MusiCoder is presented for music acoustic representation learning. The system overview of the proposed MusiCoder is shown in Fig. 1.
For each input frame t_i, its vector representation x_i is obtained by first projecting t_i linearly to the hidden dimension H_dim, and then adding a sinusoidal positional encoding [37] defined as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / H_dim)),  PE(pos, 2i+1) = cos(pos / 10000^(2i / H_dim))   (1)

The positional encoding injects information about the relative position of acoustic frames, making the transformer encoder aware of the music sequence order.

A multi-layer bidirectional self-attention transformer encoder [37] is used to encode the input music acoustic frames. Specifically, an L-layer transformer encodes the input vectors X = {x_i}_{i=1}^{N} as:

H^l = Transformer_l(H^(l-1))   (2)

where l ∈ [1, L], H^0 = X, and H^L = [h^L_1, ..., h^L_N]. We use the hidden vector h^L_i as the contextualized representation of the input token t_i. The architecture of the transformer encoder is shown in Fig. 1.

The main idea of masked reconstruction pre-training is to perturb the inputs by randomly masking tokens with some probability and to reconstruct the masked tokens at the output. In the pre-training process, a reconstruction module, which consists of two layers of feed-forward network with GeLU activation [16] and layer normalization [1], is appended to predict the masked inputs. The module takes the output of the last MusiCoder encoder layer as its input. Moreover, two new pre-training objectives are presented to help MusiCoder learn an acoustic music representation.
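To make these components concrete, the following is a minimal PyTorch sketch of the sinusoidal positional encoding of Eq. (1) and of the reconstruction head described above. The use of PyTorch, the layer sizes, the exact ordering of GeLU and layer normalization, and the name ReconstructionHead are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


def sinusoidal_positional_encoding(max_len: int, h_dim: int) -> torch.Tensor:
    """Positional encoding of Eq. (1): sin on even dimensions, cos on odd (h_dim assumed even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    exponent = torch.arange(0, h_dim, 2, dtype=torch.float32) / h_dim    # 2i / H_dim
    div_term = 10000.0 ** exponent                                       # (h_dim / 2,)
    pe = torch.zeros(max_len, h_dim)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe


class ReconstructionHead(nn.Module):
    """Two feed-forward layers with GeLU and layer normalization mapping the last
    encoder layer's output back to the acoustic feature space (sizes are assumed)."""

    def __init__(self, h_dim: int = 768, feature_dim: int = 324):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(h_dim, h_dim),
            nn.GELU(),
            nn.LayerNorm(h_dim),
            nn.Linear(h_dim, feature_dim),
        )

    def forward(self, encoder_output: torch.Tensor) -> torch.Tensor:
        return self.net(encoder_output)
```

In the model, the positional encoding is added to the linearly projected frames before the first transformer layer, and the reconstruction head is applied to the final hidden states H^L during pre-training.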
Objective 1: Contiguous Frames Masking (CFM).
To prevent the model from exploiting the local smoothness of acoustic frames, rather than masking a single span with a fixed number of consecutive frames [24], we dynamically mask several spans of consecutive frames. Given a sequence of input frames X = (x_1, x_2, ..., x_n), we select a subset Y ⊂ X by iteratively sampling contiguous spans of input frames until the masking budget (e.g., 15% of X) has been spent. At each iteration, the span length is first sampled from a geometric distribution ℓ ~ Geo(p), and the starting point of the masked span is then selected at random. We set p = 0.2, ℓ_min = 2, and ℓ_max = 7. The corresponding mean span length is around 3.87 frames (≈ 90 ms). Within each masked span, the frames are masked according to the following policy: 1) replace all frames with zeros in 70% of cases; since each dimension of the input frames is normalized to zero mean, setting the masked values to zero is equivalent to setting them to the mean value; 2) replace all frames with a random masking frame in 20% of cases; 3) keep the original frames unchanged in the remaining 10% of cases. Since MusiCoder only receives unmasked acoustic frames at inference time, policy 3) allows the model to see real inputs during pre-training and resolves the pretrain-finetune inconsistency problem [12].
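The CFM procedure can be summarized with the NumPy sketch below. Only the clipped geometric span lengths and the 70/20/10 replacement policy come from the description above; the helper names and the exact way spans are accumulated against the 15% budget are assumptions made for illustration.

```python
import numpy as np


def sample_cfm_spans(num_frames, mask_budget=0.15, p=0.2, span_min=2, span_max=7, rng=None):
    """Iteratively sample (start, length) spans until roughly mask_budget of the
    frames are covered; span lengths follow Geo(p) clipped to [span_min, span_max]."""
    rng = rng or np.random.default_rng()
    spans, covered = [], np.zeros(num_frames, dtype=bool)
    while covered.sum() < mask_budget * num_frames:
        length = int(np.clip(rng.geometric(p), span_min, span_max))
        start = int(rng.integers(0, num_frames - length + 1))
        spans.append((start, length))
        covered[start:start + length] = True
    return spans


def apply_cfm(frames, spans, rng=None):
    """Apply the masking policy span by span: 70% zeros (the per-dimension mean after
    normalization), 20% a random masking frame, 10% left unchanged."""
    rng = rng or np.random.default_rng()
    out = frames.copy()
    for start, length in spans:
        roll = rng.random()
        if roll < 0.7:
            out[start:start + length] = 0.0
        elif roll < 0.9:
            out[start:start + length] = frames[rng.integers(0, len(frames))]
        # else: keep the original frames (mitigates the pretrain-finetune mismatch)
    return out
```

A typical call would be `apply_cfm(frames, sample_cfm_spans(len(frames)))` on a (num_frames, 324) feature matrix.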
Objective 2: Contiguous Channels Masking (CCM).

The intuition of channel masking is that a model able to predict a partial loss of channel information should have learned a high-level understanding along the channel axis. For the log-mel spectrogram and log-CQT features, a block of consecutive channels is randomly masked to zero for all time steps across the input sequence of frames. Specifically, the number of masked channels, n, is first sampled uniformly from {0, 1, ..., H}. Then a starting channel index is sampled from {0, 1, ..., H − n}, where H is the total number of channels.
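A matching NumPy sketch of CCM is given below. The cap on the block width (max_width) is left as an explicit, assumed parameter, since the sampling range is only partially specified above.

```python
import numpy as np


def apply_ccm(frames, max_width=None, rng=None):
    """Zero out a block of consecutive feature channels for all time steps.
    frames: (num_frames, num_channels); max_width caps the block width (assumed)."""
    rng = rng or np.random.default_rng()
    num_channels = frames.shape[1]
    max_width = num_channels if max_width is None else max_width
    width = int(rng.integers(0, max_width + 1))                  # number of masked channels n
    out = frames.copy()
    if width > 0:
        start = int(rng.integers(0, num_channels - width + 1))   # starting channel index
        out[:, start:start + width] = 0.0
    return out
```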
Pre-training Objective Function.
The Huber loss [15] is used to minimize the reconstruction error between the masked input features and the corresponding encoder output:

Loss(x) = 0.5 · x^2 if |x| < 1,  |x| − 0.5 otherwise   (3)

Huber loss is a robust L1 loss that is less sensitive to outliers. In our preliminary experiments, we found that, compared with the L1 loss used in [24], Huber loss makes the training process converge more easily.

We primarily report experimental results on two models: MusiCoderBase and MusiCoderLarge. The model settings are listed in Table 1. The number of transformer block layers, the size of the hidden vectors, and the number of self-attention heads are denoted L_num, H_dim, and A_num, respectively.

Table 1. The proposed model settings (L_num, H_dim, A_num)

Table 2. Statistics on the datasets used for pre-training and downstream tasks
For the MTG-Jamendo dataset, music clips used in the auto-tagging task were removed from the pre-training data.
As shown in Table 2, the pre-training data were aggregated from three datasets: Music4All [13], FMA-Large [11], and MTG-Jamendo [4]. Both the Music4All and FMA-Large datasets provide 30-second audio clips in .mp3 format for each song. The MTG-Jamendo dataset contains 55.7K music tracks, each with a duration of more than 30 s. Since the maximum number of time steps of MusiCoder is set to 1600, music tracks exceeding 35 s were cropped into several music clips whose durations were randomly picked from 10 s to 35 s.

The GTZAN music genre classification [35] and MTG-Jamendo music auto-tagging [4] tasks were used to evaluate the performance of the fine-tuned MusiCoder. GTZAN consists of 1000 music clips divided into ten genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock). Each genre consists of 100 music clips in .wav format with a duration of 30 s. To avoid seeing any test data in the downstream tasks, music clips appearing in the downstream tasks were filtered out of the pre-training data.
Audio Preprocessing.
The acoustic music analysis library Librosa [27] provides flexible ways to extract features related to the timbre, harmony, and rhythm aspects of music. In our work, Librosa was used to extract the following features from a given music clip: Mel-scaled spectrogram, Constant-Q Transform (CQT), Mel-frequency cepstral coefficients (MFCCs), MFCCs delta, and chromagram, as detailed in Table 3. Each feature was extracted at a sampling rate of 44,100 Hz, with a Hamming window of 2048 samples (≈ 46 ms) and a hop size of 1024 samples (≈ 23 ms). The Mel spectrogram and CQT features were transformed to log amplitude with S' = ln(10 · S + ε), where S and ε denote the feature and an extremely small number, respectively. Then Cepstral Mean and Variance Normalization (CMVN) [31, 38] was applied to the extracted features to minimize the distortion caused by noise contamination. Finally, these normalized features were concatenated into a 324-dimensional feature vector, which was later used as the input to MusiCoder.

Table 3. Acoustic features of music extracted by Librosa

Feature                  Characteristic     Dimension
Chromagram               Melody, Harmony    12
MFCCs                    Pitch              20
MFCCs delta              Pitch              20
Mel-scaled Spectrogram   Raw Waveform       128
Constant-Q Transform     Raw Waveform       144
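A sketch of this feature extraction pipeline with Librosa is shown below. The CQT bins_per_octave value, the ε constant, and the utterance-level CMVN variant are assumptions chosen so that the stacked features match the 324 dimensions of Table 3.

```python
import numpy as np
import librosa

SR, N_FFT, HOP = 44100, 2048, 1024
EPS = 1e-10   # assumed value of the "extremely small number" epsilon


def extract_features(path):
    """Extract and stack the frame-level features of Table 3 into a (num_frames, 324) matrix."""
    y, sr = librosa.load(path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP,
                                         window="hamming", n_mels=128)
    cqt = np.abs(librosa.cqt(y=y, sr=sr, hop_length=HOP,
                             n_bins=144, bins_per_octave=24))      # bins_per_octave assumed
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=N_FFT, hop_length=HOP)
    mfcc_delta = librosa.feature.delta(mfcc)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP)

    # Log-amplitude compression for the spectrogram-like features: S' = ln(10 * S + eps)
    mel, cqt = np.log(10 * mel + EPS), np.log(10 * cqt + EPS)

    # Stack to (num_frames, 324) and apply per-dimension mean/variance normalization (CMVN)
    n = min(f.shape[1] for f in (chroma, mfcc, mfcc_delta, mel, cqt))
    feats = np.concatenate([f[:, :n] for f in (chroma, mfcc, mfcc_delta, mel, cqt)], axis=0).T
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
    return feats
```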
All our experiments were conducted on 5 GTX 2080Ti GPUs and can be reproduced on any machine with more than 48 GB of GPU memory. In the pre-training stage, MusiCoderBase and MusiCoderLarge were trained with a batch size of 64 for 200k and 500k steps, respectively. We applied the Adam optimizer [21] with β_1 = 0.9, β_2 = 0.999, and a small ε. The learning rate was varied with a warmup schedule [37] according to the formula:

lrate = H_dim^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))   (4)

where warmup_steps was set to 8000; a code sketch of this schedule is given after Table 4. Moreover, the Apex library was used to accelerate the training process and save GPU memory.

For the downstream tasks, we performed an exhaustive search over the following sets of parameters. The model that performed best on the validation set was selected. All other training parameters remained the same as in the pre-training stage:
Table 4. Parameter settings for downstream tasks

Parameter       Candidate Values
Batch size      16, 24, 32
Learning rate   2e-5, 3e-5, 5e-5
Epochs          2, 3, 4
Dropout rate    0.05, 0.1
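As a reference for the warmup schedule of Eq. (4), the following is a minimal sketch. The hidden size used here and the LambdaLR wrapping are illustrative assumptions (with the multiplier form, the optimizer's base learning rate is assumed to be 1.0).

```python
def transformer_lr(step: int, h_dim: int = 768, warmup_steps: int = 8000) -> float:
    """Eq. (4): linear warmup followed by inverse square-root decay."""
    step = max(step, 1)   # avoid 0 ** -0.5 on the first call
    return (h_dim ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)


# Example (sketch): with base lr = 1.0, the multiplier equals the schedule itself.
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)
```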
Table 5. Results of the GTZAN music genre classification task

Model                             Accuracy
Hand-crafted features + SVM [3]   87.9%
CNN + SVM [8]                     89.8%
CNN + MLP based ensemble [14]     94.2%
MusiCoderBase                     94.2%
MusiCoderLarge                    94.3%
Theoretical maximum score [35]    94.5%
Since the GTZAN dataset contains only 1000 music clips, the experiments were conducted in a ten-fold cross-validation setup. For each fold, 80 and 20 songs of each genre were randomly selected and placed into the training and validation splits, respectively. The ten-fold average accuracy is shown in Table 5. In previous work, [3] applied low-level music features and rich statistics to predict music genres. In [8], researchers first used a CNN-based model trained on music auto-tagging tasks to extract features; these features were then fed to an SVM [36] for genre classification. In [14], the authors trained two models, a CNN trained on a variety of spectral and rhythmic features and an MLP trained on features extracted from a model for music auto-tagging, and combined them in a majority-voting ensemble, reporting an accuracy of 94.2%. Although some other works report accuracy higher than 94.5%, we take 94.5% as the state-of-the-art accuracy following the analysis in [35], which demonstrates that the inherent noise (e.g., repetitions, mis-labelings, and distortions of the songs) in the GTZAN dataset prevents a perfect accuracy score from exceeding 94.5%. In the experiments, MusiCoderBase and MusiCoderLarge achieve accuracies of 94.2% and 94.3%, respectively. The proposed models thus match or outperform the previous state-of-the-art models and approach this ideal value.
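For clarity, the per-genre 80/20 fold construction described above can be sketched as follows; the data layout (a mapping from genre name to its 100 clip paths) is an assumption.

```python
import random


def gtzan_fold(clips_by_genre, seed=0):
    """Build one fold: 80 training and 20 validation clips per genre, chosen at random."""
    rng = random.Random(seed)
    train, valid = [], []
    for genre, clips in clips_by_genre.items():
        shuffled = list(clips)
        rng.shuffle(shuffled)
        train += [(clip, genre) for clip in shuffled[:80]]
        valid += [(clip, genre) for clip in shuffled[80:]]
    return train, valid


# Ten folds with different seeds; the reported accuracy is the average over folds.
# folds = [gtzan_fold(clips_by_genre, seed=k) for k in range(10)]
```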
For the music auto-tagging task, two performance measures, macro ROC-AUC and macro PR-AUC, were applied. ROC-AUC can lead to over-optimistic scores when the data are unbalanced [10]. Since the music tags in the MTG-Jamendo dataset are highly unbalanced [4], the PR-AUC metric was also introduced for evaluation. The MusiCoder model was compared with other state-of-the-art models competing in the MediaEval 2019 challenge "Emotion and Theme Recognition in Music Using Jamendo" [4]. We used the same train-valid-test data splits as the challenge. The results are shown in Table 6. For the VQ-VAE+CNN [20], VGGish [4], CRNN [22], FA-ResNet [22], and Shake-FA-ResNet [22] models, we directly used the evaluation results posted on the competition leaderboard (https://multimediaeval.github.io/2019-Emotion-and-Theme-Recognition-in-Music-Task/results). For SampleCNN [30], we reproduced the work according to the official implementation (https://github.com/tae-jun/sample-cnn). As the results suggest, the proposed MusiCoder model achieves new state-of-the-art results in the music auto-tagging task.

Table 6. Results of the MTG-Jamendo music auto-tagging task

Model                                   ROC-AUC macro   PR-AUC macro
VQ-VAE + CNN [20]                       72.07%          10.76%
VGGish [4]                              72.58%          10.77%
CRNN [22]                               73.80%          11.71%
FA-ResNet [22]                          75.75%          14.63%
SampleCNN (reproduced) [30]             76.93%          14.92%
Shake-FA-ResNet [22]                    77.17%          14.80%
MusiCoderBase w/o pre-training (ours)   77.03%          15.02%
MusiCoderBase with CCM (ours)           81.93%          19.49%
MusiCoderBase with CFM (ours)           81.38%          19.51%
MusiCoderBase with CFM+CCM (ours)       82.57%          20.87%
MusiCoderLarge with CFM+CCM (ours)      83.82%          22.01%
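For reference, the two macro-averaged measures in Table 6 can be computed with scikit-learn roughly as follows; the variable shapes and the number of tags are assumptions, and PR-AUC is approximated here by macro-averaged average precision.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: (num_clips, num_tags) binary ground-truth tag matrix (random placeholder data)
# y_score: (num_clips, num_tags) predicted tag probabilities
y_true = np.random.randint(0, 2, size=(100, 56))
y_score = np.random.rand(100, 56)

roc_auc_macro = roc_auc_score(y_true, y_score, average="macro")
pr_auc_macro = average_precision_score(y_true, y_score, average="macro")
print(f"ROC-AUC macro: {roc_auc_macro:.4f}, PR-AUC macro: {pr_auc_macro:.4f}")
```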
Ablation Study.

An ablation study was conducted to better understand the performance of MusiCoder; the results are also shown in Table 6. According to the experiment, even without pre-training, MusiCoderBase still outperforms most SOTA models, which indicates the effectiveness of the transformer-based architecture. When MusiCoderBase is pre-trained with the CCM or CFM objective alone, a significant improvement over MusiCoderBase without pre-training is observed, and MusiCoderBase with the CCM and CFM objectives combined achieves better results still. The improvement indicates the effectiveness of the pre-training stage, and shows that the designed pre-training objectives CCM and CFM are both key elements that drive pre-trained MusiCoder to learn a powerful music acoustic representation. We also explore the effect of model size on downstream task accuracy: in the experiments, MusiCoderLarge outperforms MusiCoderBase, which suggests that increasing the model size of MusiCoder leads to further improvements.

In this paper, we propose MusiCoder, a universal music-acoustic encoder based on transformers. Rather than relying on massive human-labeled data, which is expensive and time-consuming to collect, MusiCoder can learn a strong music representation from unlabeled music acoustic data. Two new pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to improve the pre-training stage in the continuous acoustic frame domain. The effectiveness of the proposed objectives is evaluated through extensive ablation studies. Moreover, MusiCoder outperforms the state-of-the-art models in music genre classification on the GTZAN dataset and music auto-tagging on the MTG-Jamendo dataset. Our work shows the great potential of adapting the transformer-based masked reconstruction pre-training scheme to the MIR community. Beyond improving the model, we plan to extend MusiCoder to other music understanding tasks (e.g., music emotion recognition, chord estimation, music segmentation). We believe the future prospects for large-scale representation learning from music acoustic data look quite promising.
References
1. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
2. Baevski, A., Schneider, S., Auli, M.: vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453 (2019)
3. Baniya, B.K., Lee, J., Li, Z.N.: Audio feature reduction and analysis for automatic music genre classification. In: 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC). pp. 457–462. IEEE (2014)
4. Bogdanov, D., Won, M., Tovstogan, P., Porter, A., Serra, X.: The MTG-Jamendo dataset for automatic music tagging. In: Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019). Long Beach, CA, United States (2019), http://hdl.handle.net/10230/42015
5. Bu, J., Tan, S., Chen, C., Wang, C., Wu, H., Zhang, L., He, X.: Music recommendation by unified hypergraph: combining social media information and music content. In: Proceedings of the 18th ACM International Conference on Multimedia. pp. 391–400 (2010)
6. Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J.C., Liang, P.S.: Unlabeled data improves adversarial robustness. In: Advances in Neural Information Processing Systems. pp. 11192–11203 (2019)
7. Chi, P.H., Chung, P.H., Wu, T.H., Hsieh, C.C., Li, S.W., Lee, H.y.: Audio ALBERT: A lite BERT for self-supervised learning of audio representation. arXiv preprint arXiv:2005.08575 (2020)
8. Choi, K., Fazekas, G., Sandler, M., Cho, K.: Transfer learning for music classification and regression tasks. arXiv preprint arXiv:1703.09179 (2017)
9. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020)
10. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 233–240 (2006)
11. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X.: FMA: A dataset for music analysis. arXiv preprint arXiv:1612.01840 (2016)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
13. Domingues, M., Pegoraro Santana, I., Pinhelli, F., Donini, J., Catharin, L., Mangolin, R., Costa, Y., Feltrim, V.D.: Music4All: A new music database and its applications (07 2020). https://doi.org/10.1109/IWSSIP48289.2020.9145170
14. Ghosal, D., Kolekar, M.H.: Music genre recognition using deep neural networks and transfer learning. In: Interspeech. pp. 2087–2091 (2018)
15. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1440–1448 (2015)
16. Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016)
17. Hendrycks, D., Mazeika, M., Kadavath, S., Song, D.: Using self-supervised learning can improve model robustness and uncertainty. In: Advances in Neural Information Processing Systems. pp. 15663–15674 (2019)
18. Huang, C.Z.A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., Dai, A.M., Hoffman, M.D., Dinculescu, M., Eck, D.: Music Transformer: Generating music with long-term structure. In: International Conference on Learning Representations (2018)
19. Huang, Y.S., Yang, Y.H.: Pop Music Transformer: Generating music with rhythm and harmony. arXiv preprint arXiv:2002.00212 (2020)
20. Hung, H.T., Chen, Y.H., Mayerl, M., Zangerle, M.V.E., Yang, Y.H.: MediaEval 2019 emotion and theme recognition task: A VQ-VAE based approach
21. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
22. Koutini, K., Chowdhury, S., Haunschmid, V., Eghbal-zadeh, H., Widmer, G.: Emotion and theme recognition in music with frequency-aware RF-regularized CNNs. arXiv preprint arXiv:1911.05833 (2019)
23. Ling, S., Liu, Y., Salazar, J., Kirchhoff, K.: Deep contextualized acoustic representations for semi-supervised speech recognition. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6429–6433. IEEE (2020)
24. Liu, A.T., Yang, S.w., Chi, P.H., Hsu, P.c., Lee, H.y.: Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6419–6423. IEEE (2020)
25. Liu, Y., Liu, Y., Zhang, X., Chen, G., Zhang, K.: Learning music emotion primitives via supervised dynamic clustering. In: Proceedings of the 24th ACM International Conference on Multimedia. pp. 222–226 (2016)
26. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
27. McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., Nieto, O.: librosa: Audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference. vol. 8, pp. 18–25 (2015)
28. Van den Oord, A., Dieleman, S., Schrauwen, B.: Deep content-based music recommendation. In: Advances in Neural Information Processing Systems. pp. 2643–2651 (2013)
29. Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019)
30. Pons, J., Nieto, O., Prockup, M., Schmidt, E., Ehmann, A., Serra, X.: End-to-end learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520 (2017)
31. Pujol, P., Macho, D., Nadeu, C.: On real-time mean-and-variance normalization of speech recognition features. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing. vol. 1, pp. I–I. IEEE (2006)
32. Qiu, Z., Ren, Y., Li, C., Liu, H., Huang, Y., Yang, Y., Wu, S., Zheng, H., Ji, J., Yu, J., et al.: Mind Band: A crossmedia AI music composing platform. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 2231–2233 (2019)
33. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training. Technical report, OpenAI (2018), https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
34. Song, X., Wang, G., Wu, Z., Huang, Y., Su, D., Yu, D., Meng, H.: Speech-XLNet: Unsupervised acoustic model pretraining for self-attention networks. arXiv preprint arXiv:1910.10387 (2019)
35. Sturm, B.L.: The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use. arXiv preprint arXiv:1306.1461 (2013)
36. Suykens, J.A., Vandewalle, J.: Least squares support vector machine classifiers. Neural Processing Letters 9(3), 293–300 (1999)
37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)
38. Viikki, O., Laurila, K.: Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Communication 25(1-3), 133–147 (1998)
39. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018)
40. Wang, W., Tang, Q., Livescu, K.: Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6889–6893. IEEE (2020)
41. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems. pp. 5753–5763 (2019)
42. Zhang, K., Sun, S.: Web music emotion recognition based on higher effective gene expression programming. Neurocomputing 105