Investigating the efficacy of music version retrieval systems for setlist identification
Furkan Yesiler⋆, Emilio Molina†, Joan Serrà‡, Emilia Gómez††⋆

⋆ Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
† BMAT Licensing S.L., Barcelona, Spain
‡ Dolby Laboratories, Barcelona, Spain
†† Joint Research Centre, European Commission, Sevilla, Spain
ABSTRACT
The setlist identification (SLI) task addresses a music recognition use case where the goal is to retrieve the metadata and timestamps for all the tracks played in live music events. Due to various musical and non-musical changes in live performances, developing automatic SLI systems is still a challenging task that, despite its industrial relevance, has been under-explored in the academic literature. In this paper, we propose an end-to-end workflow that identifies relevant metadata and timestamps of live music performances using a version identification system. We compare three such systems to investigate their suitability for this particular task. For developing and evaluating SLI systems, we also contribute a new dataset that contains 99.5 h of concerts with annotated metadata and timestamps, along with the corresponding reference set. The dataset is categorized by audio qualities and genres to analyze the performance of SLI systems in different use cases. Our approach can identify 68% of the annotated segments, with values ranging from 35% to 77% based on the genre. Finally, we evaluate our approach against a database of 56.8 k songs to illustrate the effect of expanding the reference set, where we can still identify 56% of the annotated segments.
Index Terms — Setlist identification, version identification, live performance monitoring, music recognition.
1. INTRODUCTION
Music recognition is commonly used to refer to the task of identifying the presence of a known music track in an unknown audio stream, ideally along with its start and end timestamps [1, 2]. While the most common and successful technologies for music recognition are audio fingerprinting systems [3] that can identify recordings with slight degradations (e.g., background noise [4], voiceovers [5], or pitch shifting [6–8]), they tend to perform poorly for live performance monitoring. Live performances can incorporate many alterations from the studio recordings, including changes in tempo, key, structure, background noise, additional applause and banter, and so on. Therefore, identifying live music content typically requires version identification (VI) systems, which are designed to go beyond near-exact duplicate detection and identify recordings that, although having perceptual differences, convey the same musical entity (e.g., live performances or cover songs) [9–13].

The setlist identification (SLI) of live music performances (i.e., full concerts) stands as a challenging branch within music recognition. In the music information retrieval community, SLI was formally defined by Wang et al. [14] and divided into two sequential sub-tasks, where the first task aims to retrieve only the related metadata in the correct order, and the second task concerns further processing of the retrieved items to obtain correct timestamps. The main applications of SLI systems include automatic generation of metadata and timestamps for concerts in streaming platforms (e.g., YouTube) and copyright management for the music industry. The vast variety of music usage contexts in digital platforms makes it impossible to track music usage manually and, therefore, highlights the necessity of automatic systems.

The work by Wang et al. [14], together with a few submissions to MIREX 2015, is, to the best of our knowledge, the only work specifically targeting the SLI task. Although some works in the VI literature address the use case of live performance identification [15, 16], the proposed approaches and evaluation contexts do not consider entire concerts nor retrieving timestamps. Therefore, one cannot consider them as examples for SLI. Wang et al. [14] assume that the artist is known for each concert. For overlapping windows, a VI system is used to return a set of candidates using thumbnails, and the final matches and their boundaries are identified among those candidates using a dynamic time warping algorithm. The system is evaluated on a dataset of 20 concerts from 10 rock bands, using three metrics: edit distance, boundary deviation, and frame accuracy. Although demonstrating a plausible performance, scaling such a system to reference databases of thousands of songs stands as a difficult challenge due to the computational complexities of the algorithms used in each step.

In this paper, we study the efficacy of current VI systems for SLI, considering a range of use cases related to the monitoring of live performances. To mimic a realistic industrial scenario, our approach combines the sub-tasks proposed by Wang et al. [14] into a single task by creating an end-to-end workflow that takes audio signals as input and creates a final document with the retrieved metadata and timestamps. By using predetermined window and hop sizes, we compare the performances of three VI systems that produce overlapping matches, which we further process to create the final results. We develop and evaluate our system using a new dataset of 75 concerts that are categorized by varying audio qualities and genres, and annotated in terms of the songs played in each concert and their timestamps. We study the impact of audio quality, genre, and reference set size (up to 56.8 k songs) on system performance. We report our findings using 4 evaluation metrics following industrial practices. We publish our dataset and evaluation code at https://github.com/furkanyesiler/setlist_id.

This work is supported by the MIP-Frontiers project, the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.
Fig. 1. Overall block diagram of our end-to-end workflow.
2. METHODOLOGY

2.1. System overview
Our workflow consists of several steps for processing a query (i.e., the audio file of a concert) and a reference database to retrieve a list of songs and their respective start and end timestamps (Fig. 1). Firstly, we process the audio queries with a sliding window of size W and a hop size H (windowing is not applied to the reference songs). For each windowed query Q_i, we use a VI system to retrieve the most similar item from the reference database. After obtaining individual matches for each window, we perform a number of post-processing steps to consolidate and revise those matches and form a final list of results. Lastly, we compute several evaluation metrics.

The first step of our system is to extract useful information from music audio signals. For this, we use cremaPCP representations [12, 17], extracted with the pre-trained model shared at https://github.com/bmcfee/crema. Recent works in VI show that this pitch class profile variant generally improves system performance compared to other variants [18]. We use a hop size of 4,096 samples for audio signals sampled at 44.1 kHz.
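As a concrete illustration of this workflow, the sketch below windows a concert's cremaPCP matrix and queries an arbitrary VI system for each window. It is a minimal sketch under stated assumptions: the helper names (window_query, identify_concert, vi_system, reference_db) are illustrative and not part of the released code; only the hop of 4,096 samples at 44.1 kHz comes from the text above.

import numpy as np

# One cremaPCP frame covers 4096 / 44100 ≈ 0.093 s (hop size stated above).
FRAME_SEC = 4096 / 44100

def window_query(query_pcp, window_sec=120, hop_sec=30):
    """Yield (start_sec, end_sec, pcp_window) tuples over the concert recording."""
    win = int(round(window_sec / FRAME_SEC))
    hop = int(round(hop_sec / FRAME_SEC))
    for start in range(0, max(len(query_pcp) - win + 1, 1), hop):
        yield start * FRAME_SEC, (start + win) * FRAME_SEC, query_pcp[start:start + win]

def identify_concert(query_pcp, vi_system, reference_db):
    """Return one (start, end, reference_id, distance) match per query window.

    `vi_system(window, reference)` is any function returning a distance, e.g. one of
    the three systems compared below; `reference_db` maps ids to cremaPCP matrices.
    """
    matches = []
    for start, end, window in window_query(query_pcp):
        dists = {ref_id: vi_system(window, ref) for ref_id, ref in reference_db.items()}
        best_id = min(dists, key=dists.get)
        matches.append((start, end, best_id, dists[best_id]))
    return matches  # consolidated and revised in the post-processing step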
For obtaining pairwise distance values between query windows and references, we compare three VI systems. We consider three candidate values each for W and H (in seconds), and perform initial experiments on our development set to pick the best W and H values for each algorithm based on the total length of correctly-identified segments (see the DLP metric in Sec. 2.6). The considered systems are described below.
Re-MOVE — Re-MOVE [13] is a recent model that is trained with embedding distillation techniques to further improve both the accuracy and the scalability of a state-of-the-art VI system [12]. It encodes each song into an embedding vector of size 256. We transfer the pre-trained weights of the model shared at https://github.com/furkanyesiler/re-move into an equivalent Keras [19] model (no re-training or fine-tuning is performed). As distance between embeddings, we use cosine distance.
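Since Re-MOVE reduces each track to a 256-dimensional embedding, per-window retrieval amounts to a nearest-neighbour search under cosine distance. The following is a hedged NumPy sketch of that step; the function and variable names are illustrative and assume the embeddings have already been computed with the pre-trained model.

import numpy as np

def cosine_retrieve(query_embedding, reference_embeddings, reference_ids):
    """Return (best_reference_id, cosine_distance) for one query window.

    `query_embedding` has shape (256,); `reference_embeddings` has shape (n_refs, 256).
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    r = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    distances = 1.0 - r @ q          # cosine distance to every reference at once
    best = int(np.argmin(distances))
    return reference_ids[best], float(distances[best])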
Qmax — Qmax refers to the VI system proposed by Serrà et al. [9]. The similarity estimation between two songs is performed using a local alignment algorithm, and the length of the longest aligned subsequence, normalized by the length of the reference song, is considered as the distance between the songs. We use the implementation shared in Essentia [20] with the default parameters.

2DFTM — The third system follows the 2D Fourier transform magnitude approach of Bertin-Mahieux and Ellis [10], which summarizes patches of the input pitch class profiles into a fixed-size vector per song.
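For readers unfamiliar with Qmax, the sketch below re-implements the core local alignment recursion from Serrà et al. [9] over a binary cross recurrence plot. It is only illustrative: the experiments in this paper use the Essentia implementation with its default parameters, and the gap penalties below are placeholder values, not the ones used in the experiments.

import numpy as np

def qmax_score(crp, gap_onset=0.5, gap_extension=0.7):
    """Qmax-style local alignment over a binary cross recurrence plot `crp`.

    `crp[i, j] == 1` when query frame i and reference frame j are considered similar
    in chroma space. Returns the length of the best aligned subsequence, which can be
    normalized by the reference length to obtain a distance, as described above.
    """
    n, m = crp.shape
    q = np.zeros((n, m))
    for i in range(2, n):
        for j in range(2, m):
            if crp[i, j]:
                q[i, j] = max(q[i - 1, j - 1], q[i - 2, j - 1], q[i - 1, j - 2]) + 1.0
            else:
                penalty = lambda a, b: gap_onset if crp[a, b] else gap_extension
                q[i, j] = max(0.0,
                              q[i - 1, j - 1] - penalty(i - 1, j - 1),
                              q[i - 2, j - 1] - penalty(i - 2, j - 1),
                              q[i - 1, j - 2] - penalty(i - 1, j - 2))
    return q.max()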
Using the VI systems described above, we compute the distances between each query window Q_i and each item from the reference database. The reference song with the lowest distance to Q_i is considered as its potential match. However, using a windowing scheme can create several potential matches for query segments. To reduce the number of matches to a single match for any given time frame, we perform a series of operations to consolidate and revise the obtained potential matches. Note that, although the VI system can be considered the main component of our entire workflow, this last step is highly important for obtaining a useful final list of results.

Our first step is to merge the consecutive overlapping matches that return the same reference song; the distance value for the merged match is selected as the lowest distance among the respective matches. Next, to avoid overlapping matches that return different reference songs, we obtain all possible overlaps and, for each overlapping segment, keep the reference song that comes from the match with the lowest distance. Finally, we perform another merging step by joining any consecutive matches that return the same reference and have no gap between them (i.e., the matches that may have been split in the previous step; see the sketch below). The final results do not contain overlapping matches for any segment of the query.

Our initial experiments showed that this consolidation and revision step is useful for filtering out many incorrect matches, although it also removes a few correct ones. To further reduce the number of incorrect matches, we train a support vector machine model for a binary classification (correct/incorrect) task, using the scikit-learn library [22] and the distance and duration values of correct and incorrect matches as features. We see that this simple classifier drastically reduces the number of false positives, at the expense of slightly increased false negatives.
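The consolidation and revision steps can be summarized with the following sketch, which operates on (start, end, reference_id, distance) tuples. It is a simplified illustration rather than the exact released implementation; in particular, the overlap-resolution step here truncates the worse of two conflicting matches instead of splitting it into several pieces.

def consolidate(matches):
    """Merge and revise per-window matches of the form (start, end, ref_id, distance)."""
    matches = sorted(matches, key=lambda m: m[0])

    def merge_same_ref(ms):
        # Step 1 (and 3): merge consecutive overlapping matches with the same reference,
        # keeping the lowest distance for the merged match.
        out = []
        for s, e, ref, d in ms:
            if out and out[-1][2] == ref and s <= out[-1][1]:
                ps, pe, _, pd = out[-1]
                out[-1] = (ps, max(pe, e), ref, min(pd, d))
            else:
                out.append((s, e, ref, d))
        return out

    matches = merge_same_ref(matches)

    # Step 2: when overlapping matches disagree, keep the one with the lowest distance
    # for the overlapping region.
    resolved = []
    for s, e, ref, d in matches:
        if resolved and s < resolved[-1][1] and resolved[-1][2] != ref:
            ps, pe, pref, pd = resolved[-1]
            if d < pd:
                if s > ps:
                    resolved[-1] = (ps, s, pref, pd)   # truncate the previous match
                else:
                    resolved.pop()                      # previous match fully covered
            else:
                s = pe                                  # truncate the current match
            if s >= e:
                continue
        resolved.append((s, e, ref, d))

    return merge_same_ref(resolved)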
For our experiments, we have collected and annotated a new dataset, ASID: the automatic setlist identification dataset. It contains pre-extracted features, metadata, YouTube or SoundCloud links, and timestamp annotations for 75 concerts and all the relevant reference songs (i.e., the songs that are played in each concert). Concert durations range between 21.7 min and 2.5 h, with a total duration of 99.5 h. The total number of reference songs is 1,298, with a total duration of 90.1 h. As mentioned in Sec. 1, we make this dataset publicly available.

ASID includes a variety of use cases regarding audio quality and genres. For this, we have selected three categories for audio quality: AQ-A, AQ-B, and AQ-C. AQ-A contains high-quality recordings, mainly coming from broadcast recordings or official releases. AQ-B contains professionally recorded concerts, mainly from small venues (in general, we observe that the mixing/mastering quality for concerts in AQ-B is inferior to the ones in AQ-A). Lastly, AQ-C contains smartphone or video camera recordings from varying-size venues/events. In terms of genre, we categorize the concerts into 5 main groups: pop/commercial, rock/metal, indie/alternative, hip-hop/rap, and electronic. The number of concerts for each audio quality and genre can be seen in Table 1.

Genre               AQ-A     AQ-B   AQ-C   Total
Pop/Commercial      8 (5)    3      3      14 (5)
Rock/Metal          8 (3)    7      6      21 (3)
Indie/Alternative   5        7      3      15
Hip-hop/Rap         5 (2)    0      3      8 (2)
Electronic          6        1      0      7
Total               32 (10)  18     15     65 (10)

Table 1. Number of concerts per audio quality and genre. The numbers in parentheses indicate the concerts in the development set.

We use 10 concerts (14.3 h) as a separate development set to select W and H for each VI system, and to train a classifier for the match revision step. The references for the development set include 180 songs. The remaining 65 concerts (85.2 h) and the related reference set are used for the main results. The total number of annotated segments for the evaluation set is 1,138, with a duration of 80.6 h.

Following common practice in industrial contexts, we evaluate our approach using: (1) true positives (TP), the number of matches (after merging and removing overlaps) that overlap the correct annotations in the ground truth (several correct matches that overlap the same annotation are counted separately); (2) false positives (FP), the number of matches (after merging and removing overlaps) that do not correspond to the related annotation; (3) detected annotations percentage (DAP), the ratio of the number of detected annotations with respect to the total number of annotations in the ground truth; and (4) detected length percentage (DLP), the ratio of the correctly-identified duration of all TP with respect to the total duration of annotations in the ground truth.
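For clarity, a minimal sketch of how these four metrics can be computed from the final match list and the ground-truth annotations is given below. It assumes both are lists of (start, end, reference_id) tuples; it is an illustration of the definitions above, not the released evaluation code.

def evaluate(matches, annotations):
    """Compute TP, FP, DAP (%) and DLP (%) as defined above."""
    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    tp, fp, detected, detected_length = 0, 0, set(), 0.0
    for m in matches:
        hits = [(i, a) for i, a in enumerate(annotations)
                if a[2] == m[2] and overlap(m, a) > 0]
        if hits:
            tp += 1                       # each correct match is counted separately
            for i, a in hits:
                detected.add(i)           # annotation i has been detected
                detected_length += overlap(m, a)
        else:
            fp += 1

    total_length = sum(a[1] - a[0] for a in annotations)
    dap = 100.0 * len(detected) / len(annotations)
    dlp = 100.0 * detected_length / total_length
    return tp, fp, dap, dlp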
3. RESULTS

3.1. Overall results
Based on the DLP values obtained for the development set, we select (W, H) pairs for each considered VI system. For Re-MOVE and Qmax, we select both (120,30) and (120,60), since the DLP values for those pairs show only very minor differences; for 2DFTM, we select only (120,15). Due to the computation time needed to evaluate Qmax on our test set (see Sec. 3.3), we compute results for only (120,30), and simulate the results for (120,60) by skipping the matches for every second window.

The overall results for each considered system and (W, H) pair show that, while the performances of the Re-MOVE and Qmax systems are fairly close, they both outperform 2DFTM by a considerable margin (Table 2). Both Re-MOVE and Qmax could identify more than 78% of the annotated segments (DAP metric, before the classifier), which is a considerably good performance, albeit using arbitrary windows for retrieval instead of clearly-segmented ones.

VI Config.     TP          FP           DAP (%)       DLP (%)
R - (120,30)   – / –       – / 112      79.8 / 64.6   60.3 / 54.8
Q - (120,30)   902 / 766   1217 / 132   78.3 / –      – / –
Q - (120,60)   905 / 736   1039 / –     – / –         – / 56.0
F - (120,15)   736 / 524   2686 / 453   61.3 / 46.0   41.1 / 37.1

Table 2. Overall results for 5 configurations on the evaluation set. R, Q, and F denote Re-MOVE, Qmax, and 2DFTM, respectively. The left/right values denote the metrics before/after the classifier.
The differences between DAP and DLP values suggest imprecise timestamp retrievals, which result mainly from using fixed W and H values without any fine-grained refinement of the timestamp resolution.

The values before and after the classifier show that even a simple classifier can reduce the number of FPs by more than 80% while reducing the TPs by only 15–20% (excluding 2DFTM) and the DLPs by 4.5% on average. This suggests that the distance and duration distributions for TPs and FPs are different enough to enable useful classification, especially for Qmax (120,60) (Fig. 2).

Fig. 2. Distance (left) and duration (right) distributions of TP and FP for Re-MOVE (120,30) (top) and Qmax (120,60) (bottom).
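The false-positive filter discussed above can be prototyped in a few lines with scikit-learn [22]. The sketch below is only illustrative: the text specifies an SVM over distance and duration features trained on development-set matches labelled correct/incorrect, but the kernel, feature scaling, and function names here are assumptions, not necessarily the choices used in the experiments.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_match_filter(dev_features, dev_labels):
    """`dev_features` has shape (n, 2) with [distance, duration_in_seconds] per match;
    `dev_labels` marks each development-set match as correct (1) or incorrect (0)."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(np.asarray(dev_features), np.asarray(dev_labels))
    return clf

def filter_matches(clf, matches):
    """Keep only the (start, end, ref_id, distance) matches predicted as correct."""
    feats = np.array([[d, e - s] for s, e, _, d in matches])
    keep = clf.predict(feats) == 1
    return [m for m, k in zip(matches, keep) if k]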
Audio quality — We now present results separately for each audio quality, using a limited set of configurations (Table 3). The total number of annotated segments and the total duration of those (TA and TL, respectively) for categories AQ-A, AQ-B, and AQ-C are (581, 48.8 h), (241, 15.8 h), and (316, 16.9 h), respectively. The results suggest that, surprisingly, audio quality is not the most crucial factor affecting system performance, as both systems yield higher DLP values for AQ-B and AQ-C compared to AQ-A. This shows that our input representation performs robustly against noise. The low values for AQ-A may mainly result from the variety of included genres/styles.

Genre — The results categorized by genre can be seen in Table 4. TA and TL values for each category, from top to bottom, are (272, 17.1 h), (372, 25.6 h), (208, 12.3 h), (171, 17.6 h), and (115, 8.0 h), respectively. We observe that the performances of both VI systems are consistent across genres, with "Hip-hop/Rap" being an outlier. Relying only on harmonic information for retrieval leads to a drastic performance decrease for certain musical styles (i.e., hip-hop). However, genre is not the only decisive factor: the results depicted in Fig. 3 show that system performance can vary widely among concerts even within the same genre, with "Pop/Commercial" having the most consistent results.
        VI Config.     TP          FP          DAP (%)       DLP (%)
AQ-A    R - (120,30)   – / –       586 / 96    – / –         – / –
        Q - (120,60)   458 / 393   – / –       – / –         – / 52.3
AQ-B    R - (120,30)   215 / 176   245 / 37    87.6 / 73.4   69.0 / 64.2
        Q - (120,60)   – / 179     195 / –     – / –         – / –
AQ-C    R - (120,30)   – / 183     244 / 44    – / –         – / –
        Q - (120,60)   226 / 164   272 / –     – / –         – / –

Table 3. Results based on audio quality. R and Q denote Re-MOVE and Qmax, respectively. The left/right values denote the metrics before/after the classifier.
Fig. 3. DLP and FP values after the classifier for each concert evaluated with Re-MOVE (120,30), categorized by genre.
Reference set size —
Although we evaluate our systems for each concert using the entire reference set, a more realistic scenario should include a significantly larger reference set. For this, we gradually expand our reference set using the MTG-Jamendo dataset (MJD) [23], which contains the full audio tracks of 55.7 k royalty-free songs. We assume that there is no intersection between ASID and MJD. Due to computation requirements (see Sec. 3.3), we only evaluate the Re-MOVE (120,30) setting in this scenario. Table 5 shows that an increase in reference set size negatively affects the system accuracy. However, the system can still correctly identify 70% of the annotated segments (before the classifier), and the decrease in performance seems to saturate after 45 k references, at least for the considered size regime.
Finally, we share our observations regarding algorithm runtimes. Since computation requirements depend on W and H, we here consider W = 120 and H = 30. Using pre-extracted cremaPCP features as input for each system and executing parallel computations with 32 cores, the Qmax algorithm takes approximately 20 days to complete the entire distance computations used for the main results in Table 2. In contrast, both Re-MOVE and 2DFTM take only 11 min. For the full MJD-expanded task in Table 5, the runtime of Re-MOVE only increases to 22 min (using pre-computed embeddings). Although Re-MOVE and Qmax result in similar performances, the drastic difference in their runtimes suggests that Re-MOVE is the only considered system that both scales up to large-scale retrieval scenarios and achieves a plausible accuracy.
                    VI Config.     TP          FP          DAP (%)       DLP (%)
Pop/Commercial      R - (120,30)   – / 207     221 / 40    – / –         – / –
                    Q - (120,60)   231 / 192   231 / –     – / –         – / 71.1
Rock/Metal          R - (120,30)   – / –       422 / 56    83.3 / –      – / –
                    Q - (120,60)   – / –       – / 22      83.9 / 70.2   – / –
Indie/Alternative   R - (120,30)   – / 153     118 / 22    – / –         – / –
                    Q - (120,60)   191 / 148   134 / –     – / –         – / 66.0
Hip-hop/Rap         R - (120,30)   – / 60      246 / 47    – / –         – / –
                    Q - (120,60)   63 / 47     264 / –     – / –         – / –
Electronic          R - (120,30)   – / 86      68 / 12     – / –         – / –
                    Q - (120,60)   69 / –      – / –       – / –         – / –

Table 4. Results based on genre. R and Q denote Re-MOVE and Qmax, respectively. The left/right values denote the metrics before/after the classifier.

Extra refs.   TP          FP           DAP (%)       DLP (%)
None          936 / 771   1075 / 177   80.3 / 67.8   60.5 / 56.9
15 k          860 / 678   1606 / 220   73.6 / 59.5   53.0 / 48.8
30 k          836 / 661   1738 / 217   71.6 / 58.1   51.6 / 47.4
45 k          812 / 643   1785 / 241   69.8 / 56.6   49.9 / 45.9
55.7 k        812 / 639   1841 / 244   69.7 / 56.2   49.6 / 45.5

Table 5. Results of Re-MOVE (120,30) on the MJD-expanded task.
4. CONCLUSION AND LEARNINGS
In this work, we have investigated the effectiveness of VI systems for automatic SLI in a wide range of use cases. For this, we have proposed an end-to-end workflow to identify the metadata and timestamps of the songs that are present in full concerts. For the retrieval step, we have compared three VI systems in terms of accuracy and scalability. We have proposed a series of post-processing steps that consolidate and revise the initially retrieved matches to filter out possible false positives from the final results. We have used a new dataset that contains 99.5 h of concerts, which we publicly share. Our findings suggest that while the audio quality of queries does not have a crucial effect on performance, thanks to the robustness of our input representation against noise, changes in musical styles/genres can have a drastic impact, as our system depends solely on the harmonic information in the audio. For processing the audio queries, using pre-determined window and hop sizes results in imprecise timestamps for the retrieved matches. We have also shown that increasing the size of the reference database negatively impacts the system accuracy. Finally, the reported runtimes for the considered configurations show a remarkable difference between using alignment-based or embedding-based VI systems. Overall, using Re-MOVE for retrieval yields promising results towards automatic SLI in large-scale contexts; however, further improvements to the general workflow are required to address real-world live performance monitoring use cases. In future work, we plan to investigate using an ensemble of VI systems that exploit various musical characteristics (e.g., melody, harmony) for the retrieval phase, and more elaborate false positive filtering schemes.

5. REFERENCES
[1] A. Wang, "The Shazam music recognition service," Commun. ACM, vol. 49, no. 8, pp. 44–48, 2006.
[2] B. Gfeller, B. Aguera-Arcas, D. Roblek, J. D. Lyon, J. J. Odell, K. Kilgour, M. Ritter, M. Sharifi, M. Velimirović, R. Guo, and S. Kumar, "Now Playing: Continuous low-power music recognition," in NIPS 2017 Workshop: Machine Learning on the Phone, 2017.
[3] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, "A review of audio fingerprinting," J. VLSI Signal Process. Syst., vol. 41, no. 3, pp. 271–284, 2005.
[4] J. Haitsma, T. Kalker, and J. Oostveen, "Robust audio hashing for content identification," in Int. Workshop on Content-Based Multimedia Indexing (CBMI), 2001.
[5] A. Wang, "An industrial-strength audio search algorithm," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2003.
[6] S. Fenet, G. Richard, and Y. Grenier, "A scalable audio fingerprint method with robustness to pitch-shifting," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2011, pp. 121–126.
[7] S. Joren and M. Leman, "Panako - a scalable acoustic fingerprinting system handling time-scale and pitch modification," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2014, pp. 259–264.
[8] R. Sonnleitner and G. Widmer, "Quad-based audio fingerprinting robust to time and frequency scaling," in Proc. of the Int. Conf. on Digital Audio Effects (DAFx), 2014, pp. 1–8.
[9] J. Serrà, X. Serra, and R. G. Andrzejak, "Cross recurrence quantification for cover song identification," New Journal of Physics, vol. 11, p. 093017, 2009.
[10] T. Bertin-Mahieux and D. P. W. Ellis, "Large-scale cover song recognition using the 2D Fourier Transform magnitude," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2012, pp. 241–246.
[11] G. Doras and G. Peeters, "Cover detection using dominant melody embeddings," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2019, pp. 107–114.
[12] F. Yesiler, J. Serrà, and E. Gómez, "Accurate and scalable version identification using musically-motivated embeddings," in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 21–25.
[13] F. Yesiler, J. Serrà, and E. Gómez, "Less is more: Faster and better music version identification with embedding distillation," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2020, pp. 884–892.
[14] J.-C. Wang, M.-C. Yen, Y.-H. Yang, and H.-M. Wang, "Automatic set list identification and song segmentation for full-length concert videos," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2014, pp. 239–244.
[15] Z. Rafii, B. Coover, and J. Han, "An audio fingerprinting system for live version identification using image processing techniques," in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 644–648.
[16] T. Tsai, T. Prätzlich, and M. Müller, "Known-artist live song identification using audio hashprints," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1569–1582, 2017.
[17] B. McFee and J. P. Bello, "Structured training for large-vocabulary chord recognition," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2017, pp. 188–194.
[18] F. Yesiler, C. Tralie, A. Correya, D. F. Silva, P. Tovstogan, E. Gómez, and X. Serra, "Da-TACOS: A dataset for cover song identification and understanding," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2019, pp. 327–334.
[19] F. Chollet et al., "Keras," https://keras.io, 2015.
[20] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, and X. Serra, "ESSENTIA: An audio analysis library for music information retrieval," in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2013, pp. 493–498.
[21] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, E. Battenberg, M. McVicar, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proc. of the Python in Science Conference (SciPy), 2015, pp. 18–25.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[23] D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, "The MTG-Jamendo dataset for automatic music tagging," in Proc. of the Machine Learning for Music Discovery Workshop (ICML), 2019.