Disentangled Multidimensional Metric Learning for Music Similarity
Jongpil Lee*, Nicholas J. Bryan, Justin Salamon, Zeyu Jin, Juhan Nam
Graduate School of Culture Technology, KAIST, Daejeon, South Korea
Adobe Research, San Francisco, CA, USA
* This work was performed during an internship at Adobe Research.
ABSTRACT
Music similarity search is useful for a variety of creative tasks such as replacing one music recording with another recording with a similar "feel", a common task in video editing. For this task, it is typically necessary to define a similarity metric to compare one recording to another. Music similarity, however, is hard to define and depends on multiple simultaneous notions of similarity (i.e. genre, mood, instrument, tempo). While prior work ignores this issue, we embrace this idea and introduce the concept of multidimensional similarity, unifying both global and specialized similarity metrics into a single, semantically disentangled multidimensional similarity metric. To do so, we adapt a variant of deep metric learning called conditional similarity networks to the audio domain and extend it using track-based information to control the specificity of our model. We evaluate our method and show that our single, multidimensional model outperforms both specialized similarity spaces and alternative baselines. We also run a user study and show that our approach is favored by human annotators as well.
Index Terms — multidimensional music similarity, metric learning, disentangled representation, query-by-example.
1. INTRODUCTION
Traditional music search methods such as those available on streaming services and online music repositories use text-based metadata (e.g. song, artist, album, and/or semantic tags) for music retrieval. However, there are scenarios where music metadata is either unavailable or insufficient: a concrete example is what we shall refer to as the "music replacement" problem, where a user wishes to replace one music recording with another recording that has a similar "feel", a common use case e.g. in video editing. Describing the desired musical traits may be extremely hard to do with text, but the user has an example of what they are searching for, and so query-by-example, and more specifically content-based music similarity and retrieval, is an attractive solution. While content-based music similarity has been studied extensively [1], it has found limited application in music recommendation platforms, which rely most heavily on interaction- and metadata-based collaborative filtering [2].
Fig. 1. An illustration of multiple dimensions of music similarity. Letters (A, B, C) denote different music recordings, while lines denote different dimensions of similarity.

Such techniques are not applicable, however, to the music replacement scenario, where there may be little-to-no interaction data, and a user's past music replacement selections can have little correlation with future replacement needs. From a retrieval specificity perspective, music replacement is less specific than music identification (fingerprinting), but more specific than tag-based retrieval (e.g. genre) or than finding similar-sounding music for listening purposes [1, 3], since the pragmatic goal of music replacement is to find songs which sound as close as possible to a query without being identical.

Content-based music similarity typically involves extracting a feature representation from audio recordings and computing the similarity (or distance) between them using a metric or score function. Previous approaches include vector quantization [4], linear metric learning [5, 6, 7], and, more recently, deep metric learning [8, 9, 10] using human similarity labels [11], artist labels [12], track labels [13], or tags in the context of zero-shot learning [14]. A common limitation of these approaches is that similarity is modeled as uni-dimensional, i.e. songs are modeled as similar or dissimilar along a single global axis. In actuality, music is a multidimensional phenomenon, and consequently there are various different dimensions along which songs can be compared (e.g. timbre, rhythm, genre, mood, etc.), and songs can be simultaneously similar along some dimensions, while different along others, as illustrated in Figure 1. It is also hard to determine precisely which dimensions people take into account when rating songs for similarity, or how they weight the importance of these dimensions. For this reason, from an application standpoint it can be beneficial to allow the user to specify which musical dimensions they care about when searching-by-example and how to weight their importance.

In this paper, we propose a deep disentangled metric learning method for learning a multidimensional music similarity space (embedding). We adapt Conditional Similarity Networks [15], previously only applied to images, to the audio domain, and employ a combination of user-generated tags and algorithmic estimates (i.e. tempo) to train a disentangled embedding space composed of sub-spaces corresponding to similarity along different musical dimensions: genre, mood, instrumentation, and tempo. Further, we propose a track-regularization technique to increase overall perceptual similarity across all dimensions as judged by humans. We evaluate our approach against several baselines, showing our proposed approach outperforms them both in terms of global similarity and similarity along specific dimensions. To validate our quantitative results, we run a user study and show that our proposed approach is favored by human annotators as well.
2. LEARNING MODEL

2.1. Metric learning with triplet loss
We use deep metric learning with a triplet loss as the basis for our learning model [8, 9]. At a high level, our model is presented with a triplet of samples, where one is considered the "anchor" and the other two consist of a "positive" and a "negative", and the model is trained to map the samples into an embedding space where the "positive" is closer to the "anchor" than the "negative", as illustrated in Figure 2(A). Formally, we define training triplets as a set $T = \{t_i\}_{i=1}^{N}$, where each triplet $t_i = \{x_a^i, x_p^i, x_n^i \mid s(x_a, x_p) > s(x_a, x_n)\}$, $x_a$ is the anchor sample, $x_p$ is the positive sample, $x_n$ is the negative sample, and $s$ is the musical dimension along which similarity is measured. Then, we define the triplet loss as:

$$L(t) = \max\{0, D(x_a, x_p) - D(x_a, x_n) + \Delta\}, \quad (1)$$

where $D(x_i, x_j) = \|f(x_i) - f(x_j)\|$ is the Euclidean distance between two audio embeddings, $\Delta$ is a margin value to prevent trivial solutions, and $f(\cdot)$ is a nonlinear embedding function or deep neural network that maps the audio input to the embedding space. For a given set $T$ and embedding function $f(\cdot)$, we use stochastic gradient descent to update the network weights and minimize the loss.

2.2. Conditional similarity networks

To jointly model multiple semantic dimensions of similarity within a single network, we adapt the work of Veit et al. [15], which proposed the use of Conditional Similarity Networks (CSN) for attribute-based image retrieval. The method introduces masking functions $m_s \in \mathbb{R}^d$, which are applied to the embedding space of size $d$. Each mask corresponds to a certain similarity dimension $s$ (denoted "condition" in [15]), e.g. mood or tempo, and is used to activate or block disjoint regions of the embedding space, as illustrated in Figure 2(B). Given a specific similarity dimension $s$, training triplets are defined as $T_s = \{t_s^i\}_{i=1}^{N}$, with each triplet given by:

$$t_s^i = (x_a^i, x_p^i, x_n^i; s), \quad (2)$$

and the training set combining triplets sampled from all similarity dimensions is defined as $T_S = \{T_s\}_{s=1}^{S}$. Consequently, we update the distance function to:

$$D(x_i, x_j; s) = \|f(x_i) \circ m_s - f(x_j) \circ m_s\|, \quad (3)$$

such that the mask $m_s$ only passes through the subspace of embedding features corresponding to similarity dimension $s$ during training, where $\circ$ denotes the Hadamard product. Accordingly, the loss is updated to:

$$L(t_s) = \max\{0, D(x_a, x_p; m_s) - D(x_a, x_n; m_s) + \Delta\}. \quad (4)$$

Fig. 2. Our proposed approach. (A) Standard triplet-based deep metric model, (B) conditional similarity masking, and (C) track regularization.

2.3. Track regularization

As noted earlier, music replacement requires retrieved songs to sound as close as possible to the query example. To this end, we propose to complement the aforementioned multidimensional metric learning approach with a regularization technique we refer to as "track regularization". The approach involves sampling an additional set of triplets based solely on the track (song) information: the anchor and positive are both sampled from the same song, while the negative is sampled from a different song. While this sampling was used previously to learn high-specificity music similarity directly [13], here we use it as a "similarity regularization" technique to enforce a certain degree of consistency across the entire (multidimensional) embedding space. With this regularization, our final loss is given by:

$$L(t_c, t_t) = L(t_c) + \lambda L(t_t), \quad (5)$$

where $t_c$ are all triplets sampled from the various music similarity dimensions corresponding to disjoint sub-embedding spaces, $t_t$ are triplets sampled using track information, and $\lambda$ allows us to control the trade-off between semantic similarity (low specificity) and overall track-based similarity (high specificity). Importantly, for track-based triplets, we use a mask with a value of one for all feature dimensions, meaning the regularization is applied to the complete embedding space to capture track similarity across all musical dimensions. Alternatively, this can be thought of as not applying any masking to the embedding space.
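To make Eqs. (1)–(5) concrete, the sketch below implements the masked triplet loss and the track-regularized objective in PyTorch. It assumes the configuration described in Section 3 (a 256-dimensional embedding split into four fixed 64-dimensional subspaces, λ = 0.5); the margin value, variable names, and batching scheme are illustrative assumptions rather than the exact code used in our experiments.

```python
# Minimal sketch of the conditional (masked) triplet loss with track regularization.
# Assumes the embedding network already outputs L2-normalized 256-d embeddings.
import torch

EMBED_DIM = 256          # total embedding size (four subspaces of 64, see Sec. 3)
NUM_DIMS = 4             # genre, mood, instrument, tempo
MARGIN = 0.1             # assumed value for the margin Delta (not stated here)
LAMBDA_TRACK = 0.5       # weight of the track-regularization term (Sec. 2.3)

# Fixed, disjoint binary masks m_s: each similarity dimension owns 64 features.
masks = torch.zeros(NUM_DIMS + 1, EMBED_DIM)
for s in range(NUM_DIMS):
    masks[s, s * 64:(s + 1) * 64] = 1.0
masks[NUM_DIMS] = 1.0    # track-based triplets use the full (unmasked) embedding


def masked_distance(za, zb, m):
    """Euclidean distance between embeddings restricted to subspace mask m (Eq. 3)."""
    return torch.norm(za * m - zb * m, dim=-1)


def conditional_triplet_loss(z_anchor, z_pos, z_neg, cond):
    """Triplet loss of Eq. (4); `cond` holds the similarity-dimension index per triplet."""
    m = masks[cond]                              # (batch, EMBED_DIM)
    d_pos = masked_distance(z_anchor, z_pos, m)
    d_neg = masked_distance(z_anchor, z_neg, m)
    return torch.clamp(d_pos - d_neg + MARGIN, min=0.0).mean()


def total_loss(cat_triplets, track_triplets):
    """Combined objective of Eq. (5): category triplets plus track regularization."""
    loss_cat = conditional_triplet_loss(*cat_triplets)      # (za, zp, zn, cond)
    loss_track = conditional_triplet_loss(*track_triplets)  # cond == NUM_DIMS -> full mask
    return loss_cat + LAMBDA_TRACK * loss_track
```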
3. EXPERIMENTAL DESIGN

3.1. Dataset and input features

For our experiments, we use the Million Song Dataset (MSD) [16]. Based on preliminary user studies on music replacement, we identify four relevant musical dimensions to consider: genre, mood, instrumentation, and tempo. To determine whether two songs are similar along these dimensions, we use Last.FM tag annotations associated with MSD tracks which have been previously grouped into different categories [17], resulting in 28 genre tags, 12 mood tags, and 5 instrument tags. Since the annotations lack tempo tags, we extract an algorithmic tempo estimate per track using the Madmom Python library [18, 19]. Two tracks are considered similar along a certain musical dimension (genre, mood, instruments) if they share at least one tag in that category, or are within 5 BPM of each other in the case of tempo. For track-based triplets, we ensure there is no more than 50% overlap between the anchor and positive samples. We split the data following [20], giving 201,680, 11,774, and 28,435 samples for the train, validation, and test sets, respectively.

For training, we use 3-second excerpts represented as a log-scaled mel-spectrogram S, extracted with librosa [21]. We use a window size of 23 ms with 50% overlap and compute 128 mel-bands per frame with the following log-compression: log(1 + 10·S), yielding a 128-band mel-spectrogram input as in [12]. The representation is z-score standardized using fixed mean and standard deviation values of 0.2 and 0.25, respectively.
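For concreteness, the following librosa-based sketch computes the input representation for one 3-second excerpt. The FFT and hop sizes (512 and 256 samples at a 22,050 Hz sample rate, i.e. roughly a 23 ms window with 50% overlap) and the function name are assumptions for illustration rather than our exact preprocessing code.

```python
# Sketch of the input feature extraction described above.
import numpy as np
import librosa

def log_mel_excerpt(path, offset=0.0, duration=3.0, sr=22050):
    """Load a 3-second excerpt and return a z-scored log-mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr, offset=offset, duration=duration)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=256, n_mels=128)  # ~23 ms window, 50% overlap
    S = np.log1p(10.0 * S)            # log(1 + 10*S) compression
    return (S - 0.2) / 0.25           # fixed z-score standardization
```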
3.2. Model architecture and training

For choosing the triplet network architecture, we ran preliminary experiments with several state-of-the-art convolutional building blocks [22], including a basic conv-batchnorm-maxpool block, ResNet [23], Squeeze-and-Excitation [24], and Inception [25]. Having identified the Inception block as the best option, we use the following model architecture: we start with 64 convolutional filters followed by strided max-pooling, followed by six Inception blocks, each comprising a "naïve" inception module with stride followed by another inception module [25]. We use ReLU non-linearities for all layers, and apply L2 normalization to the embedding features prior to computing the distance [10]. Our total embedding size is 256, and we consider four music similarity dimensions (genre, mood, instruments, tempo), each with a disjoint subspace of size 64. We also experimented with a trainable masking layer [15] (as opposed to fixed disjoint masks), but found it did not lead to any significant improvement. Moreover, using fixed masks has the added benefit of allowing us to weight each musical dimension independently post-hoc which, as noted earlier, is a desirable user interaction paradigm. We use the Adam optimizer [26] for training, reducing the learning rate by a factor of 5 when the validation loss does not decrease for 4 epochs, up to 5 times, after which we apply early stopping. The margin for the triplet loss is fixed, and, based on informal listening to the properties of the learned similarity space, λ is set to 0.5 when track regularization is applied.

3.3. Evaluation methodology

For evaluation, we use a set of held-out triplets sampled from the test set. We sample 40,000 triplets per musical dimension (genre, mood, instruments, tempo) as well as 40,000 triplets based on track information. To simulate our application scenario, we use triplets of full songs for evaluation; the only exception is track-based triplets, where we stick to 3-second excerpts since the anchor and positive are sampled from the same song and should not overlap by more than 50%. The embedding for a full song is obtained by computing embedding frames from 3-second non-overlapping windows and averaging them over the time dimension. Given a test triplet, a model is evaluated by testing whether the embedding distance between the anchor and positive samples is smaller than the distance between the anchor and negative (score of 1), or greater (score of 0). The scores for all triplets are averaged to obtain a final score between 0 (worst) and 1 (best).

To determine whether human subjects concur with the above quantitative evaluation, we also randomly sampled 4,000 triplets from the test set and asked people to annotate which track sounded more similar to the anchor (positive or negative) without showing which was which. Each triplet was annotated by 5-12 people, resulting in 39,440 human annotations. We then calculated the annotator agreement per triplet, defined as the ratio between the majority vote and the total number of annotations, and filtered out triplets where the agreement was below 0.9, resulting in 879 high-agreement human-annotated triplets. Since similarity judgements have a high degree of subjectivity, this limits the scope of our human evaluation to triplets where there is broad annotator agreement. Models are evaluated against these triplets as described earlier, obtaining a score between 0 and 1 in terms of consistency with user ratings. For reproducibility, we share our dataset of user similarity ratings, dim-sim, online, along with audio similarity examples for the proposed approach (https://jongpillee.github.io/multi-dim-music-sim/).

3.4. Baseline

As a strong baseline, we implement a vector quantization method that has been used for both similarity-based music retrieval and auto-tagging [5, 27]. We compute 13 MFCC coefficients and their first and second derivatives per frame for each track, randomly select 2,500,000 frames from all tracks, and cluster them using K-means with K = 1024 to produce a dictionary [5]. Given the dictionary, a track embedding is obtained by assigning each MFCC frame to its closest cluster and computing a normalized histogram of cluster assignments. The distance between any two tracks is then given by the Euclidean distance between their normalized histograms [5].
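The following sketch outlines this MFCC vector-quantization baseline using librosa and scikit-learn; the frame sampling details and function names are illustrative assumptions rather than the exact code used to produce the reported numbers.

```python
# Rough sketch of the MFCC-VQ baseline [5, 27].
import numpy as np
import librosa
from sklearn.cluster import KMeans

def mfcc_frames(path, sr=22050, n_mfcc=13):
    """Per-frame 13 MFCCs plus first and second derivatives (39-dim frames)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    frames = np.vstack([mfcc, librosa.feature.delta(mfcc),
                        librosa.feature.delta(mfcc, order=2)])
    return frames.T                                   # (num_frames, 39)

def build_codebook(all_frames, k=1024, n_samples=2_500_000, seed=0):
    """Cluster a random subset of training frames into a K-means dictionary."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(all_frames), size=min(n_samples, len(all_frames)), replace=False)
    return KMeans(n_clusters=k, random_state=seed).fit(all_frames[idx])

def vq_histogram(frames, codebook, k=1024):
    """Track embedding: normalized histogram of codeword assignments."""
    counts = np.bincount(codebook.predict(frames), minlength=k)
    return counts / counts.sum()
# Track-to-track distance is then the Euclidean distance between histograms.
```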
Used space     | Embedding Features                                | Genre | Mood  | Instruments | Tempo | Overall
---------------|---------------------------------------------------|-------|-------|-------------|-------|--------
All-dimensions | MFCC-VQ                                           | 0.563 | 0.481 | 0.495       | 0.516 | 0.514
               | Track                                             | 0.611 | 0.595 | 0.531       | 0.534 | 0.568
               | Category                                          | 0.647 | 0.633 | 0.562       | 0.875 | 0.679
               | Category + track regularization                   | 0.647 | 0.627 | 0.561       | 0.891 | 0.681
               | Category + disentanglement                        | 0.708 | 0.717 | 0.657       | 0.783 | 0.716
               | Category + disentanglement + track regularization | 0.693 | 0.704 | 0.626       | 0.836 | 0.715
Sub-dimensions | Set of specialized networks                       | 0.708 | 0.619 | 0.603       | 0.942 | 0.718
               | Category + disentanglement                        | 0.785 | 0.790 | 0.798       | 0.955 | 0.832
               | Category + disentanglement + track regularization | 0.765 | 0.743 | 0.700       | 0.953 | 0.790

Table 1. Prediction accuracy of category-based (genre, mood, instruments, tempo) triplets.
Embedding Features                                 | Track | User
---------------------------------------------------|-------|------
MFCC-VQ                                            | 0.833 | 0.654
Track                                              | 0.950 | 0.763
Category                                           | 0.975 | 0.766
Category + track regularization                    | 0.980 | 0.740
Category + disentanglement                         | 0.985 | 0.763
Category + disentanglement + track regularization  | 0.988 | 0.792

Table 2. Results on track-based and user-based triplets.
4. RESULTS
In Table 1, we present the numerical results obtained for each of the four held-out triplet sets corresponding to a music similarity dimension, as well as aggregated scores over all four triplet sets ("Overall"). The "Used space" column indicates which subset of the embedding space was used to compute the distance between pairs of tracks, where "all-dimensions" means all embedding features were used ($f(x)$), whereas "sub-dimensions" means only the subspace corresponding to the musical dimension ($f(x) \circ m_s$) from which the test triplets were sampled was used. We compare six models plus the baseline, specified in the "Embedding Features" column. The "Track" model was trained on triplets sampled based on track information only, while the "Category" models were trained on triplets sampled from the four music similarity dimensions (categories) of genre, mood, instruments, and tempo, both with and without disentanglement (subspace masking) and track regularization. For disentangled models, we include an additional baseline, "Set of specialized networks", which comprises four separate triplet-loss networks, each trained exclusively on triplets sampled from one of the four musical dimensions.

We see that all deep metric learning models outperform the MFCC-VQ baseline. More importantly, disentangling the embedding improves performance in almost all cases, with our disentangled model trained on all triplets jointly (Category + disentanglement) even outperforming the specialized networks trained separately on each dimension.

As one might expect, track regularization decreases numerical performance on each of the four triplet test sets, as it enforces all embedding subspaces to respect a global notion of track similarity. The key question is how it affects model performance when compared against the human ratings obtained from our user study, presented in Table 2. As a sanity check, we start by evaluating our models against the track-based triplet test set, presented in the "Track" column. We see that, as expected, track regularization increases performance on this high-specificity set. Somewhat surprisingly, training on category-sampled triplets outperforms training on track-sampled triplets, with disentanglement increasing performance further. Next, we turn to the results obtained from the user study, presented in the "User" column. We see that our proposed approach outperforms the baseline, and, as per our initial hypothesis, track regularization increases the overall user agreement with our model's similarity ratings when training on category triplets with disentanglement.
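The "Used space" distinction above, together with the post-hoc per-dimension weighting discussed in Section 3.2, can be expressed as a single weighted distance over the fixed subspaces. The sketch below illustrates this at retrieval time; the subspace ordering, function names, and example weights are assumptions, and the weighted variant illustrates the interaction paradigm rather than a configuration evaluated in this paper.

```python
# Sketch of retrieval-time scoring with fixed 64-d subspaces of a 256-d embedding.
import numpy as np

SUBSPACES = {"genre": slice(0, 64), "mood": slice(64, 128),
             "instrument": slice(128, 192), "tempo": slice(192, 256)}

def song_embedding(window_embeddings):
    """Full-song embedding: mean of 3-second window embeddings (Sec. 3.3)."""
    return np.mean(window_embeddings, axis=0)

def weighted_distance(query, candidate, weights):
    """Euclidean distance after scaling each subspace by a user-chosen weight.

    All weights equal to 1 reproduce the "all-dimensions" distance; a single weight
    of 1 with the rest 0 reproduces a "sub-dimensions" distance.
    """
    w = np.zeros(256)
    for name, value in weights.items():
        w[SUBSPACES[name]] = value
    return np.linalg.norm(np.sqrt(w) * (query - candidate))

# e.g. emphasize genre and tempo while ignoring mood:
# d = weighted_distance(q, c, {"genre": 1.0, "mood": 0.0, "instrument": 0.5, "tempo": 1.0})
```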
5. CONCLUSION
In this paper, we introduce a novel approach for deep metric learning of a disentangled, multidimensional music similarity space. We use Conditional Similarity Networks trained on a combination of user tags and algorithmic estimates, and introduce track regularization to control for retrieval specificity. Through a series of experiments, including both a quantitative evaluation and a user study, we demonstrate that our proposed approach outperforms several baselines, with per-dimension similarity performance increasing due to the disentangling of the embedding space, and agreement with human annotations increasing as a result of track regularization. Our solution is particularly relevant to the music replacement problem, and opens the door to novel interaction paradigms which permit the user to select which music dimensions they care about for retrieval, how to weight their relative importance, and how to balance subspace similarity versus high-specificity overall similarity. This approach can further be extended to general audio similarity, such as voice similarity based on a speaker's condition, phonation, or prosody. In the future, we plan to conduct further user studies to determine human agreement when considering each musical dimension in isolation, and evaluate the performance of our model against these ratings. We also plan to explore and evaluate our proposed approach for multi-query retrieval (query-by-multiple-examples) and mix-and-match scenarios where the user is interested in finding songs whose characteristics match the subspaces of different songs (e.g. the genre of example A with the tempo of example B).

6. REFERENCES
[1] M.A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney, "Content-based music information retrieval: Current directions and future challenges," Proc. of the IEEE, vol. 96, no. 4, pp. 668–696, 2008.
[2] O. Celma, "Music recommendation," in Music Recommendation and Discovery, pp. 43–85. Springer, 2010.
[3] P. Grosche, M. Müller, and J. Serrà, "Audio content-based music retrieval," in Dagstuhl Follow-Ups. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2012, vol. 3.
[4] B. Logan and A. Salomon, "A music similarity function based on signal analysis," in ICME, 2001, pp. 22–25.
[5] B. McFee, L. Barrington, and G. Lanckriet, "Learning content similarity for music recommendation," IEEE Transactions on Audio, Speech, and Language Processing, 2012.
[6] M. Slaney, K. Weinberger, and W. White, "Learning a metric for music similarity," in ISMIR, 2008.
[7] D. Wolff, S. Stober, A. Nürnberger, and T. Weyde, "A systematic comparison of music similarity adaptation approaches," in ISMIR. FEUP Edições, 2012, pp. 103–108.
[8] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, "Learning fine-grained image similarity with deep ranking," in CVPR, 2014, pp. 1386–1393.
[9] E. Hoffer and N. Ailon, "Deep metric learning using triplet network," in Int. Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 84–92.
[10] A. Jansen, M. Plakal, R. Pandya, D.P.W. Ellis, S. Hershey, J. Liu, R.C. Moore, and R.A. Saurous, "Unsupervised learning of semantic audio representations," in ICASSP. IEEE, 2018, pp. 126–130.
[11] R. Lu, K. Wu, Z. Duan, and C. Zhang, "Deep ranking: Triplet matchnet for music metric learning," in ICASSP. IEEE, 2017, pp. 121–125.
[12] J. Park, J. Lee, J. Park, J.-W. Ha, and J. Nam, "Representation learning of music using artist labels," in ISMIR, 2018, pp. 717–724.
[13] J. Lee, J. Park, and J. Nam, "Representation learning of music using artist, album, and track information," in Machine Learning for Music Discovery Workshop, ICML, 2019.
[14] J. Choi, J. Lee, J. Park, and J. Nam, "Zero-shot learning for audio-based music classification and tagging," in ISMIR, 2019.
[15] A. Veit, S. Belongie, and T. Karaletsos, "Conditional similarity networks," in CVPR, 2017, pp. 830–838.
[16] T. Bertin-Mahieux, D.P.W. Ellis, B. Whitman, and P. Lamere, "The million song dataset," in ISMIR, 2011.
[17] K. Choi, G. Fazekas, M. Sandler, and K. Cho, "Convolutional recurrent neural networks for music classification," in ICASSP. IEEE, 2017, pp. 2392–2396.
[18] S. Böck, F. Krebs, and G. Widmer, "Accurate tempo estimation based on recurrent neural networks and resonating comb filters," in ISMIR, 2015, pp. 625–631.
[19] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer, "Madmom: A new python audio and music signal processing library," in Proc. ACM Multimedia. ACM, 2016, pp. 1174–1178.
[20] J. Lee and J. Nam, "Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging," IEEE SPL, vol. 24, no. 8, pp. 1208–1212, 2017.
[21] B. McFee, C. Raffel, D. Liang, D.P.W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in python," in Proc. of the Python in Science Conference (SciPy), 2015, vol. 8.
[22] T. Kim, J. Lee, and J. Nam, "Comparison and analysis of SampleCNN architectures for audio classification," IEEE JSTSP, vol. 13, no. 2, pp. 285–297, 2019.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[24] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in CVPR, 2018, pp. 7132–7141.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015, pp. 1–9.
[26] D.P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[27] D. Liang, J.W. Paisley, and D.P.W. Ellis, "Codebook-based scalable music tagging with poisson matrix factorization," in ISMIR, 2014.