Self-Supervised VQ-VAE for One-Shot Music Style Transfer
Ondřej Cífka⋆†, Alexey Ozerov†, Umut Şimşekli‡⋆, Gaël Richard⋆

⋆ LTCI, Télécom Paris, Institut Polytechnique de Paris, France
† InterDigital R&D, Cesson-Sévigné, France
‡ Inria/ENS, Paris, France

This work was supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068 (MIP-Frontiers).
ABSTRACT
Neural style transfer, which allows applying the artistic style of one image to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it is able to outperform selected baselines.
Index Terms — Style transfer, music, timbre, self-supervised learning, deep learning
1. INTRODUCTION
Neural style transfer techniques, originally proposed for images [1, 2], allow applying the 'artistic style' of one image to another. Recently, there has been increased interest in developing similar methods for music, and promising works in this domain have begun to emerge. Especially compelling are the results achieved by several recent works on timbre conversion [3, 4, 5, 6], leading to entertaining applications (see e.g. https://g.co/tonetransfer). However, a common property of these deep learning-based methods is that they require training for each individual target instrument. Consequently, the set of target instruments available in these systems is typically small, as adding new ones is a time-consuming process which depends on the availability of clean training data.

In the present work, we instead propose to tackle a more general task, which we refer to as one-shot timbre transfer. (Similarly to [7], we supplement the somewhat ambiguous term 'timbre transfer' with the attribute 'one-shot' to specify that we aim to imitate the timbre of a single example presented at test time.) Borrowing the terminology of image style transfer, our goal is to transfer the timbre of a style input onto a content input while preserving the pitch content of the latter. To this end, we develop a single generic model capable of encoding pitch and timbre separately and then combining their representations to produce the desired output.

Unlike many previous music style transformation works (e.g. [7, 5, 8, 9]), we neither assume the training data to be paired or otherwise annotated, nor do we rely on existing models or algorithms to create artificial annotations (e.g. pitch contours or timbre-related descriptors). This leads to the need for data-driven disentanglement of the pitch and timbre representations learned by the model. In this work, we propose to perform this disentanglement using a combination of discrete representation learning (via an extension of the vector-quantized variational autoencoder, or VQ-VAE [10]), self-supervised learning, and data augmentation.

Our contributions can be summarized as follows:
• We present the first neural model for one-shot instrument timbre transfer. The model operates via mutually disentangled pitch and timbre representations, learned in a self-supervised manner without the need for annotations.
• We train and test our model on a dataset where each recording contains a single, possibly polyphonic instrument. Using a set of newly proposed objective metrics, we show that the method constitutes a viable solution to the task and is able to compete with baselines from the literature. We also provide audio examples for perceptual comparison by the reader (https://adasp.telecom-paris.fr/s/ss-vq-vae).
• Since our approach to disentanglement is largely data-driven, it should be extensible to other music transformation tasks, such as arrangement or composition style transfer.
• Our source code is available online (https://github.com/cifkao/ss-vq-vae).

Fig. 1. A high-level depiction of the proposed method. We extract pairs of segments from audio files and use them for self-supervised learning of a VQ-VAE with an additional style encoder. The content representation c_1, ..., c_L is discrete, the style representation s is continuous.
2. RELATED WORK
Prior work on our topic is rather limited. To our knowledge, most existing works that fall under our definition of one-shot music timbre transfer [11, 12, 13] are based on non-negative matrix factorization (NMF) combined with musaicing [14] (a form of concatenative synthesis). Other works on audio style transfer [15, 16] adapt the original image style transfer algorithm [1] to audio, but do not focus specifically on musical signals. As for timbre or style conversion, several methods were recently proposed for musical audio [3, 4, 5, 6]. While these approaches achieve remarkable output quality, they cannot be considered one-shot, as they only allow for conversion to the (small) set of styles present in the training data. Moreover, unlike our method, they require training a separate decoder for each target style; in particular, [3] report unsuccessful attempts to train a single decoder conditioned on the identity of the target instrument.

Other recent works [17, 18, 19, 20] are related to ours in that they also learn a continuous timbre representation which allows for audio generation, but are limited to the simple case of isolated notes.
3. BACKGROUND

3.1. Vector-quantized variational autoencoder (VQ-VAE)
The VQ-VAE [10] is an autoencoder with a discrete latent representation. It consists of an encoder, which maps the input x to a sequence z of discrete codes from a codebook, and a decoder, which tries to map z back to x. Using discrete latent codes places a limit on the amount of information that they can encode. The authors successfully exploit this property to achieve voice conversion. In this work, we follow a similar path to achieve music style transfer.

Formally, the encoder first outputs a sequence E(x) ∈ R^{L×D} of D-dimensional feature vectors, which are then passed through a quantization (discretization) operation Q which selects the nearest vector from a discrete embedding space (codebook) e ∈ R^{K×D}:

  z_i = Q(E_i(x)) = \arg\min_{e_j,\, 1 \le j \le K} \left\| E_i(x) - e_j \right\|   (1)

The model is trained to minimize a reconstruction error L_ae between the input x and the output of the decoder D(Q(E(x))). The backpropagation of its gradient through the discretization bottleneck Q to the encoder is enabled via straight-through estimation, where the gradient with respect to Q(E(x)) received from the decoder is instead assigned to E(x). To ensure the alignment of the codebook e and the encoder outputs E(x), two other terms appear in the VQ-VAE objective – the codebook loss and the commitment loss:

  \mathcal{L}_{\mathrm{cbk}} = \left\| \mathrm{sg}\!\left[E(x)\right] - Q(E(x)) \right\|   (2)

  \mathcal{L}_{\mathrm{cmt}} = \left\| E(x) - \mathrm{sg}\!\left[Q(E(x))\right] \right\|   (3)

Here sg[·] stands for the 'stop-gradient' operator, defined as identity in the forward computation, but blocking the backpropagation of gradients. The two losses are therefore identical in value, but the first only affects (i.e. has non-zero partial derivatives w.r.t.) the codebook e (via Q), while the second only affects the encoder E. A weighting hyperparameter β is applied to L_cmt in the total loss:

  \mathcal{L} = \mathcal{L}_{\mathrm{ae}} + \mathcal{L}_{\mathrm{cbk}} + \beta\, \mathcal{L}_{\mathrm{cmt}}   (4)

3.2. Self-supervised learning

Self-supervised learning is a technique for learning representations of unlabeled data. The basic principle is to expose the inner structure of the data – by splitting each example into parts or by applying simple transformations to it – and then exploit this structure to define an artificial task (sometimes called the pretext task) to which supervised learning can be applied. Notable examples include predicting context (e.g. the neighboring words in a sentence [21] or a missing patch in an image [22]), the original orientation of a rotated image [23], or the 'arrow of time' in a (possibly reversed) video [24]. In this work, we extract pairs of excerpts from audio files and rely on them to learn a style representation as detailed in the following section.
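To make the discretization bottleneck and the straight-through trick concrete, here is a minimal PyTorch sketch of a VQ layer following Eqs. (1)–(3); the class, its default hyperparameter values, and the use of mean squared error for the two auxiliary losses are illustrative choices of ours, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck with a straight-through gradient estimator
    (illustrative sketch; num_codes and dim are placeholders)."""

    def __init__(self, num_codes=2048, dim=1024, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # e in R^{K x D}
        self.beta = beta

    def forward(self, e_x):
        # e_x: encoder output E(x) of shape (batch, L, D).
        # Distances to all codebook vectors, then nearest-code lookup (Eq. 1).
        dist = torch.cdist(e_x, self.codebook.weight.unsqueeze(0).expand(e_x.size(0), -1, -1))
        codes = dist.argmin(dim=-1)          # discrete codes c_1, ..., c_L
        q_x = self.codebook(codes)           # Q(E(x))

        # Codebook loss (Eq. 2): gradients reach the codebook only.
        l_cbk = F.mse_loss(q_x, e_x.detach())
        # Commitment loss (Eq. 3): gradients reach the encoder only.
        l_cmt = F.mse_loss(e_x, q_x.detach())

        # Straight-through estimator: use Q(E(x)) in the forward pass, but let
        # the decoder's gradient flow back to E(x) unchanged in the backward pass.
        q_x = e_x + (q_x - e_x).detach()
        return q_x, codes, l_cbk + self.beta * l_cmt
```

During training, the returned loss term would simply be added to the reconstruction loss L_ae, as in Eq. (4).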
4. METHOD
Given the goal of mapping two inputs – the content input x and the style input y – to an output, it is natural to define an encoder-decoder model with two encoders (one for each input) and a single decoder. It remains to describe how to train this model, and in particular, how to ensure the mutual disentanglement of the style and content features. Our proposal, illustrated in Fig. 1, rests on two key points:

(i) We use a discrete representation c_1, ..., c_L for content and train the model to reconstruct the content input x; hence, the content encoder together with the decoder form a VQ-VAE. This is motivated by the success of the VQ-VAE on voice conversion, as mentioned in Section 3.1.

(ii) The output of our style encoder is a single continuous-valued embedding vector s. To ensure that the style encoder only encodes style (i.e. to make it content-independent), we employ a simple self-supervised learning strategy where we feed a different input y to the style encoder such that x and y are different segments of the same audio recording (with some data augmentation applied; see Section 4.1 for details).

These choices are complementary to each other, as we will now see. Firstly, (i) necessarily means that the content encoder will drop some information from the content representation c. Since this alone does not guarantee that only content information will be preserved, (ii) is introduced to guide the encoder to do so. Our reasoning is that providing a separate style representation, not constrained by the discretization bottleneck, should make it unnecessary to also encode style information in c.

Secondly, it can be expected that in a trained model, only information useful for reconstructing x will influence the output. Hence, due to (ii) and provided that x and y do not share any content information, we expect s to only encode style. Also note that the discretization bottleneck in (i) is key for learning a useful style representation s: without it, y may be completely ignored by the model.

Once trained, the model is used for inference simply by feeding the content input and the style input to the respective encoders.

4.1. Training data

Our self-supervised learning strategy consists in training on pairs of segments x, y where each such pair comes from a single recording. The underlying assumption is that such x and y have the same style (timbre) but different content. We combine data from two different sources, chosen to easily satisfy this assumption:

1. LMD. The 'full' version of the Lakh MIDI Dataset [25] (LMD-full, https://colinraffel.com/projects/lmd/), containing 178 k MIDI files (about a year's worth of music in a symbolic representation). We pick a random non-drum part from each file, sample two 8-second segments of this part and render them as audio using a sample-based synthesizer (FluidSynth), with the SoundFont picked randomly out of 3 options (Fluid R3 GM, TimGM6mb, and Arachno SoundFont; see [26]).

2. RT. A set of audio tracks from PG Music; specifically, the RealTracks included with Band-in-a-Box UltraPAK 2018. Each RealTrack (RT) is a collection of studio recordings of a single instrument playing either an accompaniment part or a solo in a single style. We extract pairs of short segments totalling up to 20 min per RT, and clip each segment to a fixed length after performing data augmentation (see below).

We perform two kinds of data augmentation. Firstly, we transpose each segment from LMD up or down by a random interval (up to 5 semitones) prior to synthesis; this ensures that the two segments in each pair have different content, but does not affect their timbre. Secondly, we apply a set of random timbre-altering transformations to increase the diversity of the data:

• (LMD only.) Randomly changing the MIDI program (instrument) to a different one from the same broad family of instruments (keyboards & guitars; basses; winds & strings; ...) prior to synthesis.

• (RT only.) Audio resampling, resulting in joint time-stretching and transposition by up to a few semitones.

• Audio effects, drawn from reverb, overdrive, phaser, and tremolo, with randomly sampled parameters.

An identical set of transformations is applied to both examples in each pair to ensure that their timbres do not depart from each other. After this procedure, we end up with 209 k training pairs (119 k from LMD and 90 k from RT). (The final number is lower than the number of files in LMD due to corrupt MIDI files and parts with insufficiently many notes being discarded.)
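As a rough illustration of how such a training pair might be constructed (this is not the authors' exact pipeline: the segment handling, the transposition range, and the use of naive resampling as the shared augmentation are assumptions made for the sake of the example):

```python
import numpy as np

def make_training_pair(audio, sr, seg_dur=8.0, max_semitones=2.0, rng=None):
    """Cut two segments from one recording (same timbre, different content)
    and apply one shared, randomly drawn transformation to both, so that
    their timbres stay matched. Parameter values are illustrative."""
    rng = rng or np.random.default_rng()
    seg_len = int(seg_dur * sr)

    # Two random (ideally non-overlapping) segments of the same recording.
    starts = rng.choice(len(audio) - seg_len, size=2, replace=False)
    x = audio[starts[0]:starts[0] + seg_len]
    y = audio[starts[1]:starts[1] + seg_len]

    # Shared augmentation: naive resampling by a random factor, i.e. a joint
    # time-stretch and transposition, applied identically to both segments.
    semitones = rng.uniform(-max_semitones, max_semitones)
    factor = 2.0 ** (semitones / 12.0)

    def resample(seg):
        idx = np.arange(0, len(seg) - 1, factor)
        return np.interp(idx, np.arange(len(seg)), seg)

    return resample(x), resample(y)
```

In the actual pipeline described above, the LMD transposition is applied to the MIDI before synthesis and differs between the two segments (to decorrelate their content), whereas the timbre-altering transformations are shared within each pair.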
4.2. Model architecture and training

We represent the audio signal as a log-scale magnitude STFT (short-time Fourier transform) spectrogram with a hop size of 1/32 s. To obtain the output audio, we invert the STFT using the Griffin–Lim algorithm [27].

The model architecture is depicted in Fig. 2. The encoders treat the spectrogram as a 1D sequence with the frequency bins as channels and process it using a series of 1D convolutional layers which serve to downsample it (i.e. reduce its temporal resolution). The last layer of the style encoder is a GRU (gated recurrent unit [28]) layer, whose final state s is used as the style representation. This vector s is then fed to two of the decoder layers by concatenating it with the preceding layer's outputs at each time step. The decoder consists of 1D transposed convolutional layers which upsample the feature sequence back to the original resolution. GRU layers are inserted to combine the content and style representations in a context-aware fashion.

We train the model using Adam [29] to minimize the VQ-VAE loss from Eq. (4), defining the reconstruction loss L_ae as the mean squared error between x and x̂. We train for 32 epochs, taking about 20 hours in total on a Tesla V100 GPU.

Fig. 2. The model architecture: the content encoder (1D convolutions feeding a VQ bottleneck with K = 2048 codes), the style encoder (1D convolutions followed by a GRU whose final state gives s), and the decoder (transposed 1D convolutions interleaved with GRUs, ending in a max(0, ·) nonlinearity). All convolutions are 1D, with the kernel size and stride shown. All layers have the same number of channels, except for the last two (conv⊤ & GRU), whose width equals the number of frequency bins. All layers except for the input layers and the VQ are preceded by batch normalization and a Leaky ReLU activation [30]. conv⊤ stands for transposed convolution.
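The following PyTorch sketch illustrates how the pieces fit together (content encoder → VQ → decoder conditioned on the style vector); layer counts, channel sizes, and the exact placement of the GRUs and of s are simplified placeholders rather than the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CH = 512  # placeholder channel count

class SSVQVAE(nn.Module):
    """Simplified sketch of the two-encoder VQ-VAE (cf. Fig. 2)."""

    def __init__(self, n_bins, vq):
        super().__init__()
        # Content encoder: strided 1D convolutions downsample in time (4x here).
        self.content_enc = nn.Sequential(
            nn.Conv1d(n_bins, CH, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(CH), nn.LeakyReLU(),
            nn.Conv1d(CH, CH, kernel_size=4, stride=2, padding=1),
        )
        self.vq = vq  # e.g. the VectorQuantizer sketched in Section 3.1 (dim=CH)
        # Style encoder: convolutions followed by a GRU; its final state is s.
        self.style_conv = nn.Sequential(
            nn.Conv1d(n_bins, CH, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(CH), nn.LeakyReLU(),
        )
        self.style_gru = nn.GRU(CH, CH, batch_first=True)
        # Decoder: transposed convolutions upsample back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(2 * CH, CH, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(CH), nn.LeakyReLU(),
            nn.ConvTranspose1d(CH, n_bins, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),  # max(0, .) output nonlinearity
        )

    def forward(self, x_spec, y_spec):
        # x_spec, y_spec: (batch, n_bins, frames); frame count divisible by 4 assumed.
        c = self.content_enc(x_spec)                     # content features
        c_q, _, vq_loss = self.vq(c.transpose(1, 2))     # discretize along time
        c_q = c_q.transpose(1, 2)
        h = self.style_conv(y_spec).transpose(1, 2)
        _, s = self.style_gru(h)                         # s: (1, batch, CH)
        s = s[-1].unsqueeze(-1).expand(-1, -1, c_q.size(-1))
        x_hat = self.decoder(torch.cat([c_q, s], dim=1)) # condition on the style vector
        loss = F.mse_loss(x_hat, x_spec) + vq_loss       # Eq. (4) with MSE as L_ae
        return x_hat, loss
```

At inference time, one would feed the content recording's spectrogram as x_spec and the style recording's as y_spec, then invert the predicted magnitude spectrogram with Griffin–Lim (e.g. librosa.griffinlim).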
5. EXPERIMENTS
As in previous music style transformation work [31, 7], we wish to evaluate our method on two criteria: (a) content preservation and (b) style fit. In timbre transfer, these should express (a) how much of the pitch content of the content input is retained in the output, and (b) how well the output fits the target timbre. To this end, we propose the following objective metrics for measuring pitch and timbre dissimilarity, respectively, between an output and a reference recording:
(a) Pitch: We extract pitch contours from both recordings using a multi-pitch version of the MELODIA algorithm [32] implemented in the Essentia library [33]. We round the pitches to the nearest semitone and express the mismatch between the two pitch sets A, B at each time step as the Jaccard distance:

  d_\mathrm{J}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}

We report the mean value of this quantity over time.

(b) Timbre: Mel-frequency cepstral coefficients (MFCCs) 2–13 are generally considered to be a good approximate timbre representation [34]. Since they are computed on a per-frame basis, we train a triplet network [35] on top of them to aggregate them over time and output a single dissimilarity score. More details can be found on the supplementary website.
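For concreteness, the pitch part of the metric reduces to the following computation once the per-frame pitch sets have been extracted (the multi-pitch extraction itself is done with Essentia's MELODIA variant and is not shown; the handling of frames where both sets are empty is our own assumption):

```python
import numpy as np

def pitch_distance(frames_a, frames_b):
    """Mean per-frame Jaccard distance between two sequences of pitch sets.
    Each frame is a set of semitone-rounded pitches (e.g. MIDI note numbers)."""
    dists = []
    for A, B in zip(frames_a, frames_b):
        union = A | B
        if not union:
            dists.append(0.0)  # both frames silent: counted as a perfect match (assumption)
        else:
            dists.append(1.0 - len(A & B) / len(union))
    return float(np.mean(dists))

# Identical frames contribute 0, disjoint pitch sets contribute 1:
print(pitch_distance([{60, 64}, {67}], [{60, 64}, {69}]))  # -> 0.5
```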
We compare our method to 2 trivial baselines and 2 baselines from the literature:

• CP-CONTENT: Copies the content input to the output.
• CP-STYLE: Copies the style input to the output.
• U+L: The algorithm of Ulyanov and Lebedev [15] (not specifically designed for timbre transfer), consisting in optimizing the output spectrogram for a content loss and a style loss. We tune the ratio of the weights of the two losses on a small synthetic validation set to minimize the log-spectral distance (LSD) to the ground truth (see Section 5.1).
• Musaicing: A freely available implementation (https://github.com/ctralie/LetItBee/) of the musaicing algorithm of Driedger et al. [11].
                         Artificial                      Real
System            LSD (T)  Timbre (T)  Pitch (T)  Timbre (S)  Pitch (C)
CP-CONTENT           –         –          –           –           –
CP-STYLE             –         –          –           –           –
Musaicing [11]     14.51     0.2933     0.6445      0.2319      0.6297
This work            –         –          –           –           –

Table 1. Evaluation results. Distances marked S, C, and T are computed w.r.t. the style input, the content input, and the synthetic target, respectively. Results that are trivially 0 are omitted.
5.1. Artificial benchmark

First, we evaluate our method on a synthetic dataset generated from MIDI files. Although such data is not completely realistic, it enables conducting a completely objective benchmark by comparing the outputs to a synthetic ground truth.

We generate the data from the Lakh MIDI Dataset (LMD) similarly as in Section 4.1, but using a set of files held out from the training set, and with no data augmentation. We use the Timbres Of Heaven SoundFont (see [26]), not used for the training set. We randomly draw content–style input pairs and generate a corresponding ground-truth target for each pair by synthesizing the content input using the instrument of the style input. More details are given on the supplementary website.

Both the pitch and timbre distance are measured with respect to the ground-truth target. Additionally, we measure an overall distance to the target as the root-mean-square error computed on dB-scale mel spectrograms; this is known as the log-spectral distance or LSD.
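For reference, a minimal sketch of the LSD computation described above; the sampling rate, mel resolution and STFT settings here are illustrative defaults, not necessarily those used in the paper:

```python
import numpy as np
import librosa

def log_spectral_distance(a, b, sr=22050, n_mels=128):
    """Root-mean-square error between dB-scale mel spectrograms (LSD)."""
    def db_mel(y):
        m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return librosa.power_to_db(m)
    A, B = db_mel(a), db_mel(b)
    n = min(A.shape[1], B.shape[1])  # align the number of frames
    return float(np.sqrt(np.mean((A[:, :n] - B[:, :n]) ** 2)))
```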
5.2. Real benchmark

We create a more realistic test set based on the 'Mixing Secrets' audio library [36], containing multi-track recordings from various (mostly popular music) genres. After filtering out multi-instrument, vocal and unpitched percussion tracks, we extract 690 content–style input pairs similarly as in Section 5.1. As no ground truth is available in this dataset, we compute the pitch and timbre metrics with respect to the content and style input, respectively.

The results of both benchmarks are shown in Table 1. First, our system outperforms all baselines on LSD and the timbre metric. The difference to the CP-CONTENT baseline is negative in more than 75 % of examples on both of these metrics and in both benchmarks. Hence, viewing our system as a timbre transformation applied to the content input, we can conclude that, informally speaking, the transformation is at least partly successful in more than 75 % of cases. We may also notice that the result of CP-STYLE on timbre is, somewhat counter-intuitively, outperformed by our system. This may be a sign that the timbre metric is still somewhat influenced by pitch.

Turning to the pitch distance metric, we note that its values seem rather high on a scale from 0 to 1. However, most of this error should be attributed to the pitch tracking algorithm rather than to the systems themselves. This is documented by the fact that the pitch distance of CP-CONTENT to the ground-truth target is itself substantially above its ideal value of 0. Another useful value to look at is the result of CP-STYLE: as the style input is selected randomly, its pitch distance value should be high, and it indeed is. Using these two points of reference, we observe that our system's result is much closer to the former than to the latter in both benchmarks, which is the desired outcome. Moreover, it outperforms the musaicing baseline in both cases, albeit only slightly on real inputs.
6. DISCUSSION
Our subjective observations upon examining the outputs mostly match the objective evaluation. We find that, although the sound quality of our outputs is far from perfect, their timbre typically does sound much closer to the style input than to the content input. (Low synthesis quality and various artifacts are somewhat expected, as they are a common occurrence with the Griffin–Lim algorithm, as well as decoders based on transposed convolutions [37]. However, synthesis quality is not the main focus of this preliminary work.)

The pitch of the content input is generally well preserved in the output, yet faster notes and polyphony seem to pose a problem. We believe this is caused by a low capacity of the discrete content representation. Even though a codebook size of 2048 seems more than sufficient in theory, we found that on both of our test sets combined, only a fraction of the codebook vectors are actually used in practice. This means, for example, that at a tempo of 120 BPM, only a limited number of bits of information can be encoded per beat. This 'codebook collapse' [38] is a known issue with VQ-VAEs.

We also observe that our method works better on target instruments with a temporally 'stable' sound, e.g. piano; this might also explain why our method achieves better evaluation results on synthetic inputs (generated using samples) than on real ones, which are less predictable. A likely culprit is our use of a deterministic model, which cannot possibly capture the acoustic variability of instruments like saxophone or violin while being able to convert from an instrument that lacks this variability. This could be remedied by replacing our decoder with a probabilistic one which models a fully expressive conditional distribution, such as WaveNet [39].

The musaicing baseline, which uses fragments from the style input to construct the output, generally matches the target timbre very precisely, but is often less musically correct than ours. For example, note onsets tend to lack clear attacks; pitch errors and spurious notes occur, especially when the style input is non-monophonic or fast.

Finally, let us comment on the U+L baseline. Although its results on pitch are excellent, this is caused by the fact that the style weight obtained by tuning is very low (much lower than the content weight), causing the algorithm to behave much like CP-CONTENT. This is also reflected by the timbre metric. Experimenting with higher weights, we notice that the algorithm is able to transfer fragments of the style input to the output, but cannot transpose (pitch-shift) them to match the content input.
7. CONCLUSION
We have proposed a novel approach to one-shot timbre transfer, based on an extension of the VQ-VAE, along with a simple self-supervised learning strategy. Our results demonstrate that the method constitutes a viable approach to the timbre transfer task and is able to outperform baselines from the literature.

The most important shortcoming of our method seems to be the use of a deterministic decoder. We believe that a more expressive decoder such as WaveNet should allow improving the performance, especially on instruments with great temporal variability, and perhaps enable extensions to more challenging style transfer tasks, such as arrangement or composition style transfer.

8. REFERENCES

[1] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in CVPR, 2016.
[2] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in ICCV, 2017.
[3] N. Mor, L. Wolf, A. Polyak, and Y. Taigman, "A universal music translation network," in ICLR, 2019.
[4] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse, "TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer," in ICLR, 2019.
[5] J. Engel, L. H. Hantrakul, C. Gu, and A. Roberts, "DDSP: Differentiable digital signal processing," in ICLR, 2020.
[6] A. Bitton, P. Esling, and T. Harada, "Vector-quantized timbre representation," arXiv preprint arXiv:2007.06349, 2020.
[7] O. Cífka, U. Şimşekli, and G. Richard, "Groove2Groove: One-shot music style transfer with supervision from synthetic data," IEEE/ACM Trans. on Audio, Speech, and Lang. Proc., vol. 28, pp. 2638–2650, 2020.
[8] Z. Wang, D. Wang, Y. Zhang, and G. Xia, "Learning interpretable representation for controllable polyphonic music generation," in ISMIR, 2020.
[9] S. Nercessian, "Zero-shot singing voice conversion," in ISMIR, 2020.
[10] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in NIPS, 2017.
[11] J. Driedger, T. Prätzlich, and M. Müller, "Let it Bee – towards NMF-inspired audio mosaicing," in ISMIR, 2015.
[12] C. J. Tralie, "Cover song synthesis by analogy," in ISMIR, 2018.
[13] H. Foroughmand and G. Peeters, "Music retiler: Using NMF2D source separation for audio mosaicing," in Audio Mostly 2018 on Sound in Immersion and Emotion (AM'18), 2018.
[14] A. Zils and F. Pachet, "Musical mosaicing," in COST G-6 Conference on Digital Audio Effects (DAFX-01), 2001.
[15] D. Ulyanov and V. Lebedev, "Audio texture synthesis and style transfer," online (accessed Sep 29, 2020): https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/.
[16] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez, "Audio style transfer," in ICASSP, 2018.
[17] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton, "Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces," in ISMIR, 2018.
[18] Y.-J. Luo, K. Agres, and D. Herremans, "Learning disentangled representations of timbre and pitch for musical instrument sounds using Gaussian mixture variational autoencoders," in ISMIR, 2019.
[19] A. Bitton, P. Esling, A. Caillon, and M. Fouilleul, "Assisted sound sample generation with musical conditioning in adversarial auto-encoders," in Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19), 2019.
[20] Y.-J. Luo, K. W. Cheuk, T. Nakano, M. Goto, and D. Herremans, "Unsupervised disentanglement of pitch and timbre for isolated musical instrument sounds," in ISMIR, 2020.
[21] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in ICLR, 2013.
[22] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in CVPR, June 2016.
[23] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," in ICLR, 2018.
[24] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, "Learning and using the arrow of time," in CVPR, June 2018.
[25] C. Raffel, Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching, Ph.D. thesis, Columbia University, 2016.
[26] "SoundFonts and SFZ files," in Handbook for MuseScore 3. MuseScore, online (accessed Sep 26, 2020): https://musescore.org/en/handbook/soundfonts-and-sfz-files.
[27] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. 32, no. 2, pp. 236–243, 1984.
[28] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in EMNLP, 2014.
[29] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[30] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in ICML, 2013.
[31] O. Cífka, U. Şimşekli, and G. Richard, "Supervised symbolic music style translation using synthetic data," in ISMIR, 2019.
[32] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Trans. on Audio, Speech, and Lang. Proc., vol. 20, no. 6, pp. 1759–1770, 2012.
[33] D. Bogdanov, N. Wack, E. Gómez Gutiérrez, S. Gulati, H. Boyer, O. Mayor, G. Roma Trepat, J. Salamon, J. R. Zapata González, X. Serra, et al., "Essentia: An audio analysis library for music information retrieval," in ISMIR, 2013.
[34] G. Richard, S. Sundaram, and S. Narayanan, "An overview on perceptually motivated audio indexing and classification," Proceedings of the IEEE, vol. 101, no. 9, pp. 1939–1954, 2013.
[35] E. Hoffer and N. Ailon, "Deep metric learning using triplet network," in International Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 84–92.
[36] "The 'Mixing Secrets' free multitrack download library," online (accessed Sep 25, 2020).
[37] J. Pons, S. Pascual, G. Cengarle, and J. Serrà, "Upsampling artifacts in neural audio synthesis," in ICASSP, 2021.
[38] S. Dieleman, A. van den Oord, and K. Simonyan, "The challenge of realistic music generation: modelling raw audio at scale," in NeurIPS, 2018.
[39] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio."