Supervised Symbolic Music Style Translation Using Synthetic Data
Ondřej Cífka, Umut Şimşekli, Gaël Richard
LTCI, Télécom Paris, Institut Polytechnique de Paris
{ondrej.cifka, umut.simsekli, gael.richard}@telecom-paris.fr
ABSTRACT
Research on style transfer and domain translation has clearly demonstrated the ability of deep learning-based algorithms to manipulate images in terms of artistic style. More recently, several attempts have been made to extend such approaches to music (both symbolic and audio) in order to enable transforming musical style in a similar manner. In this study, we focus on symbolic music with the goal of altering the ‘style’ of a piece while keeping its original ‘content’. As opposed to the current methods, which are inherently restricted to be unsupervised due to the lack of ‘aligned’ data (i.e. the same musical piece played in multiple styles), we develop the first fully supervised algorithm for this task. At the core of our approach lies a synthetic data generation scheme which allows us to produce virtually unlimited amounts of aligned data, and hence avoid the above issue. In view of this data generation scheme, we propose an encoder-decoder model for translating symbolic music accompaniments between a number of different styles. Our experiments show that our models, although trained entirely on synthetic data, are capable of producing musically meaningful accompaniments even for real (non-synthetic) MIDI recordings.
1. INTRODUCTION
Artistic style transfer has become a well-established topic in the computer vision literature and is becoming of increasing interest in other areas of computer science, especially music and natural language processing. More generally, we are dealing with a family of style transformation tasks, where the goal is to alter the style of a piece of data (e.g., an image, a musical piece, a document) while preserving – to some extent – its content. In the music domain, a solution to these problems would have exciting industrial applications, not only as a way to generate new music automatically (as an alternative to fully automatic music composition, which still seems to be a distant goal), but also as a tool for music creators, allowing them to easily incorporate new styles and ideas into their work.
© Ondřej Cífka, Umut Şimşekli, Gaël Richard. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Ondřej Cífka, Umut Şimşekli, Gaël Richard. “Supervised Symbolic Music Style Translation Using Synthetic Data”, 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.
In computer vision, the most popular task in this direction is style transfer, where the algorithm has two inputs: the ‘content’ image to transform and a ‘style’ image, bearing the style that we wish to impose on (or transfer to) the content image. On the other hand, work done on music so far has mostly focused on a different task, which we refer to as style translation. Contrary to style transfer, only the ‘content’ input is given, and the goal is to render it in a target style which is known in advance and usually learned from a large set of examples. Note that although this second task is often also referred to as ‘style transfer’ in the context of music and text generation, we claim that this conflicts with how the term is traditionally understood [11, 13, 38], and that the term ‘translation’ is more appropriate and in line with other prior work [17, 24, 28, 40].
The focus of our work is on the latter task, and more specifically, on accompaniment style translation for symbolic music. In particular, given a piece of music in a symbolic representation, our goal is to generate a new accompaniment for it in a different arrangement style while preserving the original harmonic structure. Even though our approach is generic, to narrow down our scope, we focus on generating bass and piano tracks.
A major difficulty of the music style translation task is that there are no publicly available ‘aligned’ or ‘parallel’ datasets (containing examples of the same music played in different styles). As a result, recent works closely related to ours [4, 5] have adopted unsupervised learning frameworks – variational autoencoders (VAE) [19] and CycleGANs [40] – and applied them to genre-labeled datasets. However, these extensions to symbolic music have not yet produced results as compelling as those on images [22, 40], text [20, 39], and music audio [28].
In this study, we adopt a different strategy to overcome the lack of aligned data, which is to synthesize it. Synthetic training data has proven useful for music information retrieval tasks such as chord recognition [21] and fundamental frequency estimation [25, 32], and is also popular for tasks like semantic segmentation in computer vision [30, 36]. In our case, synthetic data opens up the possibility for supervised learning techniques known from the machine translation field. Moreover, it allows us to work with fine-grained style labels, as opposed to genre labels, which may be too vague or ambiguous for such purposes.
Our main contributions are as follows:
• We propose a supervised, end-to-end neural model for symbolic music style translation, along with a training data generation scheme.
• Our model is able to translate into a large number of different styles by conditioning a single decoder on the target style. To our knowledge, this is the first time this technique has been applied to music translation with some success.
• To evaluate the performance of our model, we propose an objective metric of music style similarity.
• We show that an approach to music style translation based entirely on synthetic data is viable and generalizes well to more ‘natural’ inputs, even in unrelated styles.
We believe that our approach will foster new directions in this line of research; some of these will be briefly discussed in the conclusion. The source code of our system, built using TensorFlow, is available online at https://git.io/musicstyle.
2. RELATED WORK
The work performed so far in the area of music style transformation is relatively small in volume but fairly diverse, since, as noted in [8], the transformations can work with different music representations as well as on different conceptual levels.
To our knowledge, the only work on music style transfer – in the original sense, as discussed in the introduction – has been done on audio. Some approaches [9, 35] combine signal decomposition techniques with musaicing [41] (a form of concatenative synthesis). In [14], the authors attempt to transfer ‘sound textures’ from a recording by means of techniques adapted from image style transfer, but without specific focus on the musical aspects. In both cases, the transformation is largely limited to timbre.
The problem of unsupervised music audio translation is tackled in [28], where the authors train a neural network to translate between a number of domains. For symbolic music, style translation is studied in [4, 5], adapting unsupervised learning techniques from computer vision. A different approach is proposed in [23], which consists of training a model on the target style only and then using pseudo-Gibbs sampling to transform a given piece of music.
Finally, we should mention more ‘constrained’ problems from the symbolic music domain which can also be framed as style translation tasks, e.g. (re-)harmonization [16, 29] and expressive performance generation [12, 24, 37].
3. SYNTHETIC DATA GENERATION
Since we are in a supervised setting, our approach requires a large amount of paired examples where each pair consists of one musical fragment arranged in two different styles. Given that no such dataset is currently available, we created a synthetic one, generated using RealBand from the Band-in-a-Box (BIAB) software package [2].
First, we downloaded chord charts of around 3.5K songs in the BIAB format from a popular online archive [3]. We used BIAB to generate arrangements of these songs in different styles and filtered the resulting MIDI files to keep only those in 4/4 or 3/4 time. (The time signature depends on the style as well as on the song itself; a song originally in 3/4 may have a 4/4 arrangement and vice versa.) We then chopped those files into segments of 8 bars, splitting notes that overlap segment boundaries.
We selected a total of 70 styles from the ‘0 MIDI’ and ‘1 MIDI’ style packs included in Band-in-a-Box 2018, representing a wide variety of popular music genres. Each style contains up to 5 accompaniment tracks (drums, bass, piano, guitar, strings); these labels are not always accurate, e.g. some styles have two guitar tracks, one of which is labeled as piano. We generated each song in 3 randomly picked styles, providing 2 × (3 choose 2) = 6 training pairs per segment, or around 658K training examples in total. An example of a possible training pair is shown in Fig. 1.
Figure 1: Six bars of an accompaniment (piano and bass) for a 12-bar blues, generated using BIAB in a ‘jazz swing’ style (top) and a ‘samba’ style (bottom). The timing is only approximate. The input chord sequence is displayed at the top.
In all experiments, we used 2,809 songs for training, 46 songs as a validation set and 46 songs for evaluation, each in 3 examples in different styles. The song names, along with the styles used for each song, are included in the supplementary material [7].
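As a concrete illustration of the pairing scheme (ours, not taken from the released code), the following sketch shows how the six ordered training pairs for one segment could be assembled from three style renditions; the style names and the dictionary layout are assumptions made for the example.

```python
from itertools import permutations

# Hypothetical example: one 8-bar segment rendered in 3 randomly picked styles.
# The values would be the actual note sequences; strings are used as stand-ins.
renditions = {
    "ZZJAZZSW": "segment rendered in 'Jazz Swing Variation'",
    "LITEPOP":  "segment rendered in a pop style",
    "BOSSA":    "segment rendered in a bossa nova style",
}

# Every ordered pair of distinct styles becomes one (source, target) training
# example, giving 2 * C(3, 2) = 6 pairs per segment.
training_pairs = [
    {"source_style": src, "target_style": tgt,
     "source": renditions[src], "target": renditions[tgt]}
    for src, tgt in permutations(renditions, 2)
]

assert len(training_pairs) == 6
```

Swapping source and target is what yields ordered pairs, so each unordered style combination contributes two training examples.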
4. PROPOSED MODEL
We propose an architecture based on RNN encoder-decoder sequence-to-sequence models with attention [1], commonly employed in machine translation and other areas of natural language processing. This choice is motivated by the successes of RNNs on symbolic music generation [10, 15, 33, 34] and by the ability of the attention mechanism to condition the generation on arbitrary input data without a prior alignment.
Our model is designed so that it is capable of translating music between a potentially large number of different styles. This is achieved by conditioning the decoder on the target style. An obvious advantage of this design is efficiency: to translate between n styles, we only need to train a single model, compared to n models (one for each target style; possibly with a shared encoder as in [28]) or even Θ(n²) models (one for each pair of styles, e.g. [4, 5]). Other implications of this choice are investigated in Section 6.2.
On the other hand, to simplify the task and facilitate evaluation, we train a dedicated model for each target instrument track. Our output representation and decoder architecture are chosen accordingly and would not necessarily be suitable for generating several independent tracks.
Input and output representation.
A common choice of representation of symbolic non-monophonic music for neural processing is a piano roll. We use a binary-valued piano roll with 128 pitches and 4 columns per beat (quarter note) to encode our input.
For representing the output (and also as an alternative input representation), we opted for a MIDI-like encoding, which – unlike a piano roll – is straightforward to model using an RNN decoder. Specifically, following [33], we encode the music as a sequence of 3 types of events, each with one integer argument:
• NoteOn(pitch): start a new note at the given pitch;
• NoteOff(pitch): end the note at the given pitch;
• TimeShift(delta): move forward in time by the specified amount, measured in 12ths of a beat.
NoteOn and NoteOff take values in the range 0–127, whereas TimeShift is within 1–24. In contrast to [33], our representation is tempo-invariant and we do not model dynamics. When encoding the piano track, we compress the sequences by also including a NoteOff(All) event which ends all currently active notes. Fig. 2 illustrates both representations.
Figure 2: A bar of music, represented as a piano roll (top right) and as the following sequence of 20 event tokens: NoteOn(50) TimeShift(9) NoteOn(60) NoteOn(65) NoteOn(69) NoteOn(76) TimeShift(12) NoteOff(60) NoteOff(65) NoteOff(69) NoteOff(76) TimeShift(3) NoteOff(50) NoteOn(43) NoteOn(59) NoteOn(65) NoteOn(69) NoteOn(76) TimeShift(24) NoteOff(All)
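The event encoding itself is mechanical; a minimal sketch (ours, not the paper's implementation) that turns a list of (onset, offset, pitch) notes with times in fractional beats into the token sequence described above could look like this:

```python
def encode_events(notes, steps_per_beat=12):
    """Convert (onset, offset, pitch) notes into NoteOn/NoteOff/TimeShift tokens."""
    # Collect timed events; NoteOff sorts before NoteOn at the same time step.
    timed = []
    for onset, offset, pitch in notes:
        timed.append((round(onset * steps_per_beat), 1, f"NoteOn({pitch})"))
        timed.append((round(offset * steps_per_beat), 0, f"NoteOff({pitch})"))
    timed.sort()

    tokens, current = [], 0
    for step, _, token in timed:
        delta = step - current
        # A TimeShift argument is limited to 1-24 (up to 2 beats), so longer
        # gaps are split into several TimeShift events.
        while delta > 0:
            shift = min(delta, 24)
            tokens.append(f"TimeShift({shift})")
            delta -= shift
        current = step
        tokens.append(token)
    return tokens

# Example: a D3 held for one beat while an F4 starts half a beat later.
print(encode_events([(0.0, 1.0, 50), (0.5, 1.5, 65)]))
```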
Model architecture and training.
The proposed model consists of an encoder and a decoder; the former serves to compute a dense representation of the input, while the latter generates the output event sequence, conditioned on the encoded input and the target style.
The architecture of the encoder depends on the type of input representation:
• If the input is a piano roll, we use a two-layer convolutional network (CNN), followed by a bidirectional RNN with a gated recurrent unit (GRU) [6]. The CNN serves to compress the input, resulting in a sequence of 1280-dimensional vectors with 2 vectors per bar. The bidirectional GRU then adds the ability to incorporate information from a wider context.
• If the input is a sequence of tokens, we use an embedding layer, also followed by a bidirectional GRU.
We refer to the two variants of the model as ‘roll2seq’ and ‘seq2seq’, respectively.
The decoder is also implemented using a GRU, conditioned on the target style and equipped with a feed-forward attention mechanism [1] acting on the encoder outputs. More precisely, as illustrated in Fig. 3, the i-th decoder state s_i is computed as
s_i = GRU([c_i, W_s z, W_e y_{i−1}], s_{i−1}),
where [·] denotes concatenation, z and y_{i−1} respectively denote the one-hot encoded representations of the target style and the previous output event, W_s and W_e are the corresponding embedding matrices, and c_i is the context vector. The latter is a weighted average of the encoder outputs, computed by the attention mechanism. The purpose of attention is to provide an alignment between the encoder and decoder states. The need for this alignment arises from the fact that the positions in the output sequence are not linear in time (due to the chosen encoding), and the decoder therefore needs to be able to move its focus flexibly over the input. For a complete description of attention, see [1].
Figure 3: The attention-based decoder. During the i-th decoding step (here i = 3), a set of coefficients α_ij is computed and used to weight the encoder states h_j = [h_j^fw, h_j^bw] to obtain the context vector c_i, which in turn is used as input for the decoder cell to compute the next state, s_i.
The training pipeline is portrayed in Fig. 4b. Each training example consists of a song segment x in one style (the source style) along with the corresponding segment y in a different style (z, the target style). We train the model by minimizing the loss on y while passing x to the encoder and conditioning the decoder on z.
The models are trained using Adam [18] with learning rate decay and with early stopping on the development set. Our configuration files with complete hyperparameter settings are included with the source code.
Figure 4: A scheme of the training pipeline. (a) We use BIAB to generate each song in different arrangement styles (see Section 3). (b) The model is trained to predict the target-style segment y given a source segment x and the target style z (see Section 4).
Once the model is trained, we perform style translation using greedy decoding, i.e. by taking the most likely output token at every step (and using that as input in the next step). We also explored random sampling with different softmax temperatures, but found that this leads to a higher number of errors (i.e. invalid sequences or incorrect timing) and does not significantly improve the quality of the outputs.
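To make the decoding step more tangible, here is a simplified NumPy sketch of one attention-plus-GRU update following the equation above; it is our own illustration, substituting simple dot-product attention scoring for the feed-forward attention of [1], with toy dimensions chosen arbitrarily.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(encoder_states, s_prev, style_emb, prev_event_emb, params):
    """One decoding step: attention context c_i, then the GRU update for s_i."""
    W_att, (W_z, W_r, W_h) = params
    # Attention: score each encoder state against the previous decoder state
    # (dot-product scoring is an assumption; the paper uses feed-forward attention).
    scores = encoder_states @ (W_att @ s_prev)
    alpha = softmax(scores)                      # alignment coefficients alpha_ij
    c = alpha @ encoder_states                   # context vector c_i

    # GRU input: [c_i, W_s z, W_e y_{i-1}] as in the equation above.
    x = np.concatenate([c, style_emb, prev_event_emb])
    xs = np.concatenate([x, s_prev])
    z_gate = 1 / (1 + np.exp(-(W_z @ xs)))       # update gate
    r_gate = 1 / (1 + np.exp(-(W_r @ xs)))       # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([x, r_gate * s_prev]))
    return (1 - z_gate) * s_prev + z_gate * h_tilde, alpha

# Toy dimensions: 16 encoder steps of size 32, decoder state of size 24,
# style and event embeddings of size 8 each.
rng = np.random.default_rng(0)
enc = rng.normal(size=(16, 32))
s = np.zeros(24)
params = (rng.normal(size=(32, 24)) * 0.1,
          tuple(rng.normal(size=(24, 32 + 8 + 8 + 24)) * 0.1 for _ in range(3)))
s, alpha = decoder_step(enc, s, rng.normal(size=8), rng.normal(size=8), params)
```

Greedy decoding then repeatedly applies such a step, projects the new state to a distribution over event tokens, and feeds the most likely token back in as the next y_i.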
5. EVALUATION METRICS
When evaluating a style transformation, we need to consider two complementary criteria: how well the transformed music fits the desired style (style fit) and how much content it retains from the original (content preservation). Note that it is trivial (but useless) to achieve perfect results on either of these two criteria alone, so it is essential to evaluate both of them.
In this section, we describe ‘objective’, automatically computed metrics for both criteria. Even though we believe these metrics are sound and well-motivated, we acknowledge the limitations of automatic metrics in general and encourage the reader to listen to the provided example outputs [7] to get a real sense of their quality.
Content preservation.
We use a content preservation metric similar to the one proposed by [23], computed by correlating the chroma representation of the generated segment with that of the corresponding segment in the source style. This is motivated by the fact that we expect the output to follow the same sequence of chords as the input. More precisely, we compute chroma features for each segment at a rate of 12 frames per beat and smooth each of them using an averaging filter with a window size of 2 beats (24 frames) and a stride of 1 beat (12 frames). Finally, we calculate the average frame-wise cosine similarity between the two sets of chroma features.
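Assuming both segments are available as binary piano rolls with 128 pitch rows sampled at 12 frames per beat, the metric could be sketched as follows (our reading of the description above, not the authors' code):

```python
import numpy as np

def chroma(piano_roll):
    """Fold a (128, time) piano roll into a (12, time) chroma representation."""
    pitch_classes = np.arange(piano_roll.shape[0]) % 12
    out = np.zeros((12, piano_roll.shape[1]))
    for pc in range(12):
        out[pc] = piano_roll[pitch_classes == pc].sum(axis=0)
    return out

def smooth(chroma_feats, window=24, stride=12):
    """Average over 2-beat windows (24 frames) with a 1-beat (12-frame) stride."""
    frames = [chroma_feats[:, t:t + window].mean(axis=1)
              for t in range(0, chroma_feats.shape[1] - window + 1, stride)]
    return np.stack(frames, axis=1)

def content_preservation(output_roll, source_roll, eps=1e-9):
    a, b = smooth(chroma(output_roll)), smooth(chroma(source_roll))
    # Average frame-wise cosine similarity between the two chroma sequences.
    sims = (a * b).sum(axis=0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps)
    return float(sims.mean())
```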
Style fit.
In some of the recent music style transformation works [4, 5], the quality of a transformation is measured by means of a binary style classifier trained on a pair of styles. However, the merit of such evaluation is limited, since a high classifier score merely demonstrates that the output has some of the distinguishing features of the target style, and not necessarily that it actually fits the style. For this reason, we aim for a more interpretable metric of style fit.
As observed by [16, 26, 31], musical style is well captured in pairwise statistics between neighboring events. Drawing inspiration from the features proposed in [26], we devise a key- and time-invariant style representation which we call the style profile.
To compute the style profile, we consider all pairs of note onsets less than 4 beats apart and at most 20 semitones apart, and record the time difference and interval for each pair. In other words, we define the following multiset of ordered pairs:
S = { (t_b − t_a, p_b − p_a) | a, b ∈ notes, a ≠ b, 0 ≤ t_b − t_a < 4, |p_b − p_a| ≤ 20 },
where t_x is the onset time of the note x (measured in fractional beats) and p_x is its MIDI note number. We then obtain the style profile as a normalized 2D histogram of S with 6 bins per beat and one bin per semitone, and flatten it to get a 984-dimensional vector.
Finally, to quantify the style fit of a particular set of outputs, we compute their style profile and measure its cosine similarity to a reference profile. Note that an 8-bar segment may not be sufficient to obtain a reliable style profile; instead, we always aggregate the statistics over a number of segments. In particular, we put forward two variants of the style fit metric, obtained as follows:
(a) Compute a style profile aggregated over all outputs of a model in a given target style and measure its cosine similarity to the reference.
(b) Compute a style profile for each translated song separately and measure its cosine similarity to the reference. We report the mean and standard deviation over all songs.
We refer to (a) and (b) as ‘macro-style’ and ‘song-style’, respectively. In both cases, the reference style profile is extracted from the training set, separately for each track.
While we do not claim that this metric is able to distinguish between broad style categories (such as genres), it can definitely capture the differences and similarities between specific ‘grooves’, which makes it well suited for our purpose. This is illustrated in Fig. 5, which shows the pairwise similarities between the profiles of the bass tracks of different BIAB styles, with clearly visible clusters of jazz, rock or country styles.
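Under the same assumptions (onsets in fractional beats, pitches as MIDI note numbers), a style profile and the resulting style fit score could be computed roughly like this; again, this is our own sketch of the definition, not the released implementation:

```python
import numpy as np

def style_profile(notes):
    """notes: list of (onset_in_beats, midi_pitch). Returns a 984-dim profile."""
    diffs = [(tb - ta, pb - pa)
             for i, (ta, pa) in enumerate(notes)
             for j, (tb, pb) in enumerate(notes)
             if i != j and 0 <= tb - ta < 4 and abs(pb - pa) <= 20]
    if not diffs:
        return np.zeros(24 * 41)
    dt, dp = zip(*diffs)
    # 6 bins per beat over 4 beats = 24 time bins; intervals -20..20 = 41 bins,
    # giving the 24 * 41 = 984-dimensional vector mentioned above.
    hist, _, _ = np.histogram2d(dt, dp, bins=[24, 41], range=[[0, 4], [-20.5, 20.5]])
    hist = hist.flatten()
    return hist / hist.sum()

def style_fit(output_notes, reference_profile):
    p = style_profile(output_notes)
    return float(p @ reference_profile /
                 (np.linalg.norm(p) * np.linalg.norm(reference_profile) + 1e-9))
```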
6. EXPERIMENTAL RESULTS
In our experiments, we focus on generating the bass and piano tracks, and we train a dedicated model for each of them. For each track, we consider two scenarios: generating the track given only the corresponding source track (BASS → BASS, PIANO → PIANO), and using all non-drum accompaniment tracks from the input (ALL → BASS, ALL → PIANO).
For BASS → BASS, we compare the seq2seq and roll2seq architectures defined in Section 4. For all other pairs, where the input is non-monophonic, we only employ roll2seq, since the sequential representation grows disproportionately in length in these cases and the computational cost of the attention mechanism becomes too heavy.
We evaluate our models on our synthetic test set generated by BIAB and on the Bodhidharma MIDI dataset [27]. The latter is a diverse collection of 950 MIDI recordings annotated with genre labels. We filtered and pre-processed the dataset in the same way as the synthetic test set and we extracted the bass and piano tracks. (To form the bass track, we retrieve all notes assigned to any Bass instrument; for the piano track, we use the Piano and Organ classes.)
We also made extensive attempts to train the recent models of [4, 5, 23] on our data using the source code published by the authors, but unfortunately without success. This has prevented us from comparing these models with our proposal. Nonetheless, the provided outputs [7] can serve as a basis for perceptual comparison.
Figure 5: Pairwise cosine similarities of selected style profiles computed on training bass tracks. The styles are ordered based on a hierarchical clustering of the profiles.
For a comprehensive evaluation of each model, we translated all inputs to all 70 styles and calculated the content preservation and style fit metrics. The results (averaged) are presented in Fig. 6.
We provide two baselines for each track (bass and piano): ‘source’, which is simply the same track before the translation, and ‘reference’, which is a track generated by BIAB based on the chord chart (only available for the synthetic test set). As expected, the style fit is low for the source track (measured with respect to the target style) and close to 1 for the reference track. Our models’ outputs generally do not fit the target style as perfectly as the reference does, but still score high compared to the source.
As for content preservation, we can notice that the reference value is quite low (0.78 for BASS and 0.79 for PIANO). This should not be too surprising, since we are comparing accompaniments in two different styles, which might have different pitch-class distributions; moreover, there is some random harmonic variation within each style (see e.g. bars 5–6 in Fig. 1). The results achieved by our models on the synthetic test set are very close to the reference. To illustrate the value range of the metric, we provide the results obtained by a ‘randomized’ baseline (shown as ‘random’ in Fig. 6), where we randomly permuted the reference segments for each style (obtaining a reference with the correct style, but the wrong content). The resulting value is very low (0.16 for BASS and 0.31 for PIANO) compared both to the true reference and to our models, indicating that the metric is useful and the models are performing well.
On Bodhidharma, content preservation is generally weaker than on the synthetic test set. One interpretation can be that the encoder simply fails to extract the content information accurately, since it was trained on a different domain. However, we also find that the models often make timing errors on Bodhidharma inputs, leading to misalignment between the input and the output, which may also cause the content preservation metric to drop.
On the other hand, the style fit on Bodhidharma is close to the results on the synthetic test set (and not consistently lower or higher), and the difference to ‘source’ (i.e. the corresponding input track) is more marked, perhaps reflecting a higher style variability in the Bodhidharma data.
Upon listening, we clearly observe that the outputs are musical and seem to both fit the target style and follow the harmonic structure of the inputs. Besides, even though the piano and the bass tracks are generated independently, they sound surprisingly coherent. However, as mentioned above, we also observe occasional timing errors (especially in heavily syncopated grooves), which become more prominent when the bass and piano tracks are combined. A potential remedy for this issue would be to modify the encoding to make it more robust, e.g. by representing the timing in a beat-aware manner.
We also note that the single-track models output harmonically incorrect notes more often than the ALL models; this is expected, since their input is less harmonically rich. This effect is clearly audible (especially in BASS, where important scale degrees are often missing in the input), but cannot be captured by the content preservation metric, which is computed against the same input.
All models presented so far were trained on music in 70 different styles, as opposed to a single style pair. To investigate the effect of this choice, we picked a pair of fairly dissimilar styles – ZZJAZZSW (‘Jazz Swing Variation’) and TWIST (‘Twist Style’, categorized as ‘Lite Pop’) – and generated a new training, validation and test set with each song rendered in these two styles only. To increase the amount of data, we performed this twice for each song (with different results), obtaining 2 × 2 training pairs per segment.
We used this new dataset to train single-style-pair versions of all models (in the ZZJAZZSW → TWIST direction only), preserving the original architectures except for the conditioning on the target style. We compare these ‘1 → 1’ models with the corresponding models trained on all 70 styles (‘1 → 70’) on two sets of inputs:
• the synthetic test set in the ZZJAZZSW style;
• the ‘Swing’ section of Bodhidharma (23 songs).
In Fig. 7, we show the results for the two variants of the ALL → BASS model. While the performance on the synthetic data seems to be the same, the scores of the 1 → 1 model are noticeably lower on the Bodhidharma inputs, suggesting that training on a large set of styles helps the model generalize beyond the synthetic domain.
Figure 6: Evaluation results on content preservation and style fit for (a) BASS and (b) PIANO, on the synthetic test set and on Bodhidharma. ‘Source’ is the original track (bass or piano), ‘reference’ is a track generated by BIAB in the target style and ‘random’ is a random permutation of the references. For ‘song-style’, we plot the mean and the standard deviation over all songs and target styles.
Figure 7: Comparison of a single-style-pair model (1 → 1) with the corresponding model trained on all styles (1 → 70) on the ZZJAZZSW → TWIST style pair, evaluated on the synthetic test set (ZZJAZZSW) and on the ‘Swing’ section of Bodhidharma.
Neural representation spaces are often found to exhibit a meaningful geometry, and our learned style embedding space is no exception. As an example, Fig. 8 shows a projection of the embeddings labeled by the ‘feel’ of each style, with ‘even’ and ‘swing’ feel styles being clearly separated. We include more plots in the supplementary material and also make available an interactive visualization (https://bit.ly/2G5Jgnq).
Figure 8: Style embeddings learned by the ALL → PIANO model, labeled with ‘feel’ annotations provided by BIAB (even 8ths, even 16ths, swing 8ths, swing 16ths). Dimensionality reduction was performed using linear discriminant analysis (LDA) with the feel labels as targets.
7. CONCLUSION
In this study, we focused on symbolic music accompaniment style translation. As opposed to the current methods, which are inherently restricted to be unsupervised due to the lack of aligned datasets, we developed the first fully supervised algorithm for this task, leveraging the power of synthetic training data. Our experiments show that our models are capable of producing musically meaningful accompaniments even for real MIDI recordings.
We believe that these results point to interesting research directions. First, synthetic data seems to be an excellent resource for music style translation, and could be used as a starting point even for unsupervised learning, allowing one to validate a given approach before moving on to more challenging, unaligned datasets. Second, our supervised approach could be used to address more general music transformation tasks, and we are already working on an extension in this direction.
8. ACKNOWLEDGEMENT
This research is supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068 (MIP-Frontiers) and by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project.
9. REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[2] Band-in-a-Box. PG Music Inc.
[3] Band-in-a-Box (BIAB) file archive. https://groups.yahoo.com/group/Band-in-a-Box-Files/.
[4] Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. In ISMIR, 2018.
[5] Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Sumu Zhao. Symbolic music genre transfer with CycleGAN. CoRR, abs/1809.07575, 2018.
[6] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[7] Ondřej Cífka. Supplementary material: Supervised symbolic music style translation using synthetic data. Zenodo, 2019. https://doi.org/10.5281/zenodo.3250606.
[8] Shuqi Dai, Zheng Zhang, and Gus Xia. Music style transfer: A position paper. CoRR, abs/1803.06841, 2018.
[9] Jonathan Driedger, Thomas Prätzlich, and Meinard Müller. Let it Bee – towards NMF-inspired audio mosaicing. In ISMIR, 2015.
[10] Douglas Eck and Jürgen Schmidhuber. Finding temporal structure in music: blues improvisation with LSTM recurrent networks. In NNSP, 2002.
[11] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, 2001.
[12] Sebastian Flossmann and Gerhard Widmer. Toward a multilevel model of expressive piano performance. In ISPS, 2011.
[13] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, pages 2414–2423, 2016.
[14] Eric Grinstein, Ngoc Q. K. Duong, Alexey Ozerov, and Patrick Pérez. Audio style transfer. In ICASSP, pages 586–590, 2018.
[15] Gaëtan Hadjeres and François Pachet. DeepBach: a steerable model for Bach chorales generation. In ICML, 2017.
[16] Gaëtan Hadjeres, Jason Sakellariou, and François Pachet. Style imitation and chord invention in polyphonic music with exponential families. CoRR, abs/1609.05152, 2016.
[17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 5967–5976, 2017.
[18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
[19] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2014.
[20] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. CoRR, abs/1711.00043, 2017.
[21] Kyogu Lee and Malcolm Slaney. Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. IEEE Transactions on Audio, Speech, and Language Processing, 16:291–301, 2008.
[22] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
[23] Wei-Tsung Lu and Li Su. Transferring the style of homophonic music using recurrent neural networks and autoregressive models. In ISMIR, 2018.
[24] Iman Malik and Carl Henrik Ek. Neural translation of musical style. CoRR, abs/1708.03535, 2017.
[25] Matthias Mauch and Simon Dixon. PYIN: A fundamental frequency estimator using probabilistic threshold distributions. In ICASSP, pages 659–663, 2014.
[26] Cory McKay. Automatic genre classification of MIDI recordings. M.A. Thesis, McGill University, 2004.
[27] Cory McKay and Ichiro Fujinaga. The Bodhidharma system and the results of the MIREX 2005 symbolic genre classification contest. In ISMIR, 2005.
[28] Noam Mor, Lior Wolf, Adam Polyak, and Yaniv Taigman. A universal music translation network. CoRR, abs/1805.07848, 2018.
[29] François Pachet and Pierre Roy. Non-conformant harmonization: the Real Book in the style of Take 6. In ICCC, 2014.
[30] Germán Ros, Laura Sellart, Joanna Materzynska, David Vázquez, and Antonio M. López. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, pages 3234–3243, 2016.
[31] Jason Sakellariou, Francesca Tria, Vittorio Loreto, and François Pachet. Maximum entropy models capture melodic styles. Scientific Reports, 2017.
[32] Justin Salamon, Rachel M. Bittner, Jordi Bonada, Juan J. Bosch, Emilia Gómez, and Juan Pablo Bello. An analysis/synthesis framework for automatic F0 annotation of multitrack datasets. In ISMIR, 2017.
[33] Ian Simon and Sageev Oore. Performance RNN: Generating music with expressive timing and dynamics. Magenta Blog, 2017. https://magenta.tensorflow.org/performance-rnn.
[34] Bob L. Sturm, João Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning. CoRR, abs/1604.08723, 2016.
[35] Christopher J. Tralie. Cover song synthesis by analogy. In ISMIR, 2018.
[36] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, pages 4627–4635, 2017.
[37] Gerhard Widmer, Sebastian Flossmann, and Maarten Grachten. YQX plays Chopin. AI Magazine, 30:35–48, 2009.
[38] Xuexiang Xie, Feng Tian, and Seah Hock Soon. Feature guided texture synthesis (FGTS) for artistic style transfer. In DIMEA, 2007.
[39] Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. Adversarially regularized autoencoders. In ICML, 2018.
[40] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2242–2251, 2017.
[41] Aymeric Zils and François Pachet. Musical mosaicing. In