Generating Music with a Self-Correcting Non-Chronological Autoregressive Model
Wayne Chi*, Prachi Kumar*, Suri Yaddanapudi, Rahul Suresh, Umut Isik
*Equal contribution
Amazon Web Services
ABSTRACT
We describe a novel approach for generating music using a self-correcting, non-chronological, autoregressive model. We represent music as a sequence of edit events, each of which denotes either the addition or removal of a note, including a note previously generated by the model. During inference, we generate one edit event at a time using direct ancestral sampling. Our approach allows the model to fix previous mistakes, such as incorrectly sampled notes, and prevents the accumulation of errors that autoregressive models are prone to. Another benefit is finer, note-by-note control during human and AI collaborative composition. We show through quantitative metrics and a human survey evaluation that our approach generates better results than orderless NADE and Gibbs sampling approaches.
1. INTRODUCTION
There have been two primary approaches to generating music with deep neural network-based generative models. In the first class, music generation is essentially treated as an image generation problem [1, 2]. In the second class, music generation is treated as a musical time series generation problem, analogous to autoregressive language modeling [3-7]. The human process of music composition, however, is often non-chronological. Notes can be filled in at any time throughout the piece to create new chords and melodies, add harmony, or embellish existing motifs.

In this work, we propose ES-Net, a method that uses elements from both the image-based and time series generation techniques. Our method operates on piano roll images with a 2D convolutional neural network, but autoregressively adds or removes notes one at a time in an arbitrary, non-chronological order.

Code: https://git.io/esnet
Samples: https://git.io/esnet-samples
We model the conditional distribution of note add or remove events given pre-existing notes. After sampling from the distribution, we re-input the resulting piano roll into the model to obtain the distribution of the next add and remove events. From a probabilistic point of view, this corresponds to considering each piano roll as obtained from a randomly ordered sequence of add and remove events and autoregressively modeling the distribution of such sequences of events.

Poor samples due to accumulation of errors are a well-documented problem with autoregressive models [8-11], especially when sampling directly from the conditional distribution (i.e., direct ancestral sampling). While other sampling techniques such as Gibbs sampling [12] can be used to bypass this problem, we show that direct ancestral sampling is sufficient if the data representation includes removal of past samples. This allows the model to detect previous mistakes and fix them.

Our primary use case is melody assistance for users composing music. Users can feed in a melody as a conditional input and have the model generate musical accompaniments as well as fix any off-beat or out-of-tune inputs. One distinct advantage of our approach is that it allows note-by-note control for users. A user can undo and redo the generation of individual notes, or explicitly add and remove individual notes, to collaborate with the model and guide the composition process. This approach therefore gives users a finer degree of control during sampling and better promotes human and AI collaboration.

The remainder of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we show how to model a distribution of musical pieces using a new representation of music. In Section 4 we discuss our training procedure. In Section 5 we discuss our sampling procedure. In Section 6 we provide empirical results in the form of quantitative metrics and human evaluation compared against other approaches. Finally, in Sections 7 and 8 we describe future work and conclusions.
2. RELATED WORK
Following the introduction of NADE [13, 14] and orderless NADE [15], several works have built upon the concept of ordered and unordered autoregressive models. Coconet, the algorithm behind Google's Bach Doodle (https://magenta.tensorflow.org/coconet), is a machine learning model that also uses a convolutional model to generate music by adding counterpoint to existing user input [12]. The difference with this work is that Coconet's inference uses Gibbs sampling rather than direct ancestral sampling. DeepBach [16] generates Bach-style chorales using pseudo-Gibbs sampling. PixelCNN [17] models an image autoregressively and generates pixels one by one in a pre-specified order, while our generation is unordered. In an NLP setting, recent works also explore non-left-to-right ordering [18, 19] and deletion [20].

In general, there is a rich history of using deep learning to generate music [21]. Many of these works use autoregressive approaches. RNN-RBM models temporal dependencies to generate polyphonic music in a single track [22]. Hierarchical RNNs have been used to encode different features of pop music [23]. LSTMs have also been able to successfully model and generate music [24]. Music Transformer is able to capture and generate music with long-term structure and motifs [3]. So far, these approaches have been mostly chronological, while ours is non-chronological. While GAN-based approaches clearly differ from ours, these methods have shown the ability to generate high-quality music. MuseGAN is a GAN-based approach for multi-track piano roll generation [1]. MidiNet uses a CNN-based GAN to generate music [2]. C-RNN-GAN generates music using an RNN-based architecture with adversarial training [25]. SeqGAN uses GANs for sequence generation and applies them to music generation [26].
3. PROBLEM DEFINITION
We consider a musical piece x ∈ X as a point in {0,1}^(T×P), where T is the number of time steps and P is the number of note pitches. This represents a simplified piano roll (PR), a discrete representation of music as an image matrix across pitch and time. There exists a probability density function p_PR(x) on {0,1}^(T×P) of musical pieces. Note in particular that this does not model velocity and that notes adjacent in time are treated as one continuous held note; we discuss ways to represent velocity and repeated notes in Section 7.

Instead of modeling p_PR(x) on {0,1}^(T×P) directly, we model the distribution as p_ES(s) on the set of edit sequences (ES). An edit sequence of length M is a tuple of M edit events, where an edit event is a matrix e(t,p) ∈ {0,1}^(T×P) that has one entry equal to one and all other entries equal to zero (i.e., a one-hot matrix). We denote the set of all edit events by E and the set of edit sequences of length M by E^M. The following maps edit sequences to piano rolls:

$$\pi : \bigcup_{M=1}^{\infty} E^{M} \to \{0,1\}^{T \times P} \qquad (1)$$

$$\pi(e_1, \ldots, e_M) = \sum_{i=1}^{M} e_i \pmod{2}, \qquad (2)$$

where (2) allows edit events to handle either note addition or removal, depending on whether a previous edit event exists at the same time and pitch.

Figure 1. Mapping from an edit sequence (left) of length M to a piano roll (right). Each slice in an edit sequence is the addition or removal of a note.

The mapping between the two joint probability distributions is as follows:

$$p_{PR}(x) = p_{PR}\big(\{(t_1, p_1), \ldots, (t_N, p_N)\}\big) = \sum_{s \in \pi^{-1}(x)} p_{ES}(s), \qquad (3)$$

where N is the number of notes in the piano roll, (t_i, p_i) is the time and pitch of a note or edit event, π^{-1}(x) is the inverse image set of x under π, and s is a sequence of edit events (t_1, p_1), ..., (t_M, p_M) with M ≥ N. We can further factorize p_ES(s) as:

$$p_{ES}(s) = p_{ES}\big((t_1, p_1), \ldots, (t_M, p_M)\big) = \prod_{i=1}^{M} p_{ES}\big((t_i, p_i) \mid (t_1, p_1), \ldots, (t_{i-1}, p_{i-1})\big). \qquad (4)$$

We assume that p_ES((t_i, p_i) | (t_1, p_1), ..., (t_{i-1}, p_{i-1})) is ordering invariant (i.e., the ordering of edit events in an edit sequence does not affect the resulting piano roll).

Our goal is to train a model to learn the distribution of edit sequences p_ES(s). By sampling autoregressively from p_ES(s), we generate a sequence of edit events that can be mapped back into a piano roll representation and then converted to MIDI.

We compare our approach to orderless NADE, which generates music by randomly choosing an ordering and sampling notes one by one until termination. We can represent the iterative notewise addition of orderless NADE as a special case of edit sequences where edit events can only represent note addition. Let us call this distribution p_O-NADE(x). Since notes are only added, M = N for unconditioned generation; thus, there is a finite set of orderings and we can factorize p_O-NADE(x) as:

$$\sum_{\sigma \in S_N} \prod_{i=1}^{N} p_{\text{O-NADE}}\big((t_{\sigma(i)}, p_{\sigma(i)}) \mid (t_{\sigma(1)}, p_{\sigma(1)}), \ldots, (t_{\sigma(i-1)}, p_{\sigma(i-1)})\big),$$

where S_N is the set of all permutations {1, ..., N} → {1, ..., N}. This factorization is equivalent to orderless NADE [15]. In practice, the orderless NADE approach leads to poorer musical samples due to accumulation of errors, which we confirm in Section 6.
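To make the edit sequence representation concrete, the sketch below applies a sequence of edit events to a piano roll using the mod-2 accumulation of Equation (2). The array sizes and helper name are illustrative assumptions, not the released implementation.

```python
import numpy as np

T, P = 128, 46  # time steps and pitches; illustrative sizes, not necessarily the paper's

def apply_edit_sequence(edit_events, T, P):
    """Map an edit sequence to a piano roll via mod-2 accumulation (Equation 2).

    Each edit event is a (time, pitch) pair; an even number of events at the same
    cell cancels out (add then remove), an odd number leaves a note.
    """
    piano_roll = np.zeros((T, P), dtype=np.uint8)
    for t, p in edit_events:
        piano_roll[t, p] ^= 1  # XOR toggles the cell: add if empty, remove if present
    return piano_roll

# The same piano roll can arise from many edit sequences: the second sequence
# adds a spurious note and later removes it, landing on the same result.
roll_a = apply_edit_sequence([(0, 10), (4, 12)], T, P)
roll_b = apply_edit_sequence([(0, 10), (7, 30), (4, 12), (7, 30)], T, P)
assert np.array_equal(roll_a, roll_b)
```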
4. TRAINING
Given an input piano roll I and a target piano roll T, we train our model to output the conditional probabilities p_ES((t_i, p_i) | (t_1, p_1), ..., (t_{i-1}, p_{i-1})) of the next edit event in edit sequences that can recreate the target from the input piano roll. For each piano roll in the training set, we generate the input piano roll by (a) masking existing random notes and (b) adding extraneous random notes to the target piano roll. We train the model to recreate T from I; augmentation (a) trains the model to add notes and (b) trains the model to remove notes. For each target piano roll, we generate multiple augmented inputs with varying numbers of notes masked and added, in order to train the model to handle varying numbers of differences between the input and target. We find that masking between 0 and 100 percent of all existing notes and adding 0 to 1.5 percent of all possible extraneous notes gives us the best results.

Our goal is to have the model output the conditional probabilities for the next edit event. Since we assume ordering invariance in (4), we can also assume that every note difference between I and T, whether it requires the addition or removal of a note, is equally likely to be the next edit event. Thus, we model the distribution of edit events for the next step as the uniform distribution U supported on the symmetric difference I Δ T between I and T (i.e., the exclusive or of each note between I and T).

We use the Kullback-Leibler divergence between U and the model's output distribution as the loss function:

$$\mathcal{L}(I, T, P) = D_{KL}(P \,\|\, U), \qquad (5)$$

where P is the softmax over the model's logits at each time and pitch. Normally, binary cross-entropy loss, where the label is the next note, would be used; but since we assume ordering invariance in (4), the next note is equally likely to be any of the future notes. Therefore, training with (5) is equivalent to training many times where the label is randomly chosen from future notes.

We train a model based on the U-Net architecture [27]. This choice is not critical, as our approach should generalize to other CNN architectures; we describe ours for reproducibility. Our U-Net contains five downsampling blocks and five upsampling blocks. Each block contains a batch normalization layer, two 2D convolutional layers with 3x3 kernels, a max pooling layer, and a dropout layer with a 0.5 dropout rate. We begin with 32 filters, double the number of filters after each downsampling block, and halve the number of filters after each upsampling block. We use the Adam optimizer [28] with a learning rate of 0.001. We use ReLU as our activation function, except for the final layer, where we output a linear activation at each time and pitch. Finally, we apply softmax over the logits when calculating the loss and during sampling. A sketch of the augmentation and target construction appears below.
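The following sketch shows how an (input, target) training pair and the uniform target distribution over the symmetric difference could be built. The masking and extraneous-note ratios follow the ranges quoted above, but the function name and sampling details are our own assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_pair(target_roll, max_mask_frac=1.0, max_extra_frac=0.015):
    """Build an augmented input roll and the uniform edit-event target.

    (a) mask a random fraction of existing notes  -> teaches the model to add notes
    (b) add a random fraction of extraneous notes -> teaches the model to remove notes
    """
    input_roll = target_roll.copy()

    # (a) mask between 0 and 100% of existing notes
    note_idx = np.argwhere(target_roll == 1)
    n_mask = rng.integers(0, int(len(note_idx) * max_mask_frac) + 1)
    for t, p in note_idx[rng.permutation(len(note_idx))[:n_mask]]:
        input_roll[t, p] = 0

    # (b) add up to 1.5% of all possible extraneous notes
    empty_idx = np.argwhere(target_roll == 0)
    n_extra = rng.integers(0, int(target_roll.size * max_extra_frac) + 1)
    for t, p in empty_idx[rng.permutation(len(empty_idx))[:n_extra]]:
        input_roll[t, p] = 1

    # Uniform distribution U over the symmetric difference I Δ T (target for Equation 5)
    sym_diff = np.logical_xor(input_roll, target_roll).astype(np.float32)
    uniform_target = sym_diff / max(sym_diff.sum(), 1.0)
    return input_roll, uniform_target

target = (rng.random((128, 46)) < 0.05).astype(np.uint8)  # toy target piano roll
inp, U = make_training_pair(target)
```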
5. INFERENCE
We sample from the model's output probabilities through direct ancestral sampling. We feed the input melody to the model, sample from the softmax over all times and pitches to determine the next edit event, modify the input melody based on that edit event, and then feed that melody back into the model. We repeat this over multiple iterations, conditioning each time on our previous predictions. Since we do not differentiate between adding and removing notes during training, the sampling process is the same for any type of edit event.

We allow users to restrict the number of notes to remove; this prevents the model from completely overwriting the original input. We also allow users to control how many sampling iterations are performed. Lastly, we allow the user to change the temperature during sampling. By changing the shape of the distribution, users can make compositions more or less "creative" at the risk of lowering quality. We surface these hyperparameters to give users more freedom and customizability when generating music compositions. A minimal version of this sampling loop is sketched below.
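A minimal sketch of the ancestral sampling loop, assuming a model that maps a piano roll to one logit per (time, pitch) cell; the temperature handling, iteration count, and function name are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def sample_edit_events(model, piano_roll, n_iters=2000, temperature=1.0, rng=None):
    """Direct ancestral sampling: repeatedly sample one edit event and toggle it.

    `model(roll)` is assumed to return a (T, P) array of logits; sampling a cell
    that already holds a note removes it, otherwise the note is added.
    """
    rng = rng or np.random.default_rng()
    roll = piano_roll.copy()
    for _ in range(n_iters):
        logits = model(roll) / temperature
        probs = np.exp(logits - logits.max())      # softmax over all (time, pitch) cells
        probs = (probs / probs.sum()).ravel()
        idx = rng.choice(probs.size, p=probs)      # sample the next edit event
        t, p = np.unravel_index(idx, roll.shape)
        roll[t, p] ^= 1                            # add the note, or remove it if present
    return roll
```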
6. EMPIRICAL EVALUATION
We compare our approach against orderless NADE and Gibbs sampling using quantitative metrics and a human survey evaluation. We also describe a notewise approximate log likelihood calculation for our approach and explain why log likelihood is not a good metric for comparing our approach to orderless NADE. We build an orderless NADE model using the approach described in Section 3, training with masked notes only. We use Coconet [12] to represent Gibbs sampling. While our main focus is to use Coconet only for sampling technique comparisons, there are a few notable differences between Coconet and our approach. First, Coconet does not explicitly train the model to remove notes, although notes, including the input, may be removed during the Gibbs sampling masking process; our approach explicitly models note removal. Second, Coconet assumes that there are four instruments and that "each instrument plays exactly one pitch a time" [12]; our approach has no such constraint and can generate music across all times and pitches. Third, Coconet trains a CNN that preserves the same size for each layer, whereas we train a model based on the U-Net architecture. Since Coconet is trained on the JSB Chorales dataset, we evaluate our results and orderless NADE's results using the same dataset and the same train-val-test split in order to provide a fair comparison. For all other parameters (e.g., temperature), we maintain identical settings for each approach in order to benchmark fairly.

Figure 2. Input piano roll (left), target piano roll (middle), and symmetric difference between the input and target piano rolls (right).
We use the Infinite Bach dataset (https://github.com/jamesrobertlloyd/infinite-bach) and the JSB Chorales dataset (https://github.com/czhuang/JSB-Chorales-dataset). The JSB Chorales and Infinite Bach datasets contain MIDI files (382 for JSB Chorales and 498 for Infinite Bach) of chorales harmonized by J.S. Bach. The MIDI files in the Infinite Bach dataset are generally longer in duration, allowing for approximately three times more samples overall compared to the JSB Chorales dataset. Since the Gibbs sampling model is trained on JSB Chorales, we use the JSB Chorales dataset for benchmarking.

For both datasets, we preprocess the data by: 1) mapping MIDI to its piano roll representation using a sixteenth-note quantization, 2) converting multi-track inputs into a single track by merging all tracks, and 3) splitting each MIDI file into multiple 2- or 8-bar samples.

We calculate log likelihood using equation (3). Since for each piano roll x the inverse image π^{-1}(x) is infinite, the sum cannot be calculated exactly; thus, we calculate an approximate log likelihood over a subset of all possible edit sequences in π^{-1}(x). This value lower bounds the true log likelihood. We compare this lower bound to the log likelihood for orderless NADE. Since our method removes notes as well, the proposed model is modeling a distribution with larger support, so we do not expect the likelihood value of our method to be better than orderless NADE's. Our likelihood values show that, in the toy case when the sum can be sufficiently expanded, the likelihood lower bound approaches that of orderless NADE.

Consider a graph where each vertex corresponds to a piano roll state and each edge corresponds to an edit event. A path in the graph corresponds to an edit sequence as described in equation (2). As we traverse a path, we calculate the log likelihood of the edit sequence corresponding to that path.

For each input I and target T pairing, we calculate our log likelihood over multiple levels, traversing over edit sequences of length K + 2d at level d. K is the minimum number of edit events needed to reach the target from the input; all K edit events are unique along time and pitch. For level d = 0, there exist K! different edit sequences. We calculate the average log likelihood over a randomly chosen subset of these edit sequences and approximate it over all K! edit sequences. During the traversal we keep track of the most probable (time, pitch) predictions that do not occur in the edit sequences, and add them to a pool Q. We keep these predictions as they will appear in the most probable edit sequence paths at level d = 1. For level d = 1, we traverse down the same paths, but we add two edit events with the same time and pitch chosen from Q to the path. This increases the path length to K + 2 and results in the same target piano roll, since the two new edit events cancel out. We approximate the log likelihood sum over all possible edit sequences. We repeat this for each (time, pitch) pair in Q. This process can be repeated until level d = D, expanding our coverage of the edit sequence graph along the most probable paths.

We calculate the approximate log likelihood as:

$$\log \sum_{d=0}^{D} \sum_{Q} \frac{(K+2d)!}{2^{d}\,|S|} \sum_{s \in S} p_{ES}(s),$$

where S is a random subset of edit sequences of length K + 2d that can transform I into T. (We divide the (K + 2d)! term by 2^d because we cannot "remove" a note before we "add" it.)
As we increase the levels of our approximation, our log likelihood converges toward orderless NADE's, which we see in Table 1 at d = 1. Since music completion is a task with high uncertainty, the large number of low-probability predictions leads to underflow issues, which we avoid by using the log-sum-exp trick. Also, since log likelihood in this case is highly dependent on the number of notes in a piece, we compute an approximate notewise log likelihood by dividing the approximate log likelihood by the minimum number of note additions and removals needed to reconstruct the target piano roll. We do not use log likelihood to compare our approach with the Gibbs sampling used in Coconet, as Coconet uses a framewise log likelihood, which differs from our calculation [12].

Table 1. Notewise approximate log likelihood for reconstructing 10 missing notes from each test sample.

Approach          Notewise Approximate Log Likelihood
ES-Net            -0.635
Orderless NADE    -0.558

We calculate several quantitative metrics to compare the quality of generated music using our approach, orderless NADE, and Gibbs sampling. For each approach, we generate 3405 bars of music, the same number of bars as in the training data, and compare them to the training data. We generate the music by conditioning on 150 8-bar monophonic inputs. We evaluate on the following metrics designed in [1, 29]:

• PC - Number of unique pitch classes used. Notes whole octaves apart from each other (e.g., C4 and C5) belong to the same pitch class.
• P - Number of unique pitches used.
• ISR - In-scale rate, the proportion of all notes that lie in C Major (the C Major scale was chosen arbitrarily).
• PR - Polyphonic rate, the proportion of timesteps where the number of pitches being played is greater than or equal to 4.

We use pypianoroll [29] to calculate these values; a sketch of how such metrics can be computed directly from a piano roll appears below.
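As a rough illustration (plain NumPy, not the pypianoroll API), the four metrics can be computed from a binary (time × pitch) roll as follows. The `lowest_pitch` offset and threshold default are assumptions drawn from the setup described in this section.

```python
import numpy as np

C_MAJOR_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}  # C D E F G A B

def pianoroll_metrics(roll, lowest_pitch=36, polyphony_threshold=4):
    """Compute PC, P, ISR, and PR for a binary (time, pitch) piano roll.

    `lowest_pitch` is the MIDI pitch of column 0 (36 in the setup described here).
    """
    times, cols = np.nonzero(roll)
    midi_pitches = cols + lowest_pitch

    p = len(np.unique(midi_pitches))                               # unique pitches
    pc = len(np.unique(midi_pitches % 12))                         # unique pitch classes
    in_scale = np.isin(midi_pitches % 12, list(C_MAJOR_PITCH_CLASSES))
    isr = in_scale.mean() if len(midi_pitches) else 0.0            # in-scale rate
    pr = (roll.sum(axis=1) >= polyphony_threshold).mean()          # polyphonic rate
    return {"PC": pc, "P": p, "ISR": isr, "PR": pr}
```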
                  PC    P     ISR     PR
Training Data     6     46    0.541   0.917
ES-Net
Gibbs Sampling

Table 2. Quantitative metrics for each approach. Closer to the training data is better. Bold values are best among the three approaches.

We observe that our approach and the Gibbs sampling approach both produce music with characteristics similar to the dataset, while orderless NADE shows less similarity to the dataset. As seen in Table 2, our approach is the closest for all four metrics, with Gibbs sampling tying for the number of unique pitches and pitch classes used.
                  Bhattacharyya    Kolmogorov-Smirnov
                                   df     D      p
ES-Net            0.028            46     0.17   0.49
Gibbs Sampling                     46     0.13   0.83
Orderless NADE    0.049            46     0.17   0.49

Table 3. Various metrics for how far pitch appearance frequency is from the training data. Lower is better, and the best Bhattacharyya distance is bolded. The Kolmogorov-Smirnov test is unable to show a significant difference between any of the approaches and the training data.
Figure 3. Frequency of occurrence for each pitch bin. Each bin spans two pitches (i.e., one bin contains both pitch 31 and pitch 32).

In Figure 3, we plot the frequency of pitch values for each approach and compare with the distribution of pitches in the training data. We observe that the distribution of pitches for all three approaches is very similar to that of the training data. In Table 3, we evaluate the similarity of each approach's pitch appearance frequency to the training data using various metrics. We calculate the Bhattacharyya distance [30], which shows Gibbs sampling as the closest to the training data and orderless NADE as the furthest from it. We also perform Kolmogorov-Smirnov tests and are unable to show significant differences between each approach and the training data. A sketch of these distance computations is given below.
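For reference, the Bhattacharyya distance between two binned pitch-frequency distributions reduces to a single sum, and a two-sample Kolmogorov-Smirnov test is available in SciPy. The binning width, synthetic rolls, and variable names below are illustrative assumptions rather than the evaluation code.

```python
import numpy as np
from scipy import stats

def bhattacharyya_distance(p, q):
    """D_B = -ln( sum_i sqrt(p_i * q_i) ) for two discrete distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return -np.log(np.sum(np.sqrt(p * q)))

def pitch_histogram(roll, n_bins=23):
    """Note counts over pitch, two pitches per bin (46 pitch columns -> 23 bins)."""
    counts = roll.sum(axis=0)
    return counts.reshape(n_bins, -1).sum(axis=1)

rng = np.random.default_rng(0)
roll_a = (rng.random((1280, 46)) < 0.05).astype(np.uint8)   # stand-in "generated" roll
roll_b = (rng.random((1280, 46)) < 0.05).astype(np.uint8)   # stand-in "training data" roll

d_b = bhattacharyya_distance(pitch_histogram(roll_a), pitch_histogram(roll_b))
ks = stats.ks_2samp(np.nonzero(roll_a)[1], np.nonzero(roll_b)[1])
print(d_b, ks.statistic, ks.pvalue)
```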
We conducted a human opinion test in order to compare our approach against orderless NADE and Gibbs sampling. We generated 8-bar samples with a pitch range from 36 to 81. We assume 4/4 time (i.e., 4 beats per bar) and quantize to 16 time steps per bar (i.e., 1/16th notes). We treat two notes continuous in time as one note. For orderless NADE, we sample 400 times to generate samples that approximate 4 pitches per time step. Coconet optimizes the number of iterations it requires. Since our approach allows the model to both add and remove notes, there is no fixed number of iterations to run the model; instead, the model eventually stabilizes and adds or removes the same set of notes repeatedly. We sample for 10,000 iterations for our approach. When conducting the surveys, we chose a large number of iterations to ensure stabilization; through later experiments, however, we found that the output almost always stabilizes before 2000 iterations.

Each survey contained fifteen randomly chosen sets of comparisons, where each set of comparisons contained a random sample from each of the three approaches. The samples in each set were randomly ordered. All three samples in a set were conditioned on the same input track, which was also given to the participant. In order to simulate real user input, we created input tracks by taking two-bar user inputs from the Bach Doodle Dataset [31] (a dataset of real user inputs to Coconet and its resulting compositions) and repeated them four times to form eight bars.
Figure 4. Human survey evaluation ratings: (a) whether users thought a sample improved on the input; (b) user rankings for music quality; (c) user rankings for how similar a sample is to real Bach data. Bars are ordered from left to right as ES-Net, Gibbs Sampling, and Orderless NADE.

Bach Doodle ranks its inputs based on user feedback from the resulting Coconet composition; we chose an equal number of samples randomly from each feedback level. Each survey contained the same fifteen sets of comparisons. These inputs are monophonic (i.e., only one pitch per time step). For each set of comparisons, users were asked (a) whether each sample improved on the input, (b) to rank the samples based on music quality, and (c) to rank the samples based on similarity to music composed by Bach.

We received a total of 207 ratings for question (a), 211 ratings for question (b), and 213 ratings for question (c). (Some users did not answer all three questions per set of samples; partial or incomplete rankings were discarded.) For question (a), we see that all approaches are comparable and each approach almost always improved the input, as seen in Figure 4(a). For questions (b) and (c), we see in Figures 4(b) and 4(c) that our edit sequence approach is the best approach while the orderless NADE approach is the worst. We perform a Kruskal-Wallis H-test across all ratings for questions (b) and (c) and find a statistically significant difference between the three models (χ²(2) ≈ 64 for question (b) and χ²(2) ≈ 73 for question (c)). We use the Wilcoxon signed-rank test to conduct a pairwise post-hoc analysis and find a statistically significant difference between our approach and both the Gibbs sampling and orderless NADE approaches for questions (b) and (c). A sketch of these tests appears below.
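The statistical analysis described above can be run with SciPy's Kruskal-Wallis and Wilcoxon signed-rank tests; the rank arrays below are synthetic placeholders, not the survey responses.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-item rankings (1 = best, 3 = worst) for each approach.
es_net = rng.integers(1, 4, size=211)
gibbs = rng.integers(1, 4, size=211)
o_nade = rng.integers(1, 4, size=211)

# Omnibus test across the three approaches (as for questions (b) and (c)).
h_stat, p_omnibus = stats.kruskal(es_net, gibbs, o_nade)

# Pairwise post-hoc comparisons with the Wilcoxon signed-rank test
# (paired, since each rater ranked all three samples in a set).
w1, p_es_vs_gibbs = stats.wilcoxon(es_net, gibbs)
w2, p_es_vs_onade = stats.wilcoxon(es_net, o_nade)
```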
7. FUTURE WORK
We have currently trained on a limited number of datasets, both of which are based on Bach chorales. There is no reason, however, that our approach should be limited to any feature of Bach. By training on other datasets, we will be able to evaluate how well our approach generalizes.

We show that allowing the model to remove notes increases music quality, which we believe is due to the model correcting its past mistakes. During our training process, we generate random notes in order to mimic those mistakes. Rather than merely mimicking those mistakes, however, we can generate real mistakes by feeding outputs from the model back into itself. We believe that this self-adversarial training paradigm will allow the model to capture more realistic sampling mistakes and further improve performance.

Our current data representation does not convey features such as note velocity, repeated notes, or explicit note duration. These features, however, can add to the technical and emotive quality of music. We can map these new features as additional channels and concatenate this information with our existing piano roll. This new data representation will allow our model to learn from these new feature dimensions and produce more expressive and technically challenging music.

An advantage of our algorithm is the ease with which we can extend our approach to other use cases. For instance, our model currently generates fixed-length outputs, with the length determined by the training samples. In this way, we can extend user melodies up to a fixed length; however, we never explicitly train our model to extend inputs. By augmenting our dataset so that the latter portion of each sample is masked out, we can explicitly train our model to extend melodies. Then, during sampling, we can generate a fixed-length output, feed the latter portion of that output back into the model to generate a new output, concatenate the two outputs, and repeat (see the sketch below). This would allow us to extend melodies repeatedly rather than only up to a fixed length.
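A rough sketch of how this proposed extension loop could be wired up; the paper describes it only as future work, so the function signature and the `generate` callable (e.g. the sampling sketch from Section 5 with a suitably trained model) are assumptions.

```python
import numpy as np

def extend_melody(generate, seed_roll, n_segments=4, bars=8, steps_per_bar=16):
    """Repeatedly continue a melody by regenerating a window whose first half is fixed.

    `generate` is any roll-to-roll sampler, e.g. the sampling sketch from Section 5
    backed by a model trained on inputs whose latter halves are masked out.
    """
    half = (bars * steps_per_bar) // 2
    output = seed_roll.copy()
    for _ in range(n_segments):
        window = np.zeros((bars * steps_per_bar, seed_roll.shape[1]), dtype=seed_roll.dtype)
        window[:half] = output[-half:]     # condition on the latter half of the output so far
        window = generate(window)          # fill in the remaining half
        output = np.concatenate([output, window[half:]], axis=0)
    return output
```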
8. CONCLUSION
We show that by modeling the removal of notes, we can train a model to produce better music by fixing past mistakes and preventing the accumulation of errors. We discuss how our note-by-note approach allows for a finer degree of control and better human and AI collaboration. We demonstrate how to map an edit sequence representation into a piano roll representation and how we can use that to model a distribution of musical pieces. We discuss how we train our model by masking notes and adding erroneous notes, and how we sample from our model during inference. Finally, we show through quantitative metrics and human evaluation that our approach is able to generate musical compositions of better quality than orderless NADE and Gibbs sampling.
9. REFERENCES

[1] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, "MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[2] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," arXiv preprint arXiv:1703.10847, 2017.
[3] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, "Music Transformer: Generating music with long-term structure," 2018.
[4] R. Child, S. Gray, A. Radford, and I. Sutskever, "Generating long sequences with sparse transformers," arXiv preprint arXiv:1904.10509, 2019.
[5] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," arXiv preprint arXiv:1901.02860, 2019.
[6] C. Payne, "MuseNet," https://openai.com/blog/musenet, 2019.
[7] J. Wu, C. Hu, Y. Wang, X. Hu, and J. Zhu, "A hierarchical recurrent neural network for symbolic melody generation," IEEE Transactions on Cybernetics, 2019.
[8] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Advances in Neural Information Processing Systems, 2015, pp. 1171-1179.
[9] F. Huszár, "How (not) to train your generative model: Scheduled sampling, likelihood, adversary?" arXiv preprint arXiv:1511.05101, 2015.
[10] A. M. Lamb, A. G. A. P. Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio, "Professor forcing: A new algorithm for training recurrent networks," in Advances in Neural Information Processing Systems, 2016, pp. 4601-4609.
[11] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[12] C.-Z. A. Huang, T. Cooijmans, A. Roberts, A. Courville, and D. Eck, "Counterpoint by convolution," arXiv preprint arXiv:1903.07227, 2019.
[13] H. Larochelle and I. Murray, "The neural autoregressive distribution estimator," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 29-37.
[14] B. Uria, M.-A. Côté, K. Gregor, I. Murray, and H. Larochelle, "Neural autoregressive distribution estimation," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 7184-7220, 2016.
[15] B. Uria, I. Murray, and H. Larochelle, "A deep and tractable density estimator," in International Conference on Machine Learning, 2014, pp. 467-475.
[16] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: A steerable model for Bach chorales generation," arXiv preprint arXiv:1612.01010, 2017.
[17] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.
[18] M. Stern, W. Chan, J. Kiros, and J. Uszkoreit, "Insertion Transformer: Flexible sequence generation via insertion operations," arXiv preprint arXiv:1902.03249, 2019.
[19] W. Chan, N. Kitaev, K. Guu, M. Stern, and J. Uszkoreit, "KERMIT: Generative insertion-based modeling for sequences," arXiv preprint arXiv:1906.01604, 2019.
[20] J. Gu, C. Wang, and J. Zhao, "Levenshtein Transformer," in Advances in Neural Information Processing Systems, 2019, pp. 11181-11191.
[21] J.-P. Briot, G. Hadjeres, and F.-D. Pachet, "Deep learning techniques for music generation: A survey," arXiv preprint arXiv:1709.01620, 2017.
[22] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "High-dimensional sequence transduction," IEEE, 2013, pp. 3178-3182.
[23] H. Chu, R. Urtasun, and S. Fidler, "Song from PI: A musically plausible network for pop music generation," arXiv preprint arXiv:1611.03477, 2016.
[24] B. L. Sturm, J. F. Santos, O. Ben-Tal, and I. Korshunova, "Music transcription modelling and composition using deep learning," arXiv preprint arXiv:1604.08723, 2016.
[25] O. Mogren, "C-RNN-GAN: Continuous recurrent neural networks with adversarial training," arXiv preprint arXiv:1611.09904, 2016.
[26] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence generative adversarial nets with policy gradient," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[27] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241.
[28] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[29] H.-W. Dong, W.-Y. Hsiao, and Y.-H. Yang, "Pypianoroll: Open source Python package for handling multitrack pianorolls," Proc. ISMIR, late-breaking paper; [Online] https://github.com/salu133445/pypianoroll, 2018.
[30] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bull. Calcutta Math. Soc., vol. 35, pp. 99-109, 1943.
[31] C.-Z. A. Huang, C. Hawthorne, A. Roberts, M. Dinculescu, J. Wexler, L. Hong, and J. Howcroft, "The Bach Doodle: Approachable music composition with machine learning at scale," in Proc. of the 20th Int. Society for Music Information Retrieval Conf., 2019.