PIANOTREE VAE: Structured Representation Learning for Polyphonic Music
Ziyu Wang, Yiyi Zhang, Yixiao Zhang, Junyan Jiang, Ruihan Yang, Junbo Zhao, Gus Xia
Music X Lab, Computer Science Department, NYU Shanghai; Center for Data Science, New York University; Computer Science Department, Zhejiang University
{ziyu.wang, yz2092, yixiao.zhang, jj2731, ry649, j.zhao, gxia}@nyu.edu
ABSTRACT
The dominant approach for music representation learning involves the deep unsupervised model family of variational autoencoders (VAE). However, most, if not all, viable attempts on this problem have largely been limited to monophonic music. The polyphonic counterpart, normally composed of richer modality and more complex musical structures, has yet to be addressed in the context of music representation learning. In this work, we propose PianoTree VAE, a novel tree-structured extension of the VAE aimed at polyphonic music learning. The experiments prove the validity of PianoTree VAE via (i) semantically meaningful latent codes for polyphonic segments; (ii) more satisfactory reconstruction, aside from the decent geometry learned in the latent space; and (iii) this model's benefits to a variety of downstream music generation tasks.

Code and demos can be accessed via https://github.com/ZZWaang/PianoTree-VAE.

1. Introduction

Unsupervised machine learning has led to a marriage of symbolic learning and vectorized representation learning [1–3]. In the computer music community, MusicVAE [4] enables interpolation in the learned latent space to render smooth music transitions. The EC²-VAE [5] manages to disentangle certain interpretable factors in music and also provides a manipulable generation pathway based on these factors. Pati et al. [6] further utilize recurrent networks to learn music representations for longer-term coherence.

Unfortunately, most of this success has been limited to monophonic music. Generalizing these learning frameworks to polyphonic music is non-trivial, due to its much higher dimensionality and more complicated musical syntax. The commonly adopted MIDI-like event sequences or piano-roll formats, fed to either
recurrent or convolutional networks, have fallen short in learning good representations, which usually leads to unsatisfactory generation results [7–9]. In this paper, we hope to pioneer the development of this challenging task. To begin with, we conjecture a proper set of inductive biases for the desired framework: (i) a sparse encoding of music as the model input; (ii) a neural architecture that incorporates the hierarchical structure of polyphonic music (i.e., musical syntax).

Guided by the aforementioned design principles, we propose PianoTree VAE, a hierarchical representation learning model under the VAE framework. We adopt a tree-structured musical syntax that reflects the hierarchy of musical concepts, as shown in Figure 1. In a top-down order: we define a score (indicated by the yellow rectangle) as a series of simu_note events (indicated by the green rectangles), a simu_note as multiple note events sharing the same onset (indicated by the blue rectangles), and each note as having several attributes such as pitch and duration. In this paper, we focus on a simple yet common form of polyphonic music: piano score, in which each note has only pitch and duration attributes. For future work, this syntax can be generalized to multiple instruments and expressive performance by adding extra attributes such as voice, expressive timing, dynamics, etc.

Figure 1: An illustration of the proposed polyphonic syntax.

The whole neural architecture of PianoTree VAE can be seen as a tree. Each node represents the embedding of either a score, simu_note, or note, where a higher-level representation has a larger receptive field. The edges are bidirectional: a recurrent module is applied either to encode the children into the parent or to decode the parent to generate its children.

Through extensive evaluations, we show that PianoTree VAE yields semantically more meaningful latent representations and further downstream generation quality gains on top of the current state-of-the-art solutions.

2. Related Work
The complex hierarchical nature of music data has been studied for nearly a century (e.g., GTTM [10], Schenkerian Analysis [11], and their follow-up works [12–15]). However, the emerging deep representation-learning models still lack compatible solutions for dealing with complex musical structure. In this section, we first review different types of polyphonic music generation in Section 2.1. After that, we discuss some popular deep music generative models, indexed by their compatible data structures, from Section 2.2 to Section 2.4.
2.1 Types of Polyphonic Music Generation

In the context of deep music generation, polyphony can refer to three types of music: 1) multiple monophonic parts (e.g., a four-part chorus), 2) a single part of a polyphonic instrument (e.g., a piano sonata), and 3) multiple parts of polyphonic instruments (e.g., a symphony).

The first type of polyphonic music can be created by simply extending the number of voices in monophonic music generation with some inter-voice constraints. Representative systems in this category include DeepBach [16], XiaoIce [17], and Coconet [18]. Music Transformer [19] and the proposed PianoTree VAE both focus on the generation of the second type of polyphony, which is a much more difficult task: polyphonic pieces under the second definition no longer have a fixed number of "voices" and consist of more complex textures. The third type of polyphony can be regarded as an extension of the second type, and we leave it for future work.
2.2 Piano-roll

Piano-roll and its variations [7, 20–22] view polyphonic music as 3-D (one-hot) tensors, in which the first two dimensions denote time and pitch and the third dimension indicates whether the token is an onset, sustain, or rest. A common way for deep learning models to encode/decode a piano-roll is to use recurrent layers along the time axis, while the pitch-axis relations are modeled in various ways [20, 21, 23]. Another method is to regard a piano-roll as an image with three channels (onset, sustain, and rest) and apply convolutional layers [7, 22].

Through the proposal of PianoTree VAE, we argue that a major way to improve current deep learning models is to utilize the built-in priors (intrinsic structure) of the musical data. In our work, we primarily use the sparsity and hierarchical priors.
2.3 MIDI-like Event Sequences

MIDI-like event sequences were first used in deep music generation by PerformanceRNN [24] and Multi-track MusicVAE [9], and were then broadly applied in transformer-based generation [19, 25, 26]. This line of work leverages the sparsity of polyphonic data to efficiently flatten polyphonic music into an array of events. The event vocabulary is usually about triple the size of the MIDI pitch vocabulary, including "note-on" and "note-off" events for the 128 MIDI pitches, "time shifts", and so on.

However, the MIDI-like event format lacks proper flexibility; a few operations are made difficult by its very nature. For instance, when adding or deleting notes, numerous "time shift" tokens often must be merged or split while the corresponding "note-on" or "note-off" tokens change as well, which makes the trained model inefficient for the intended generation tasks (see the schematic example at the end of this subsection). In addition, this format risks generating illegal sequences, say a "note-on" message without a paired "note-off" message.

Similarly, we see the note-based approaches [27, 28], in which polyphonic music is represented as a sequence of note tuples, as an alternative to the MIDI-like methods. This representation resolves the illegal-generation problem but still does not reveal much of the intrinsic music structure. We argue that our work improves on the note-based approaches by utilizing deeper musical structures implied by the data (see Section 3.1 for details).
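To make the fragility of the event format concrete, here is a schematic Python illustration of the non-local edit it forces. The token names are illustrative only and do not come from any specific implementation in the paper.

```python
# Schematic MIDI-like event encoding of two notes that share an onset:
# C4 (MIDI 60) lasting 4 sixteenth-note steps and E4 (MIDI 64) lasting 2.
events = [
    "NOTE_ON<60>", "NOTE_ON<64>",  # both notes start
    "TIME_SHIFT<2>",               # advance 2 steps
    "NOTE_OFF<64>",                # E4 ends
    "TIME_SHIFT<2>",               # advance 2 more steps
    "NOTE_OFF<60>",                # C4 ends
]

# Deleting E4 is not a local edit: besides removing its on/off tokens,
# the two TIME_SHIFT tokens must also be merged into one.
events_without_e4 = ["NOTE_ON<60>", "TIME_SHIFT<4>", "NOTE_OFF<60>"]
```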
2.4 Graph-Based Models

Recently, we have seen a trend of using graph neural networks (GNNs) [29] to represent polyphonic scores [30], in which each vertex represents a note and the edges represent different musical relations. Although GNN-based models offer sparse representation-learning capacity, they are limited by the specification of the graph structure design, and it is nontrivial to generalize them for score generation.
3. Method

3.1 Data Structure

We first define a data structure to represent a polyphonic music segment, which contains two components: 1) a surface structure, a data format that represents the music observation, and 2) a deep structure, a tree (containing score, simu_note, and note nodes) showing the syntactic construct of the music segment.

Each music segment lasts $T$ time steps, with a quarter beat as the shortest unit. We further use $K_t$, where $1 \le t \le T$, to denote the number of notes whose onset is $t$. The current model uses $T = 32$, i.e., each music segment is 8 beats long.

The surface structure is a nested array of pitch-duration tuples, denoted by $\{(p_{t,k}, d_{t,k}) \mid 1 \le t \le T,\ 1 \le k \le K_t\}$. Here, $(p_{t,k}, d_{t,k})$ is the $k$-th lowest note starting at time step $t$. The pitch attribute $p_{t,k}$ is a 128-D one-hot vector corresponding to the 128 MIDI pitches. The duration attribute $d_{t,k}$ encodes the duration, ranging from 1 to $T$, using a $\log_2 T$-bit binary vector. For example, when $T = 32$ ($\log_2 T = 5$), '00000' represents a 16th note, '00001' an 8th note, '00010' a dotted 8th note, and so forth. The base-2 design is inspired by the similar binary relation among different note values in Western musical notation. The bottom part of Figure 2 illustrates the surface structure of the music example in Figure 1. Note that this data structure is a sparse encoding of music, and it eliminates illegal tokens since every possible nested array corresponds to a piece of music.

Figure 2: An illustration of the PianoTree data structure encoding the music example in Figure 1.

We further build a syntax tree to reveal the hierarchical relations of the observation. First, for $1 \le t \le T,\ 1 \le k \le K_t$, we define note_{t,k} as the summary (i.e., embedding) of $(p_{t,k}, d_{t,k})$; these form the bottom layer of the tree. Then, for $1 \le t \le T$, we define simu_note_t as the summary of note_{t,k}, $1 \le k \le K_t$; these form the middle layer of the tree. Finally, we define score as the summary of simu_note_t, $1 \le t \le T$, which is the root of the tree. The upper part of Figure 2 illustrates the deep structure built upon the surface structure.

The syntax tree, i.e., the deep structure, has both musical and linguistic motivations. In terms of music, note, simu_note, and score roughly reflect the musical concepts of a note, a chord, and a grouping. In terms of linguistics, the tree is analogous to a constituency tree, with the surface structure being the terminal nodes and the deep structure being the non-terminals. Recent studies in natural language processing have revealed that incorporating natural language syntax results in better semantics modeling [31, 32].
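To make the surface structure concrete, the following is a minimal Python sketch of the pitch and duration encodings described above. The helper names are ours, not from the released codebase.

```python
import numpy as np

T = 32          # time steps per segment (one quarter-beat each, i.e., 8 beats)
N_PITCH = 128   # MIDI pitch vocabulary
DUR_BITS = 5    # log2(T) bits for the duration attribute

def encode_duration(steps: int) -> np.ndarray:
    """Encode a duration of 1..T steps as a 5-bit binary vector.
    E.g., 1 step -> 00000 (16th note), 2 -> 00001 (8th), 3 -> 00010 (dotted 8th)."""
    assert 1 <= steps <= T
    bits = format(steps - 1, f"0{DUR_BITS}b")
    return np.array([int(b) for b in bits], dtype=np.float32)

def encode_pitch(midi_pitch: int) -> np.ndarray:
    """128-D one-hot pitch vector."""
    onehot = np.zeros(N_PITCH, dtype=np.float32)
    onehot[midi_pitch] = 1.0
    return onehot

# Surface structure: for each onset t, the notes starting at t, sorted by pitch.
# Example: a C major triad on beat 1 lasting one beat (4 steps),
# followed by a single G4 eighth note on beat 2.
surface = {
    0: [(encode_pitch(60), encode_duration(4)),   # C4
        (encode_pitch(64), encode_duration(4)),   # E4
        (encode_pitch(67), encode_duration(4))],  # G4
    4: [(encode_pitch(67), encode_duration(2))],  # G4, 8th note
}
```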
3.2 Model Architecture

We use the surface structure of polyphonic music as the model input; the VAE architecture is built upon the deep structure. We denote a music segment in the proposed surface structure as $x$ and the latent code as $z$, which conforms to a standard Gaussian prior denoted by $p(z)$.

The encoder models the approximated posterior $q_\phi(z|x)$ in a bottom-up order of the deep structure. First, note embeddings are computed through a linear transform of the pitch-duration tuples. Second, the note embeddings (sorted by pitch) are summarized into simu_note using a bi-directional GRU [33] by concatenating the last hidden states on both ends. With the same method, the simu_note embeddings (sorted by onset) are summarized into score by another bi-directional GRU. We assume an isotropic Gaussian posterior, whose mean and log standard deviation are computed by a linear mapping of score. Algorithm 1 shows the details.

The decoder models $p_\theta(x|z)$ in a top-down order of the deep structure, almost mirroring the encoding process. We use a uni-directional time-axis GRU to decode the simu_note embeddings, another uni-directional (pitch-axis) GRU to decode the note embeddings, a fully-connected layer to decode the pitch attribute, and finally another GRU to decode the duration attribute starting from the most significant bit. Algorithm 2 shows the details.

Figure 3: An overview of the model architecture. The recurrent layers are represented by rectangles, the fully-connected (FC) layers by trapezoids, and the note, simu_note, and score events by circles.

We use the ELBO (evidence lower bound) [34] as our training objective. Formally,

$$\mathcal{L}(\phi, \theta; x) = -\mathbb{E}_{z \sim q_\phi}\big[\log p_\theta(x|z)\big] + \beta\, \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big), \tag{1}$$

where $\beta$ is a balancing parameter used in $\beta$-VAE [35].

We denote the embedding sizes of note, simu_note, and score as $e_n$, $e_{sn}$, and $e_{sc}$; the dimension of the latent space as $d_z$; and the hidden dimensions of the pitch-axis, time-axis, and duration GRUs as $h_p$, $h_t$, and $h_d$, respectively. In this work, we report results with the following model size: $e_n = 128$, $e_{sn} = h_{p,\mathrm{dec}} = 2 \times h_{p,\mathrm{enc}} = 512$, $e_{sc} = h_{t,\mathrm{dec}} = 2 \times h_{t,\mathrm{enc}} = 1024$, $h_{d,\mathrm{dec}} = 64$, and $d_z = 512$.

Algorithm 1: The PianoTree Encoder. n, sn, and sc are short for note, simu_note, and score.
  /* gru(·): passes a sequence to a bi-directional GRU and outputs the concatenation of the last hidden states from both ends. */
  input: PianoTree x = {(p_{t,k}, d_{t,k}), 1 ≤ t ≤ T, 1 ≤ k ≤ K_t}
  foreach t, k do n_{t,k} ← emb_enc(p_{t,k}, d_{t,k});
  foreach t do sn_t ← gru_enc^pitch(n_{t,1:K_t});
  sc ← gru_enc^time(sn_{1:T});
  µ ← fc_µ(sc); σ ← exp(fc_σ(sc));
  return q(z|x) = N(µ, σ);

Algorithm 2: The PianoTree Decoder. We still use the abbreviations n, sn, and sc defined in Algorithm 1.
  /* gru(·): same as in Algorithm 1. */
  /* grucell(·, ·): updates the hidden state using the current input and the previous hidden state. */
  input: latent representation z
  sc ← z;
  foreach t do sn_t ← grucell_dec^time(sn_{t−1}, sc);
  foreach t, k do n_{t,k} ← grucell_dec^pitch(n_{t,k−1}, sn_t);
  foreach t, k do p_{t,k} ← fc_dec(n_{t,k}); d_{t,k} ← gru_dec^dur(n_{t,k});
  return x = {(p_{t,k}, d_{t,k}), 1 ≤ t ≤ T, 1 ≤ k ≤ K_t};
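Below is a minimal PyTorch sketch of the encoder in Algorithm 1. The dimensions follow the model sizes stated above; the module names, batching, and looping strategy are our own assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PianoTreeEncoder(nn.Module):
    """Sketch of Algorithm 1: note -> simu_note -> score -> q(z|x)."""
    def __init__(self, e_n=128, h_p=256, h_t=512, d_z=512):
        super().__init__()
        self.note_emb = nn.Linear(128 + 5, e_n)       # pitch one-hot + 5-bit duration
        self.pitch_gru = nn.GRU(e_n, h_p, batch_first=True,
                                bidirectional=True)   # notes -> simu_note
        self.time_gru = nn.GRU(2 * h_p, h_t, batch_first=True,
                               bidirectional=True)    # simu_notes -> score
        self.fc_mu = nn.Linear(2 * h_t, d_z)
        self.fc_logstd = nn.Linear(2 * h_t, d_z)

    def forward(self, notes_per_onset):
        # notes_per_onset: list of T tensors, each (K_t, 133), notes sorted by pitch
        simu_notes = []
        for notes in notes_per_onset:
            n = self.note_emb(notes).unsqueeze(0)              # (1, K_t, e_n)
            _, h = self.pitch_gru(n)                           # (2, 1, h_p)
            simu_notes.append(torch.cat([h[0], h[1]], dim=-1)) # (1, 2*h_p)
        sn = torch.stack(simu_notes, dim=1)                    # (1, T, 2*h_p)
        _, h = self.time_gru(sn)                               # (2, 1, h_t)
        score = torch.cat([h[0], h[1]], dim=-1)                # (1, 2*h_t)
        # Mean and std of the isotropic Gaussian posterior; sample z by
        # reparameterization: z = mu + std * eps.
        return self.fc_mu(score), self.fc_logstd(score).exp()
```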
4. Experiments

4.1 Dataset

We collect around 5K classical and popular piano pieces from Musicalion and the POP909 dataset [36]. We keep only the pieces in 2/4 and 4/4 meters and cut them into 8-beat music segments (i.e., each data sample in our experiments contains 32 time steps at sixteenth-note resolution). In all, we have 19.8K samples. We randomly split the dataset (at song level) into a training set (90%) and a test set (10%). All training samples are further augmented by transposition to all 12 keys.

4.2 Baseline Models

We train four types of baseline models in total, using the piano-roll (Section 2.2) and MIDI-like event (Section 2.3) data structures. As a piano-roll can be regarded as either a sequence or a 2-dimensional image, we couple it with three neural encoder-decoder architectures: a recurrent VAE (pr-rnn), a convolutional VAE (pr-cnn), and a fully-connected VAE (pr-fc). For the MIDI-like events, we couple them with a recurrent VAE model (midi-seq). All models share the same latent space dimension ($d_z = 512$). Specifically,

• The piano-roll recurrent VAE (pr-rnn) model is similar to a 2-bar MusicVAE proposed in [4]. The hidden dimensions of the GRU encoder and decoder are both 1024.
• The piano-roll convolutional VAE (pr-cnn) adopts a convolutional-deconvolutional architecture. The encoder contains 8 convolutional layers, with strided convolution performed at the 3rd, 5th, 7th, and 8th layers. The decoder adopts the deconvolution operations in reverse order.
• The piano-roll fully-connected VAE (pr-fc) uses a time-distributed 256-dimensional embedding layer, followed by 3 fully-connected layers with hidden dimensions [1024, 768] for the encoder. The decoder adopts the counter-operations in reverse order.
• The MIDI-like event recurrent VAE (midi-seq) adopts a recurrent model structure similar to pr-rnn. Here, the event vocabulary contains 128 "note-on", 128 "note-off", and 32 "time shift" tokens. The embedding size of a single MIDI event is 128. The hidden dimensions of the encoder GRU and decoder GRU are 512 and 1024, respectively.
4.3 Training

For all models, we set the batch size to 128 and use the Adam optimizer [37] with a learning rate starting from 1e-3 and exponentially decaying to 1e-5. For PianoTree VAE, we use teacher forcing [38] in the decoder's time-axis and pitch-axis GRUs; for the other recurrent baselines, we use teacher forcing in the decoders. The teacher forcing rates start from 0.8 and decay to 0.0. PianoTree VAE converges within 6 epochs, and the baseline models converge in approximately 40 to 60 epochs.
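A sketch of this optimization setup is shown below. The decay constant (gamma), the number of steps, and the linear teacher-forcing schedule are assumptions; the paper states only the endpoints of both schedules.

```python
import torch
import torch.nn as nn

model = nn.GRU(10, 10)  # stand-in for the full PianoTree VAE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9999)

tf_start, tf_end, n_steps = 0.8, 0.0, 10_000
for step in range(n_steps):
    # Anneal the teacher forcing rate from 0.8 to 0.0 (linear decay assumed).
    tf_rate = tf_start + (tf_end - tf_start) * step / n_steps
    use_teacher_forcing = torch.rand(()).item() < tf_rate
    # ... forward pass (feeding ground truth when use_teacher_forcing is True),
    #     ELBO loss as in Eq. (1), loss.backward(), optimizer.step() ...
    if scheduler.get_last_lr()[0] > 1e-5:  # stop decaying at 1e-5
        scheduler.step()
```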
Table 1: Objective evaluation results on reconstruction criteria (precision, recall, and F1 of note onset and duration) for PianoTree, our proposed method, and the baseline models midi-seq, pr-rnn, pr-cnn, and pr-fc described in Section 4.2.
4.4 Objective Evaluation

The objective evaluation compares the models' reconstruction accuracy in terms of pitch onsets and note duration [39, 40], which are commonly used measurements in music information retrieval tasks. For note duration accuracy, we only consider the notes whose onset and pitch are reconstructed correctly. Table 1 summarizes the results, where we see that PianoTree VAE (the 1st column) is better than the others in terms of F1 score for both criteria.

4.5 Latent Space Visualization

Figure 4: A visualization of note embeddings after dimensionality reduction using PCA.

Figure 4 shows the latent note space, plotting note embeddings after dimensionality reduction by PCA (with the three largest principal components retained). Each colored dot is a note embedding; a total of 1344 samples are displayed, with note pitches ranging from C-1 to C-8 and note durations from a sixteenth note to a whole note.

We see that the note embeddings have the desired geometric properties. Figures 4(a) and (b) show that, at a macro level, notes with different pitches are well sorted and form a "helix" in the 3-D space. Figure 4(c) further shows that, at a micro level, the 16 different note durations (with the same pitch) form a "fractal parallelogram" due to the binary encoding of the duration attribute. One advantage of this encoding is its translation-invariance property. For example, the duration difference between the upper-left cluster and the lower-left cluster is 8 semiquavers, and so is the difference between the upper-right and lower-right clusters. The same property applies to the four smaller-scale parallelograms.
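The projection itself is straightforward; a minimal sketch is given below. The random array stands in for the real (1344, 128) matrix of note embeddings collected from the trained encoder, and the plotting details are our assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in for the 1344 note embeddings from the trained encoder's
# note-embedding layer (each embedding is 128-D).
note_embeddings = np.random.randn(1344, 128)

# Keep the three largest principal components, as in Figure 4.
coords = PCA(n_components=3).fit_transform(note_embeddings)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], s=4)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
plt.show()
```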
Figure 5: A visualization of simu_note embeddings after dimensionality reduction using PCA.

Figure 5 visualizes the latent chord space, plotting simu_note embeddings under PCA dimensionality reduction. Each colored cluster corresponds to a chord label realized in 343 different ways (we consider all possible pitch combinations within three octaves, with a minimum of 3 notes and a maximum of 9 notes). The duration of all chords is one beat.

The geometric relationships among different chords are consistent and human-interpretable. Specifically, Figure 5(a) shows the distribution of the 12 major chords, which are clustered into four groups; unfolding the circle in the counterclockwise direction reveals the circle of fifths. Figure 5(b) visualizes the seven triads of the C major scale, which form a ring in the order of the 1-3-5-7-2-4-6 scale degrees in the counterclockwise direction.
4.6 Latent Space Interpolation

Latent space traversal [4, 5, 41] is a popular technique for demonstrating model generalization and the smoothness of the learned latent manifold. When interpolating from one music piece to another in the latent space, new pieces can be generated by mapping the representations back to the signals. If a VAE is well trained, the generated pieces will sound natural and form a smooth transition.

To this end, we invite people to subjectively rate the models through a double-blind online survey. During the survey, the subjects first listen to a pair of music segments and then to 5 versions of interpolation, each generated by one of the models listed in Table 1. Each pair of music segments is randomly selected, and the interpolation is achieved using SLERP [42] (sketched in code below). Since the experiment requires careful listening and a long survey could decrease the quality of the answers, each subject is asked to rate only 3 pairs of music, i.e., 3 × 5 = 15 interpolations in random order. After listening to the 5 interpolations of each pair, subjects are asked to select the two best versions: one in terms of overall musicality and the other in terms of smoothness of transition.

A total of n = 33 subjects (12 females and 21 males) with different music backgrounds completed the survey. The aggregated results (Figure 6) show that the interpolations generated by our model are rated better than those generated by the baselines, in terms of both overall musicality and smoothness of transition. In the figure, different colors represent different models (the blue bars being our model and the other colors the baselines), and the height of each bar represents the percentage of votes for the best candidate.

Figure 6: Subjective evaluation results of latent space interpolation.
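For reference, here is the standard SLERP formula in NumPy. The `encode`/`decode` stand-ins in the comment are hypothetical wrappers around the trained VAE, not names from the paper.

```python
import numpy as np

def slerp(z1: np.ndarray, z2: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical linear interpolation between two latent codes (alpha in [0, 1])."""
    omega = np.arccos(np.clip(
        np.dot(z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)), -1.0, 1.0))
    so = np.sin(omega)
    if so < 1e-8:                      # nearly parallel: fall back to lerp
        return (1.0 - alpha) * z1 + alpha * z2
    return (np.sin((1.0 - alpha) * omega) / so) * z1 \
         + (np.sin(alpha * omega) / so) * z2

# Interpolate between the latent codes of two segments, then decode each point:
# zs = [slerp(encode(x1), encode(x2), a) for a in np.linspace(0, 1, 8)]
# segments = [decode(z) for z in zs]
```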
4.7 Downstream Music Generation

In this section, we further explore whether the polyphonic representation helps with long-term music generation when coupled with standard downstream sequence prediction models. (Similar tasks have been applied to monophonic music in [43] and [6].)

The generation task is designed in the following way: given 4 measures of piano composition, we predict the next 4 measures using a Transformer decoder (as in [44]). We compare three different music representations: the MIDI-like event sequence (Section 2.3), pre-trained (decoder) simu_note embeddings, and the latent vector z of every 2-measure music segment (without overlap). Here z is the mean of the approximated posterior from the encoder. For all three representations, we use the same Transformer decoder architecture (output dimension = 128, number of layers = 6, and number of heads = 8) with the same training procedure; only the loss functions are adjusted to the representations: cross-entropy loss is applied to the midi-event tokens, and MSE loss is applied to both the simu_note embeddings and the latent vector z. (A sketch of the latent-vector variant follows at the end of this section.) We use the same dataset mentioned in Section 4.1 and cut the original piano pieces into subsequent 8-measure clips for the generation task, again keeping 90% for training and 10% for testing.

We then invited people to subjectively rate the different generations through a double-blind online survey (similar to the one in Section 4.6). Subjects are asked to listen to and rate 6 music clips, each of which contains 3 versions of 8-measure generations using the different note representations. Subjects are told that the first 4 measures are given and the rest are generated by the machine. For each music clip, subjects rate it based on creativity, naturalness, and musicality.

A total of n = 48 subjects (20 females and 28 males) with different music backgrounds participated in the survey. Figure 7 summarizes the results, where the heights of the bars represent the means of the ratings and the error bars represent the confidence intervals computed via within-subject ANOVA [45]. The results show that simu_note and the latent vector z perform significantly better than the midi-event tokens in terms of all three criteria (p < 0.005).

Figure 7: Subjective evaluation results of downstream music generation.

Besides the aforementioned generation task, we also iteratively feed the generated 4-measure music clips back into the model to obtain longer music compositions. Figure 8 shows a comparison of 16-measure generation results using all three representations. The first 4 bars are selected from the test set, and the subsequent 12 bars are generated by the models. Generally speaking, using simu_note or the latent vector z as the data representation yields more coherent music compositions. Furthermore, we notice that long generations using the simu_note representation tend to repeat previous steps in terms of both chords and rhythms, while generations using the latent vector z usually contain more variations.

Figure 8: Long music generations given the first 4 measures. (a) A sample generated using midi-event tokens. (b) A sample generated using simu_note. (c) A sample generated using the latent vector z.
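The following is a sketch of the latent-vector variant of this downstream model, assuming PyTorch. The paper specifies only the output dimension (128), 6 layers, 8 heads, and an MSE loss on z; the 512-to-128 projections and the causally masked self-attention stack (a decoder-only Transformer) are our assumptions.

```python
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """Predicts the next 2-measure latent code z given the preceding ones."""
    def __init__(self, d_z=512, d_model=128, n_layers=6, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(d_z, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, d_z)

    def forward(self, z_seq):                                  # (B, L, d_z)
        x = self.in_proj(z_seq)
        # Causal mask so each position attends only to earlier ones.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.out_proj(self.backbone(x, mask=mask))

# Next-step prediction with MSE, as described above (toy batch of z sequences).
model = LatentPredictor()
z_seq = torch.randn(4, 8, 512)
loss = nn.functional.mse_loss(model(z_seq[:, :-1]), z_seq[:, 1:])
```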
5. Conclusion

In conclusion, we proposed PianoTree VAE, a novel representation-learning model tailored for polyphonic music. The key design of the model is to incorporate the sparsity and hierarchical priors into both the music data structure and the model architecture. Experiments show that with such inductive biases, PianoTree VAE achieves better reconstruction, interpolation, and downstream generation, together with strong model interpretability. In the future, we plan to extend PianoTree VAE to more general musical structures, such as motif development and multi-part polyphony.
References

[1] T. Zhao, R. Zhao, and M. Eskenazi, "Learning discourse-level diversity for neural dialog models using conditional variational autoencoders," arXiv preprint arXiv:1703.10960, 2017.
[2] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, "Generating sentences from a continuous space," arXiv preprint arXiv:1511.06349, 2015.
[3] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, "A recurrent latent variable model for sequential data," in Advances in Neural Information Processing Systems, 2015, pp. 2980–2988.
[4] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, "A hierarchical latent vector model for learning long-term structure in music," arXiv preprint arXiv:1803.05428, 2018.
[5] R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang, and G. Xia, "Deep music analogy via latent representation disentanglement," arXiv preprint arXiv:1906.03626, 2019.
[6] A. Pati, A. Lerch, and G. Hadjeres, "Learning to traverse latent spaces for musical score inpainting," arXiv preprint arXiv:1907.01164, 2019.
[7] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, "MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," arXiv preprint arXiv:1703.10847, 2017.
[9] I. Simon, A. Roberts, C. Raffel, J. Engel, C. Hawthorne, and D. Eck, "Learning a latent space of multitrack measures," arXiv preprint arXiv:1806.00195, 2018.
[10] F. Lerdahl and R. S. Jackendoff, A Generative Theory of Tonal Music. MIT Press, 1996.
[11] J. Rothgeb, Introduction to the Theory of Heinrich Schenker: The Nature of the Musical Work of Art. New York: Longman, 1982.
[12] M. Hamanaka, K. Hirata, and S. Tojo, "Implementing 'A Generative Theory of Tonal Music'," Journal of New Music Research, vol. 35, no. 4, pp. 249–277, 2006.
[13] ——, "σGTTM III: Learning-based time-span tree generator based on PCFG," in International Symposium on Computer Music Multidisciplinary Research. Springer, 2015, pp. 387–404.
[14] S. W. Smoliar, "A computer aid for Schenkerian analysis," in Proceedings of the 1979 Annual Conference, 1979, pp. 110–115.
[15] A. Marsden, "Schenkerian analysis by computer: A proof of concept," Journal of New Music Research, vol. 39, no. 3, pp. 269–289, 2010.
[16] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: A steerable model for Bach chorales generation," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1362–1371.
[17] H. Zhu, Q. Liu, N. J. Yuan, C. Qin, J. Li, K. Zhang, G. Zhou, F. Wei, Y. Xu, and E. Chen, "XiaoIce Band: A melody and arrangement generation framework for pop music," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 2837–2846.
[18] C.-Z. A. Huang, T. Cooijmans, A. Roberts, A. Courville, and D. Eck, "Counterpoint by convolution," in International Society for Music Information Retrieval (ISMIR), 2017.
[19] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck, "Music Transformer: Generating music with long-term structure," arXiv preprint arXiv:1809.04281, 2018.
[20] G. Brunner, Y. Wang, R. Wattenhofer, and J. Wiesendanger, "JamBot: Music theory aware chord based generation of polyphonic music with LSTMs," IEEE, 2017, pp. 519–526.
[21] H. H. Mao, T. Shin, and G. Cottrell, "DeepJ: Style-specific music generation," IEEE, 2018, pp. 377–382.
[22] E. S. Koh, S. Dubnov, and D. Wright, "Rethinking recurrent latent variable model for music composition," IEEE, 2018, pp. 1–6.
[23] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," arXiv preprint arXiv:1206.6392, 2012.
[24] I. Simon and S. Oore, "Performance RNN: Generating music with expressive timing and dynamics," https://magenta.tensorflow.org/performance-rnn, 2017.
[25] C. Donahue, H. H. Mao, Y. E. Li, G. W. Cottrell, and J. McAuley, "LakhNES: Improving multi-instrumental music generation with cross-domain pre-training," arXiv preprint arXiv:1907.04868, 2019.
[26] Y.-S. Huang and Y.-H. Yang, "Pop Music Transformer: Generating music with rhythm and harmony," arXiv preprint arXiv:2002.00212, 2020.
[27] O. Mogren, "C-RNN-GAN: Continuous recurrent neural networks with adversarial training," arXiv preprint arXiv:1611.09904, 2016.
[28] C. Hawthorne, A. Huang, D. Ippolito, and D. Eck, "Transformer-NADE for piano performances."
[29] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[30] D. Jeong, T. Kwon, Y. Kim, and J. Nam, "Graph neural network for music score data and modeling expressive piano performance," in International Conference on Machine Learning, 2019, pp. 3060–3070.
[31] C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith, "Recurrent neural network grammars," arXiv preprint arXiv:1602.07776, 2016.
[32] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," arXiv preprint arXiv:1503.00075, 2015.
[33] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[34] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[35] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," 2016.
[36] Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai, X. Gu, and G. Xia, "POP909: A pop-song dataset for music arrangement generation," in Proceedings of the 21st International Conference on Music Information Retrieval (ISMIR), virtual conference, 2020.
[37] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[38] N. B. Toomarian and J. Barhen, "Learning a trajectory using adjoint functions and teacher forcing," Neural Networks, vol. 5, no. 3, pp. 473–484, 1992.
[39] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, "Onsets and Frames: Dual-objective piano transcription," arXiv preprint arXiv:1710.11153, 2017.
[40] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, "mir_eval: A transparent implementation of common MIR metrics," in Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR, 2014.
[41] R. Yang, T. Chen, Y. Zhang, and G. Xia, "Inspecting and interacting with meaningful music representations using VAE," arXiv preprint arXiv:1904.08842, 2019.
[42] A. Watt and M. Watt, "Advanced Animation and Rendering Techniques," 1992.
[43] K. Chen, G. Xia, and S. Dubnov, "Continuous melody generation via disentangled short-term representations and structural conditions," 2020, pp. 128–135.
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," ArXiv, vol. abs/1706.03762, 2017.
[45] H. Scheffé, The Analysis of Variance. John Wiley & Sons, 1999.