Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs
Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, Yi-Hsuan Yang
1 Yating Team, Taiwan AI Labs, Taiwan; 2 Academia Sinica, Taiwan
{wayne391, jyliu, yyeh, yhyang}@ailabs.tw

Abstract
To apply neural sequence models such as Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite, pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset time) along the time grid. While different types of tokens may possess different properties, existing models usually treat them equally, in the same way as modeling words in natural languages. In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types. We also propose a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types. With an expansion-compression trick, we convert a piece of music to a sequence of compound words by grouping neighboring tokens, greatly reducing the length of the token sequences. We show that the resulting model can be viewed as a learner over dynamic directed hypergraphs. And, we employ it to learn to compose expressive Pop piano music of full-song length (involving up to 10K individual tokens per song), both conditionally and unconditionally. Our experiment shows that, compared to state-of-the-art models, the proposed model converges 5–10 times faster at training (i.e., within a day on a single GPU with 11 GB memory), with comparable quality in the generated music.
Introduction
To apply neural sequence models such as recurrent neural networks (RNNs) or Transformers (Vaswani et al. 2017) to automatic music composition (a.k.a., symbolic-domain music generation), one has to represent a piece of music as a sequence of tokens drawn from a pre-defined vocabulary (Oore et al. 2018). Unlike the case in text, such a vocabulary usually involves tokens of various types. For example, to represent a musical score, we may need tokens that describe the content of the musical notes (e.g., pitch and duration), their placement along time, the instrument that plays each note, as well as indicators of metrical events such as the beginning of a new beat, bar (measure), or musical phrase (Wu and Yang 2020). We need such a diverse set of tokens because music is multifaceted; a type alone captures only a certain
aspect of music (e.g., melody, harmony, rhythm, timbre) and cannot faithfully represent a music piece.

Figure 1: Illustration of the main ideas of the proposed compound word Transformer: (left) compound word modeling that combines the embeddings (colored gray) of multiple tokens {w_{t-1,k}}_{k=1}^K, one for each token type k, at each time step t-1 to form the input x⃗_{t-1} to the self-attention layers, and (right) token-type-specific feed-forward heads that predict the list of tokens for the next time step t at once at the output.

As different types of (musical) tokens may have different properties, modeling the dependency of these tokens might not be the same as modeling words in text. However, to our best knowledge, little work has been done to explicitly account for the heterogeneity of tokens in music. The tokens are mostly treated equally, in the same way as words in text (Huang et al. 2019; Payne 2019; Huang and Yang 2020).

We are therefore motivated to study in this paper whether we can improve the sequence modeling of music by highlighting the role of token types. Our first proposal is to customize the prediction heads for tokens of different types. Specifically, using the Transformer as the main architecture of the underlying sequence model, we approach this by using different feed-forward heads for tokens of different types.

Our second proposal is to group consecutive and related tokens in a token sequence into "compound words," and then perform sequence modeling over the resulting sequence of compound words. This is to capture the co-occurrence relationship of tokens—e.g., to generate a new musical note, we may need at least two consecutive tokens to indicate its pitch and duration; to change the tempo in the middle of a piece of music, we need a token to indicate the target tempo value, and a co-occurring time-related token to indicate the time of the tempo change.
Under the proposed compound-word modeling, the individual tokens (e.g., pitch and duration) are still predicted separately with different heads. Yet, instead of predicting them at different time steps, we predict multiple tokens of various types at once in a single time step. The token embeddings of the tokens predicted at the current step are then combined and fed as the input for the next time step. Namely, the self-attention is computed over the combined embeddings of the individual tokens of a compound word.

From a theoretical point of view, the proposed model can be interpreted as a learner over discrete-time dynamic directed hypergraphs (Kazemi et al. 2020). Here, a graph consists of nodes that each correspond to a token in our vocabulary. A sequence of tokens can then be viewed as a sequence of edges (each connecting two nodes), or a walk, over this graph. A sequence of compound words, in contrast, can be viewed as a sequence of hyperedges (each connecting multiple nodes) (Feng et al. 2019) over the same graph. We discuss this at greater length later in the paper.

We refer to the proposed representation as the compound word representation, or CP for short. CP can be considered an extension of existing representations, with the following additional merits. First, it allows for fine-grained, type-specific control over the prediction heads. For example, we can now use different loss functions, sampling policies, and token embedding sizes for different token types.

Second, as a compound word represents multiple tokens at once, generating a music piece requires far fewer time steps. Namely, the sequence length of the same music piece is much shorter in CP than in existing representations. As the computational complexity of a Transformer is related to the sequence length (Vaswani et al. 2017), this makes training and inference faster, and may facilitate learning the long-range dependency in music (for example, we can study in the future whether the proposed model creates music with better "structureness," or long-term repetitions (Wu and Yang 2020; Jhamtani and Berg-Kirkpatrick 2019)).

Finally, the sequence length in CP is determined by the number of compound words in a sequence, not by the number of individual tokens per compound word. Therefore, it is possible to add new token types (by adding the corresponding feed-forward heads) to increase the expressivity of the representation, without increasing the sequence length. This makes it easy to extend the underlying representation, though we do not explore this potential in this work.

For the performance study, we consider generating expressive Pop piano music at full-song scale in both the unconditional setting (i.e., from scratch) and the conditional setting (i.e., generating the piano arrangement given a lead sheet). This involves modeling fairly long music sequences, of up to 10K individual tokens each. We show that, with CP, we are able to train a linear Transformer decoder (Katharopoulos et al. 2020) with music quality similar to that of strong baselines, with faster training and inference time. We provide audio examples and open-source the project in a GitHub repository (https://github.com/YatingMusic/compound-word-transformer).

Related Work
Both language and music have principles governing the organization of discrete structural elements (e.g., words or musical notes) into sequences (Patel 2003). As such, Transformers, which were first shown to work well for text generation (Child et al. 2019; Keskar et al. 2019), have been increasingly applied to music generation in recent years, by treating music pieces as sequences of discrete tokens akin to text words. We list some related papers in Table 1.

                                         Representation   Model               Attn. window   Voc. size   Data type
Music Transformer (Huang et al. 2019)    MIDI-like        Transformer         2,048          388         Classical piano performance
MuseNet (Payne 2019)                     MIDI-like*       Transformer         4,096          N/A         Multi-track MIDI
LakhNES (Donahue et al. 2019)            MIDI-like*       Transformer-XL      512            630         Multi-track MIDI
TR autoencoder (Choi et al. 2020)        MIDI-like        Transformer         2,048          388         Classical piano performance
Pop Music TR (Huang and Yang 2020)       REMI             Transformer-XL      512            332         Pop piano performance
Transformer VAE (Jiang et al. 2020)      MIDI-like        Transformer         128            47          Pop lead sheets
Guitar Transformer (Chen et al. 2020)    REMI*            Transformer-XL      512            221         Guitar tabs
Jazz Transformer (Wu and Yang 2020)      REMI*            Transformer-XL      512            451         Jazz lead sheets
MMM (Ens and Pasquier 2020)              MIDI-like*       Transformer         2,048          >442        Multi-track MIDI
This work                                CP               linear Transformer  5,120          350         Pop piano performance

Table 1: A comparison of existing Transformer-based models and the proposed one for automatic music composition. The representations marked with * are extensions of either MIDI-like (Oore et al. 2018) or REMI (Huang and Yang 2020).

Table 1 shows that most existing work adopts a music representation derived from either MIDI-like (Oore et al. 2018) or REMI (Huang and Yang 2020), with the possible addition of track- or structure-related tokens. MIDI-like and REMI differ mainly in how the advance of time is represented: the former uses [time shift] tokens to mark the time interval (in absolute time) between note-related tokens, whereas the latter assumes symbolic timing and uses [bar] and [position] tokens to place tokens on a metrical grid that uniformly divides a bar into a certain number of positions. Neither MIDI-like nor REMI groups the tokens by token type. (Upon paper completion, we noticed an early but preliminary attempt at grouping tokens in (Hawthorne et al. 2018b).)

Existing work also differs in the length of the attention window (see the methodology section for the definition) and the vocabulary size (which is data- and task-dependent). To our knowledge, our work represents the first attempt to consider Pop music modeling at full-song scale (involving up to 10K tokens per song), and the first to use the recently proposed linear Transformer (Katharopoulos et al. 2020) as the model backbone.

Methodology
Background
For sequence modeling, we need a conversion function g(·) that converts a music piece X to a time-ordered sequence of symbolic elements S = g(X) = {w_1, w_2, ..., w_T}, where T denotes the resulting sequence length. Given a number of such sequences, we train a neural sequence model with an architecture such as the Transformer decoder to learn to generate new sequences S′. We then use a deterministic inverse function g^{-1}(·) to get a new music piece from such a generated sequence, namely X′ = g^{-1}(S′). There can be different algorithms to implement the conversion function and its inverse, leading to numerous possible sequence representations of the same music piece, e.g., S_MIDI-like = g_MIDI-like(X) and S_REMI = g_REMI(X). Different conversion functions (or sequence representations) assume different vocabulary sizes M, so S_MIDI-like and S_REMI differ in both T and M.

A Transformer decoder comprises a stack of self-attention layers and a stack of feed-forward layers. The self-attention layers operate on a fixed-length sub-sequence of S to learn the dependency among the elements. The length of such a sub-sequence, a.k.a. the attention window, denoted as N, is usually much smaller than T, as N directly affects the space complexity of the model: for the vanilla Transformer (Vaswani et al. 2017) and its faster variant Transformer-XL (Dai et al. 2019), the complexity grows quadratically with N, i.e., O(N^2); for the linear Transformer (Katharopoulos et al. 2020), it grows linearly, i.e., O(N).
Individual Tokens vs Compound Words
In this paper, we refer to the elements in either S_MIDI-like or S_REMI as individual tokens. They are drawn from a pre-defined vocabulary V = {1, ..., M}. As mentioned in the introduction, each token is associated with a type defined in the type set K = {1, ..., K}. We can accordingly partition V into K type-specific subsets, {V_k}_{k=1}^K.

We propose to convert a sequence of tokens (e.g., S_REMI) into a sequence of compound words S_CP with the following procedure. First, neighboring tokens that together define a musical event are grouped into a super token, i.e., placed on the same time step, as illustrated in Figures 2(a)–(b). A musical event here can be a note-related one, i.e., to create a new musical note, or a metric-related one, e.g., to mark the beginning of a new beat or a new bar. For example, in REMI, a note is created by consecutive tokens of [pitch], [duration], and [velocity], which are grouped in CP. And, a tempo or chord change in REMI takes place only at beat times, so we also group [beat], [chord], and [tempo]. Accordingly, the model has to make multiple predictions (i.e., generate multiple tokens) at each time step.

Second, we fill the missing token types per time step with "[ignore]" tokens, so that at each step there are consistently K tokens to be predicted, as illustrated in Figure 2(c). This is to make computational modeling feasible, as otherwise the shape and meaning of the target output at each time step would be uncertain. In other words, a compound word is composed of a list of K tokens, each drawn from the corresponding subset V_k ∪ {[ignore]}, that are placed on the same time step t. Formally, S_CP = g_CP(X) = {cp_t}_{t=1}^{T_CP}, in which cp_t = {w_{t,1}, ..., w_{t,K}}. We view this conversion function g_CP(·) as performing an expansion-compression trick, as the original sequence is first expanded to a sequence of K·T_CP individual tokens, and then compressed to a sequence of T_CP compound words; in general, T_CP < T_REMI < K·T_CP.

Figure 2: An example illustrating the conversion from a sequence of REMI tokens (Huang and Yang 2020) into a (shorter) sequence of compound words: (a) the REMI representation, (b) the tokens grouped, (c) the resulting compound words. A compound word comprises a number of grouped tokens and the [ignore] tokens (colored white in (c)), as well as a family token (N: note-related or M: metric-related). Best seen in color.

To facilitate modeling the CP, we further partition the type set K into F families. For example, if K can be partitioned into two families, the note family K_N and the metric family K_M (marked as 'N' and 'M' in Figure 2(c)), we would have K = K_N ∪ K_M and K_N ∩ K_M = ∅. Each compound word cp_t is associated with a family token f_t. For a metric-related cp_t, we would have w_{t,k} = [ignore] for k ∈ K_N. Similarly, for a note-related cp_t, w_{t,k} = [ignore] for k ∈ K_M.
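To make the grouping and [ignore]-filling concrete, below is a minimal Python sketch of the expansion-compression step under simplifying assumptions: the input is a REMI-like list of (type, value) pairs, the type and token names are illustrative (following Figure 2), and the simple flush-on-repeat/flush-on-family-change heuristic stands in for the event grouping of the actual implementation; it is not the released code.

```python
# Minimal sketch: group REMI-like tokens into compound words and pad the
# missing token types with "[ignore]". Token/type names are hypothetical.

NOTE_TYPES   = ["pitch", "duration", "velocity"]
METRIC_TYPES = ["position/bar", "tempo", "chord"]
ALL_TYPES    = NOTE_TYPES + METRIC_TYPES

def to_compound_words(remi_tokens):
    """remi_tokens: list of (type, value) pairs in REMI order."""
    compound_words, current, family = [], {}, None
    for tok_type, value in remi_tokens:
        tok_family = "note" if tok_type in NOTE_TYPES else "metric"
        # A new event starts when the family changes or the same type would
        # be overwritten; flush the running group first.
        if current and (tok_family != family or tok_type in current):
            compound_words.append(_finalize(current, family))
            current = {}
        current[tok_type] = value
        family = tok_family
    if current:
        compound_words.append(_finalize(current, family))
    return compound_words

def _finalize(group, family):
    # Fill missing types with "[ignore]" so every compound word has exactly
    # K tokens plus one family token.
    cp = {t: group.get(t, "[ignore]") for t in ALL_TYPES}
    cp["family"] = family
    return cp

remi = [("position/bar", "bar"), ("position/bar", "beat 1"),
        ("chord", "Cmaj"), ("tempo", 120),
        ("pitch", 60), ("duration", "8th"), ("velocity", 20)]
for cp in to_compound_words(remi):
    print(cp)
```

Running the sketch on the toy REMI stream above yields three compound words (a bar marker, a metric word carrying the beat/chord/tempo, and a note word), illustrating how the sequence length shrinks from seven individual tokens to three time steps.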
Combining Token Embeddings of Adaptive Sizes
As input to Transformers, an element in a sequence is represented by an embedding vector, x_t ∈ R^d, to which a positional embedding vector is added (Ke, He, and Liu 2020). In CP, we propose to form the embedding vector of a compound word cp_t by combining the embedding vectors p_{t,k} of the composing tokens w_{t,k}, as well as an embedding vector q_t associated with the family token f_t. Specifically, we combine the vectors by first concatenating them, and then linearly projecting the resulting long vector to a d-dimensional vector with a projection matrix W_in. Namely,

    p_{t,k} = Embedding_k(w_{t,k}),  k = 1, ..., K,
    q_t = Embedding_F(f_t),
    x_t = W_in [p_{t,1} ⊕ ... ⊕ p_{t,K} ⊕ q_t],
    x⃗_t = PositionalEncoding(x_t),                         (1)

where ⊕ denotes vector concatenation, and Embedding_k(·) and Embedding_F(·) involve the use of lookup tables.

In essence, x_t can be considered a compressive representation of the composing tokens w_{t,k} and the family token f_t. We note that the action of compressing the embeddings is reminiscent of the main idea of the Compressive Transformer (Rae et al. 2020), which proposes to compress past memories beyond the attention window for long-range sequence learning. Unlike it, we compress the memories within the attention window defined over the individual tokens.

A main merit of CP is that we can customize the settings for different token types. Inspired by the adaptive word representation (Baevski and Auli 2018), we use different embedding sizes d_k for tokens of different types, i.e., p_{t,k} ∈ R^{d_k}. We basically use a larger d_k for token types with a larger vocabulary size |V_k|. See Table 3 for details.
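A minimal PyTorch sketch of the input module of Eq. (1) is given below. It assumes the CP vocabulary and embedding sizes of Table 3 (special tokens included) and omits the positional encoding; it is illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class CompoundWordInput(nn.Module):
    """Eq. (1): type-specific embeddings of adaptive sizes are concatenated
    and linearly projected to the model dimension d."""
    def __init__(self, vocab_sizes, embed_sizes, d_model=512):
        super().__init__()
        # one lookup table per token type, plus one for the family token
        self.embeddings = nn.ModuleList(
            nn.Embedding(v, e) for v, e in zip(vocab_sizes, embed_sizes))
        self.proj = nn.Linear(sum(embed_sizes), d_model)   # W_in

    def forward(self, token_ids):
        # token_ids: (batch, seq_len, K+1) integer indices, one column per type
        parts = [emb(token_ids[..., k]) for k, emb in enumerate(self.embeddings)]
        return self.proj(torch.cat(parts, dim=-1))          # (batch, seq_len, d)

# Example with the CP sizes of Table 3 (special tokens counted in), in the order
# track, tempo, position/bar, chord, pitch, duration, velocity, family.
vocab_sizes = [3, 60, 18, 135, 87, 18, 25, 4]
embed_sizes = [3, 128, 64, 256, 512, 128, 128, 32]
module = CompoundWordInput(vocab_sizes, embed_sizes)
x = module(torch.zeros(2, 16, 8, dtype=torch.long))
print(x.shape)  # torch.Size([2, 16, 512])
```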
Multi-head Output Module
A main proposal of our work is to use different feed-forward heads for tokens of different types in a Transformer. Specifically, we have (K + 1) heads in total, one for each token type V_k and an additional one for the token family F.

Instead of working on the K + 1 heads at the same time, we devise a two-stage setting that predicts the family token first, and then the remaining tokens given the family token. Specifically, at the t-th time step, the feed-forward procedure can be summarized as:

    h_t = Self-attn(x⃗_{t-1}),
    f̂_t = Sample_F(softmax(W_F h_t)),
    h_t^out = W_out [h_t ⊕ Embedding_F(f̂_t)],
    ŵ_{t,k} = Sample_k(softmax(W_k h_t^out)),  k = 1, ..., K,        (2)

where W_F and {W_k}_{k=1}^K are the K + 1 feed-forward heads, Self-attn(·) denotes the causal self-attention layers, and Sample(·) is a sampling function. We empirically find that this two-stage setting makes it easier for the model to predict w_{t,k} = [ignore] for the types k not in the target family K_{f̂_t}.

Figure 1 illustrates Eqs. (1)–(2) at work, omitting the first-stage prediction of f̂_t at the output due to space limits.
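The two-stage output module of Eq. (2) can be sketched as follows. For brevity, the sampling functions Sample_F and Sample_k are replaced by a greedy argmax, and the code is a simplified illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompoundWordOutput(nn.Module):
    """Eq. (2): predict the family token first, then condition the K
    type-specific heads on its embedding."""
    def __init__(self, vocab_sizes, n_family, d_model=512, d_family=32):
        super().__init__()
        self.family_head = nn.Linear(d_model, n_family)          # W_F
        self.family_emb = nn.Embedding(n_family, d_family)       # Embedding_F
        self.out_proj = nn.Linear(d_model + d_family, d_model)   # W_out
        self.type_heads = nn.ModuleList(
            nn.Linear(d_model, v) for v in vocab_sizes)          # W_k

    def forward(self, h_t):
        # h_t: (batch, d_model) hidden state from the causal self-attention stack
        f_hat = F.softmax(self.family_head(h_t), dim=-1).argmax(-1)
        h_out = self.out_proj(torch.cat([h_t, self.family_emb(f_hat)], dim=-1))
        w_hat = [F.softmax(head(h_out), dim=-1).argmax(-1) for head in self.type_heads]
        return f_hat, w_hat

heads = CompoundWordOutput(vocab_sizes=[3, 60, 18, 135, 87, 18, 25], n_family=4)
f_hat, w_hat = heads(torch.randn(2, 512))
print(f_hat.shape, len(w_hat))  # torch.Size([2]) 7
```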
Adaptive Sampling Policy
At inference time, we use stochastic temperature-controlled sampling (Holtzman et al. 2020) to avoid degeneration and to increase diversity. With CP, we can employ a different sampling policy Sample_k(·) for each token type; see Table 3.
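As an illustration, a minimal NumPy sketch of such τ-tempered top-ρ ("nucleus") sampling is given below; the per-type (τ, ρ) values would follow Table 3, and the helper name is ours rather than part of the released code.

```python
import numpy as np

def sample_token(logits, temperature=1.2, top_p=0.9, rng=None):
    """Temperature-reshaped nucleus sampling over one token type's logits."""
    if rng is None:
        rng = np.random.default_rng()
    tempered = logits / temperature
    probs = np.exp(tempered - np.max(tempered))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                          # most probable first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]   # smallest nucleus with mass >= top_p
    nucleus = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=nucleus))

# e.g., a flat, high-temperature policy as used for [velocity] in Table 3
logits = np.random.randn(25)
print(sample_token(logits, temperature=5.0, top_p=1.0))
```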
Graph Interpretation
We discuss the proposed model from a graph-theoretical point of view below. Given a vocabulary of tokens, we can construct a fully-connected static graph G = (V, E) (Kivelä et al. 2014) comprising the nodes V = {1, ..., M} and the edges E = V × V. Each node corresponds to an individual token in our vocabulary. This way, a token sequence S_MIDI-like or S_REMI can be viewed as a sequence of edges (each connecting two nodes), or a walk, over this graph.

In CP, the vocabulary (and accordingly the graph) is augmented with a set of special tokens, denoted as V*, that includes for example the type-specific [ignore] tokens and the family tokens. And, a compound word consists of K + 1 nodes, one from each of the K types and an additional one from the set of family tokens. A sequence of compound words, namely S_CP, therefore involves transitions from K + 1 nodes to another K + 1 nodes per time step. Such a transition can be viewed as a directed hyperedge (Feng et al. 2019; Jiang et al. 2019) that connects at once the K + 1 source nodes (e.g., cp_{t-1}) to the K + 1 target nodes (cp_t). It is directed because the order of the nodes matters (i.e., from t-1 to t).

A sequence of compound words also forms a dynamic directed hypergraph (Kazemi et al. 2020): {G_1, G_2, ..., G_T}, where G_t = (V, E_t). Starting from an empty graph with no edges, at each time step t we add a new directed hyperedge, labeled with the time step t, connecting in total 2(K + 1) nodes. In practice, we have a [BOS] token (beginning of sequence) and an [EOS] token (end of sequence), so the hyperedges at t = 1 and t = T connect to only K + 2 nodes.

A neural model for graphs, or a graph neural network (GNN), can be regarded as an encoder-decoder pair (Kazemi et al. 2020; Rossi et al. 2020), where an encoder is a function that maps a graph G to node embeddings z_i, i = 1, ..., M, and a decoder takes as input one or more node embeddings and makes a prediction based on these, e.g., node classification or edge prediction. The proposed CP Transformer can therefore be regarded as a learner over dynamic directed hypergraphs, as at each time step t it manages to predict the next hyperedge to be added (i.e., ŵ_{t,k} and f̂_t) based on the node embeddings updated from G_{t-1}.

To test the effectiveness of the proposed methods, we implement a CP Transformer that learns to generate Pop piano music with human performance characteristics such as expressive variations in velocity (i.e., the force with which a note is played, which is related to loudness) and tempo (Oore et al. 2018; Lerch et al. 2019). We consider Pop piano for its richness and expressivity, and for offering a direct performance comparison with the Pop Music Transformer (Huang and Yang 2020) (see Table 1).

Specifically, we consider both the conditional and unconditional generation tasks. In the former, a lead sheet (i.e., a melody line and an accompanying sequence of chord labels) is given, and the model has to generate a piano performance accordingly. In the latter, the model freely generates a piano performance of full-song length from scratch. We intend to compare CP with REMI in our evaluation. We provide the implementation details below.

Table 2: Statistics of the sequence length T (mean ± std and maximum) of the REMI and CP representations under the conditional and unconditional tasks.

Dataset
We collect the audio files of 1,748 pieces of Pop piano from the Internet. The average length of the songs is about 4 minutes, and we have about 108 hours in total. All the songs are in 4/4 time signature (four beats per bar).
We convert each song (an audio file) into a symbolic sequence as follows.
• Transcription: We use a state-of-the-art RNN model for automatic piano transcription, "Onsets and Frames" (Hawthorne et al. 2018a), to estimate the pitch, onset and offset time, and velocity of the musical notes from the audio.
• Synchronization: To get symbolic timing from the original wall-clock time, we use the RNN-based model available in the Python package madmom (Böck et al. 2016) to estimate the downbeat and beat positions, which represents the state of the art for the task. Then, we interpolate 480 ticks between two adjacent beats, and map each absolute time to its corresponding tick (see the sketch after this list). By doing so, we can keep the tiny timing offsets. Lastly, we infer the tempo changes from the time interval between adjacent beats.
• Quantization: We quantize the tempo, velocity, duration, and beat positions to reduce the size of the vocabulary. For example, we set the 16th note as our basic time unit. See Table 3 for the number of tokens per type.
• Analysis: For the conditional generation task, we estimate the melody notes and chord symbols from the transcription result to form the lead sheets. Specifically, we develop an in-house rule-based chord recognition algorithm (https://github.com/joshuachang2311/chorder) to recognize 12 roots and 7 chord qualities. We use the "Skyline algorithm" (Uitdenbogerd and Zobel 1999) to extract the melodies. And, as a lead sheet is usually of coarser time resolution, we quantize the chord symbols and melody notes to quarter notes (i.e., beat times).
We randomly hold out 50 songs for testing, and use the remaining songs for training the Transformers.
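The sketch below illustrates the synchronization and quantization steps under the stated assumptions (480 ticks interpolated between adjacent beats, a 16th-note grid); the beat times are assumed to be given (e.g., by madmom), and the helper functions are illustrative rather than the actual preprocessing code.

```python
import numpy as np

TICKS_PER_BEAT = 480          # resolution interpolated between adjacent beats
GRID = TICKS_PER_BEAT // 4    # 16th-note grid used as the basic time unit

def seconds_to_ticks(onset_sec, beat_times):
    """Map an absolute onset time (seconds) to ticks by linear interpolation
    between the two surrounding (estimated) beats."""
    beat_times = np.asarray(beat_times)
    i = np.clip(np.searchsorted(beat_times, onset_sec) - 1, 0, len(beat_times) - 2)
    frac = (onset_sec - beat_times[i]) / (beat_times[i + 1] - beat_times[i])
    return (i + frac) * TICKS_PER_BEAT

def quantize(ticks, grid=GRID):
    """Snap a tick value to the nearest grid position (here, 16th notes)."""
    return int(round(ticks / grid)) * grid

beats = [0.50, 1.02, 1.53, 2.05]     # beat times in seconds (e.g., from madmom)
onset = 1.29                         # a transcribed note onset
print(quantize(seconds_to_ticks(onset, beats)))   # -> 720
```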
Vocabulary
To represent the content of a piano performance, the basic setting employs tokens of six types: three note-related types, [pitch], [duration], and [velocity], and three metric-related types, [position/bar], [tempo], and [chord]. The specific vocabulary is task-dependent and is introduced below.

Conditional generation — We additionally use [track] tokens to mark whether it is the lead sheet track (i.e., the condition) or the piano track (the track to be generated). While the piano track (i.e., the sub-sequence after the [track=piano] token) involves all the six types of tokens mentioned above, the lead sheet track only involves the composition-related tokens [position/bar], [chord], [pitch], and [duration], not the performance-related tokens [velocity] and [tempo]. In CP, we have three family tokens: [family=track], [family=note], and [family=metric]. Moreover, we have type-specific [ignore] tokens and an additional [conti] token for the beat positions having no tempo or chord changes.

Unconditional generation — This task only concerns the piano track, so we do not need the [track] tokens. But, as it concerns full-song generation, we add an [EOS] token to signify the end of a sequence. We view it as a family token, so there are three possible family tokens here: [family=EOS], [family=note], and [family=metric].

Details of the adopted representations are shown in Tables 2 and 3. Table 2 compares the sequence length T of REMI and CP. We can see that S_CP is much shorter than S_REMI, especially under the conditional task. Table 3 displays the size of each vocabulary subset V_k. We see that CP and REMI have a similar total vocabulary size M. REMI does not use the family tokens (except for [EOS]) or the other special tokens.

Repre.  Token type      Voc. size |V_k|   Embed. size (d_k)   τ      ρ
CP      [track]         2 (+1)            3                   1.0    0.90
        [tempo]         58 (+2)           128                 1.2    0.90
        [position/bar]  17 (+1)           64                  1.2    1.00
        [chord]         133 (+2)          256                 1.0    0.99
        [pitch]         86 (+1)           512                 1.0    0.90
        [duration]      17 (+1)           128                 2.0    0.90
        [velocity]      24 (+1)           128                 5.0    1.00
        [family]        4                 32                  1.0    0.90
        total           341 (+9)          —                   —      —
REMI    total           338               512                 1.2    0.90

Table 3: Details of the CP representation in our implementation, including the sampling policy Sample_k(·) (τ-tempered top-ρ sampling). For the vocabulary size, the values in the parentheses denote the number of special tokens such as [ignore].

Model Settings
For the backbone architecture of our model, we employ the linear Transformer (Katharopoulos et al. 2020) (https://github.com/idiap/fast-transformers), as its complexity is a linear function of the length of the attention window N. Moreover, we set N equal to the sequence length T for our model. That is, no segmentation over the training sequences is done, and thereby all the tokens in a sequence can be accessed by our model under causal masking, without using tricks such as memory caching (Dai et al. 2019) or memory compression (Rae et al. 2020). (We set an upper limit on the number of elements per sequence, e.g., 10,240 tokens in REMI, and remove overly long songs, which amounts to removing 25–88 songs from the training set, depending on the task and the adopted representation.) We refer to our model as CP + linear in what follows.

For the baselines, we employ the Pop Music Transformer (Huang and Yang 2020), which is open-source (https://github.com/YatingMusic/remi) and represents the state of the art for unconditional music composition. This REMI + XL model adopts the REMI representation and uses Transformer-XL (Dai et al. 2019) as the model backbone. As its complexity grows quadratically with N, we set N = 512, following (Huang and Yang 2020). Moreover, we consider one more baseline that replaces Transformer-XL with the linear Transformer, using also N = T, to offer a sensible performance comparison between CP and REMI. We refer to this variant as REMI + linear.

We use 12 self-attention layers, each with 8 attention heads, for all the models for a fair comparison. The model hidden size and the inner layer of the feed-forward part are set to 512 and 2,048, respectively. For the token embedding size d, we fix it to 512 for REMI, following (Huang and Yang 2020). For CP, we set it adaptively based on the vocabulary size of each token type, as shown in Table 3. For sampling, we employ "nucleus sampling" (Holtzman et al. 2020), a stochastic method that samples from the smallest subset of tokens whose cumulative probability mass exceeds a threshold ρ ∈ [0, 1]. Before sampling, we reshape the probability distribution of the tokens (e.g., softmax(W_k h_t^out)) through a "temperature" (Ackley, Hinton, and Sejnowski 1985), with the temperature parameter τ > 0. As Table 3 also shows, we use different ρ and τ for different token types.
For example, we use a large τ to encourage diverse velocity values.

The conditional generation task can be approached with a sequence-to-sequence model, since we have paired data of lead sheets and piano performances (i.e., the former is extracted automatically from the latter). Instead of adding a Transformer encoder to realize this (as done in (Choi et al. 2020)), we use the encoder-free "Prefix LM" method of Google's "T5" model (Raffel et al. 2020), and run a single Transformer over an interleaved sequence of lead sheets and piano performances. Specifically, a sequence of lead sheet and the corresponding target sequence of piano performance are integrated into one sequence, bar after bar. That is, the integrated sequence has the form of {..., [bar], [track=leadsheet], (content of the lead sheet for a bar), [track=piano], (content of the piano for the same bar), [bar], (content of the two tracks of the next bar), ...}. This makes it easy to learn the dependency of the two tracks, and to impose the pre-given lead sheet at inference time.
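The bar-wise interleaving can be sketched as follows. The token strings are purely illustrative placeholders (the actual model operates on compound words rather than raw strings), and the helper is not part of the released code.

```python
def interleave(lead_sheet_bars, piano_bars):
    """Prefix-LM-style interleaving: for every bar, the lead-sheet tokens
    (the condition) precede the piano tokens (the target), so a single
    causal Transformer can attend to the condition of the current bar."""
    seq = []
    for ls_bar, piano_bar in zip(lead_sheet_bars, piano_bars):
        seq += ["[bar]", "[track=leadsheet]"] + ls_bar
        seq += ["[track=piano]"] + piano_bar
    return seq

lead = [["chord Cmaj", "pitch 64", "duration 4th"],
        ["chord G7", "pitch 62", "duration 4th"]]
piano = [["pitch 48", "duration 8th", "velocity 18"],
         ["pitch 43", "duration 8th", "velocity 17"]]
print(interleave(lead, piano))
```

At inference time, the same layout lets us feed the known lead-sheet tokens of each bar as a prefix and let the model continue with the piano tokens of that bar.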
Quantitative Evaluation
The experiments hereafter are conducted in the interest of a resource-constrained scenario, assuming that we only have a single GPU with 11 GB memory and are only willing to train a model for 3 days. We conjecture that this makes sense for most middle-size academic labs worldwide. Yet, to have an idea of the model performance when more resources are available, we include in the evaluation of the conditional task two settings that exceed this specification.

We first compare the efficiency of the models in terms of training time, inference time, and GPU memory usage, under the conditional setting. The average result over the 50 held-out test songs is shown in Table 4.

Task           Repre. + model   Training time   GPU memory   Inference time (sec)   Tokens (/song)   Matchness (melody)   Matchness (chord)
Conditional    REMI + XL        3 days          4 GB         88.4                   4,782            0.872                0.785
Conditional    REMI + XL        7 days          4 GB         91.5                   4,890            0.866                0.800
Conditional    REMI + linear    3 days          17 GB        48.9                   4,327            0.779                0.709
Conditional    CP + linear      0.6 days        10 GB        29.2                   18,200           0.829                0.733
Unconditional  REMI + XL        3 days          4 GB         139.9                  7,680            —                    —
Unconditional  CP + linear      1.3 days        9.5 GB       19.8                   9,546            —                    —

Table 4: Quantitative evaluation result of different models. REMI + XL represents a re-implementation of the state-of-the-art Pop Music Transformer (Huang and Yang 2020), while CP + linear stands for the proposed CP Transformer.

GPU memory usage. Table 4 shows that both CP + linear and REMI + XL require < 11 GB of GPU memory for training. Accordingly, in our implementation, we train them (separately) on an NVIDIA RTX 2080 Ti GPU (with 11 GB memory). In contrast, REMI + linear requires 17 GB of GPU memory, so we train it on a TITAN GPU with 24 GB memory.

Training time. We see that the REMI-based models require much longer clock time to reach a low training loss. While it takes nearly 7 days for REMI + XL to reduce the negative log-likelihood (NLL) of the training data to 0.27, it takes only 0.6 days for CP + linear to reach the same NLL. Such training efficiency is desirable (especially given that it is on a single 2080 Ti GPU), as it makes further extensions and modifications of the model easy and affordable.

Inference time. CP + linear is remarkably fast, taking on average < 30 seconds to complete the conditional generation of a song. As a song in our dataset is about 4 minutes long, this is much faster than real time. In contrast, REMI + XL and REMI + linear are about 3x and 1.7x slower, respectively. CP + linear is fast because it generates in total 8 individual tokens (of different types) at once per time step.

Table 4 also compares the efficiency of REMI + XL and CP + linear under the unconditional setting, for which we also generate 50 songs (from scratch) and report the average inference time. We see that CP + linear is even faster here, requiring only < 20 seconds to create a new song of full-song length. In contrast, REMI + XL is on average 7x slower.

Next, we compare the performance of the models in terms of two objective metrics, also under the conditional setting. As the goal is to generate a song given a lead sheet, we can measure whether the generated song has a melody line and chord progression similar to those in the given condition, and take that as a figure of merit. (In contrast, proper objective evaluation of unconditional generation models remains an open issue (Yang and Lerch 2020; Dong et al. 2020; Wu and Yang 2020).) Specifically, we consider the following two metrics; a simple sketch of both follows the list.
• Melody matchness. We represent the lead sheet and the correspondingly generated piano both in the REMI format and compute the bar-wise longest common sub-sequence (LCS) of the two resulting sequences, S^LS_REMI and Ŝ^piano_REMI. When two notes (one from each sequence) have the same pitch and close onset times (within an eighth note), we consider that a match. We divide the length of the LCS by the number of [pitch] tokens in S^LS_REMI (i.e., the number of target melody notes) of that bar, and take the average of this ratio across all the bars of a song as a simple measure of melody matchness.
• Chord matchness. The chroma vector (Fujishima 1999) represents a short-time fragment of music by the distribution of energy across the 12 pitch classes (C, C#, etc.) and offers a simple way to evaluate the harmonic similarity between two fragments. We calculate the segment-wise cosine similarity between the chroma vector representing each chord label of a lead sheet (which is binary-valued) and the chroma vector of the correspondingly generated piano segment (normalized by the maximum value so that it lies in [0, 1]), and treat the average value across time as a measure of chord matchness.
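A simplified sketch of the two matchness measures is given below; it ignores the onset-time tolerance and the bar-wise/segment-wise averaging for brevity, and the helper names are ours rather than part of the released code.

```python
import numpy as np

def melody_matchness(lead_pitches, piano_pitches):
    """Per-bar sketch: length of the longest common sub-sequence (LCS) of
    the lead-sheet melody pitches and the generated piano pitches, divided
    by the number of target melody notes."""
    m, n = len(lead_pitches), len(piano_pitches)
    lcs = np.zeros((m + 1, n + 1), dtype=int)
    for i in range(m):
        for j in range(n):
            lcs[i + 1, j + 1] = (lcs[i, j] + 1 if lead_pitches[i] == piano_pitches[j]
                                 else max(lcs[i, j + 1], lcs[i + 1, j]))
    return lcs[m, n] / max(m, 1)

def chord_matchness(chord_chroma, piano_chroma):
    """Cosine similarity between a binary chord-label chroma vector and the
    (max-normalized) chroma vector of the generated piano segment."""
    piano_chroma = piano_chroma / (piano_chroma.max() + 1e-8)
    return float(np.dot(chord_chroma, piano_chroma) /
                 (np.linalg.norm(chord_chroma) * np.linalg.norm(piano_chroma) + 1e-8))

print(melody_matchness([60, 62, 64, 65], [60, 62, 63, 64]))        # 0.75
c_major = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], float)   # C, E, G
print(round(chord_matchness(c_major, c_major * 3.0), 2))           # 1.0
```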
Each batch comprises the result of the evaluated mod-els in random order. A subject has to rate the music for threerandom batches for each setting separately, in terms of thefollowing aspects on a five-point Likert scale. 1) Fidelity :is the conditionally generated piece similar to the refer-ence, from which the condition lead sheet was taken from?2) Richness : diversity and interestingness. 3) Humanness :does the piece sound like expressive human performances?4) Correctness : perceived absence of composing or playingmistakes. 5) Structureness : whether there are structural pat-terns such as repeating themes or development of musicalideas. 6) Overall . As the music can be long, the question-naire may take around 30 mins to complete.Table 5 shows the average result from 18 subjects. We seethat REMI + XL performs the best in the conditional setting,yet with only moderate performance gap between the mod-els. In contrast, CP + linear performs (slightly) better con-sistently across the four metrics in the unconditional setting,suggesting it a powerful alternative to REMI + XL. Conclusion In this paper, we have presented a new variant of the Trans-former that processes multiple consecutive tokens at once ata time step. Each individual token is associated with a tokentype, which is exploited by the model to customize its inputand output modules. The proposed model achieves sequencecompression by integrating the embeddings of the tokens,which can be seen as forming a hyperedge over a dynamicgraph. We show that the new Transformer works remarkablywell for modeling music, creating full-song piano of compa-rable perceived quality with a competing Transformer-XLbased model in much shorter training and inference time. It turns out that the REMI + XL model seldom generates [EOS]tokens even when the music is already quite long (e.g., 8 minutes),so we stop it each time when it has generated 7,680 tokens. In the conditional setting, the global structure of the song to begenerated is fairly outlined in the given condition (i.e., the melody).Thus, it seems sufficient for models to learn from short segments. thics Statement Research on automatic music generation may infringe copy-right laws and may raise concerns regarding the role of hu-man musicians in the future. Cares have to be given regard-ing the fair use of existing musical material for model train-ing, and the potential concern of “deepfaking” an existingartist’s style in computer-generated music. Acknowledgement We are grateful to our interns at the Taiwan AI Labs, JoshuaChang for developing the symbolic-domain chord recog-nition algorithm, and Yu-Hua Chen and Hsiao-Tzu Hungfor helping organize the PyTorch code. We also thank theanonymous reviewers for their valuable comments. References Ackley, D. H.; Hinton, G. E.; and Sejnowski, T. J. 1985. Alearning algorithm for Boltzmann machines. Cognitive Sci-ence arXiv preprintarXiv:1809.10853 .B¨ock, S.; Korzeniowski, F.; Schl¨uter, J.; Krebs, F.; and Wid-mer, G. 2016. Madmom: A new Python audio and mu-sic signal processing library. In Proc. ACM Multimedia ,1174–1178.Chen, Y.-H.; Huang, Y.-S.; Hsiao, W.-Y.; and Yang, Y.-H.2020. Automatic composition of guitar tabs by Transformersand groove modeling. In Proc. Int. Soc. Music InformationRetrieval Conf. Child, R.; Gray, S.; Radford, A.; and Sutskever, I. 2019.Generating long sequences with sparse Transformers. arXivpreprint arXiv:1904.10509 .Choi, K.; Hawthorne, C.; Simon, I.; Dinculescu, M.; and En-gel, J. 2020. 
Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proc. Annual Meeting of the Association for Computational Linguistics, 2978–2988.
Donahue, C.; Mao, H. H.; Li, Y. E.; Cottrell, G. W.; and McAuley, J. 2019. LakhNES: Improving multi-instrumental music generation with cross-domain pre-training. In Proc. Int. Soc. Music Information Retrieval Conf., 685–692.
Dong, H.-W.; Chen, K.; McAuley, J.; and Berg-Kirkpatrick, T. 2020. MusPy: A toolkit for symbolic music generation. In Proc. Int. Soc. Music Information Retrieval Conf.
Ens, J.; and Pasquier, P. 2020. MMM: Exploring conditional multi-track music generation with the Transformer. arXiv preprint arXiv:2008.06048.
Feng, Y.; You, H.; Zhang, Z.; Ji, R.; and Gao, Y. 2019. Hypergraph neural networks. In Proc. AAAI, 3558–3565.
Fujishima, T. 1999. Realtime chord recognition of musical sound: A system using Common Lisp. In Proc. International Computer Music Conf., 464–467.
Hawthorne, C.; Elsen, E.; Song, J.; Roberts, A.; Simon, I.; Raffel, C.; Engel, J.; Oore, S.; and Eck, D. 2018a. Onsets and Frames: Dual-objective piano transcription. In Proc. Int. Soc. Music Information Retrieval Conf., 50–57.
Hawthorne, C.; Huang, A.; Ippolito, D.; and Eck, D. 2018b. Transformer-NADE for piano performances. In Proc. Machine Learning for Creativity and Design Workshop.
Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2020. The curious case of neural text degeneration. In Proc. Int. Conf. Learning Representations.
Huang, C.-Z. A.; Vaswani, A.; Uszkoreit, J.; Simon, I.; Hawthorne, C.; Shazeer, N.; Dai, A. M.; Hoffman, M. D.; Dinculescu, M.; and Eck, D. 2019. Music Transformer: Generating music with long-term structure. In Proc. Int. Conf. Learning Representations.
Huang, Y.-S.; and Yang, Y.-H. 2020. Pop Music Transformer: Beat-based modeling and generation of expressive Pop piano compositions. In Proc. ACM Multimedia.
Jhamtani, H.; and Berg-Kirkpatrick, T. 2019. Modeling self-repetition in music generation using generative adversarial networks. In Proc. Machine Learning for Music Discovery Workshop.
Jiang, J.; Wei, Y.; Feng, Y.; Cao, J.; and Gao, Y. 2019. Dynamic hypergraph neural networks. In Proc. IJCAI, 2635–2641.
Jiang, J.; Xia, G. G.; Carlton, D. B.; Anderson, C. N.; and Miyakawa, R. H. 2020. Transformer VAE: A hierarchical model for structure-aware and interpretable music representation learning. In Proc. Int. Conf. Acoustics, Speech and Signal Processing, 516–520.
Katharopoulos, A.; Vyas, A.; Pappas, N.; and Fleuret, F. 2020. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. Machine Learning.
Kazemi, S. M.; Goel, R.; Jain, K.; Kobyzev, I.; Sethi, A.; Forsyth, P.; and Poupart, P. 2020. Representation learning for dynamic graphs: A survey. Journal of Machine Learning Research.
Ke, G.; He, D.; and Liu, T.-Y. 2020. Rethinking positional encoding in language pre-training. arXiv preprint arXiv:2006.15595.
Keskar, N. S.; McCann, B.; Varshney, L. R.; Xiong, C.; and Socher, R. 2019. CTRL: A conditional Transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
Kivelä, M.; Arenas, A.; Barthelemy, M.; Gleeson, J. P.; Moreno, Y.; and Porter, M. A. 2014. Multilayer networks. Journal of Complex Networks.
Lerch, A.; Arthur, C.; Pati, A.; and Gururani, S. 2019. Music performance analysis: A survey. In Proc. Int. Soc. Music Information Retrieval Conf.
Oore, S.; Simon, I.; Dieleman, S.; Eck, D.; and Simonyan, K. 2018. This time with feeling: Learning expressive musical performance. Neural Computing and Applications.
Patel, A. D. 2003. Language, music, syntax and the brain. Nature Neuroscience 6: 674–681.
Payne, C. M. 2019. MuseNet. OpenAI Blog.
Rae, J. W.; Potapenko, A.; Jayakumar, S. M.; Hillier, C.; and Lillicrap, T. P. 2020. Compressive Transformers for long-range sequence modelling. In Proc. Int. Conf. Learning Representations.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research.
Rossi, E.; Chamberlain, B.; Frasca, F.; Eynard, D.; Monti, F.; and Bronstein, M. 2020. Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637.
Uitdenbogerd, A.; and Zobel, J. 1999. Melodic matching techniques for large music databases. In Proc. ACM Multimedia, 57–66.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Proc. Advances in Neural Information Processing Systems, 5998–6008.
Wu, S.-L.; and Yang, Y.-H. 2020. The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures. In Proc. Int. Soc. Music Information Retrieval Conf.
Yang, L.-C.; and Lerch, A. 2020. On the evaluation of generative models in music. Neural Computing and Applications.