Hierarchical Recurrent Neural Networks for Conditional Melody Generation with Long-term Structure
Guo Zixun
Information Systems, Technology, and Design
Singapore University of Technology and Design
Singapore
nicolas [email protected]

Dimos Makris
Information Systems, Technology, and Design
Singapore University of Technology and Design
Singapore
dimosthenis [email protected]

Dorien Herremans
Information Systems, Technology, and Design
Singapore University of Technology and Design
Singapore
dorien [email protected]
Abstract—The rise of deep learning technologies has quickly advanced many fields, including that of generative music systems. There exist a number of systems that allow for the generation of good sounding short snippets, yet these generated snippets often lack an overarching, longer-term structure. In this work, we propose CM-HRNN: a conditional melody generation model based on a hierarchical recurrent neural network. This model allows us to generate melodies with long-term structure based on given chord accompaniments. We also propose a novel, concise event-based representation to encode musical lead sheets while retaining the notes' relative position within the bar with respect to the musical meter. With this new data representation, the proposed architecture can simultaneously model the rhythmic as well as the pitch structures in an effective way. Melodies generated by the proposed model were extensively evaluated in quantitative experiments as well as a user study to ensure the musical quality of the output and to evaluate whether they contain repeating patterns. We also compared the system with the state-of-the-art AttentionRNN [1]. This comparison shows that melodies generated by CM-HRNN contain more repeated patterns (i.e., a higher compression ratio) and lower tonal tension (i.e., they are more tonally concise). Results from our listening test indicate that CM-HRNN outperforms AttentionRNN in terms of long-term structure and overall rating.

Index Terms—Hierarchical RNN, Recurrent neural network, RNN, Generative model, Conditional model, Music generation, Event-based representation, Structure
I. INTRODUCTION
At the dawn of computing, the idea of generating music was first conceived by Lady Ada Lovelace, when she said: '[The Engine's] operating mechanism might act upon other things besides numbers [...] Supposing, for instance, that the fundamental relations of pitched sounds in the signs of harmony and of musical composition were susceptible of such expressions and adaptations, the engine might compose elaborate and scientific pieces of music of any degree of complexity or extent.' Ada Lovelace, as quoted in [2]. When computers were first created, it was not long until they were used to generate the first piece of music [3]. The field of computer generated music has grown ever since [4], with great strides being made in recent years due to deep learning technologies [5]. Computer generated music, however, is not yet a part of our daily music listening experience. One potential reason could be the lack of repeating patterns or themes, i.e. a longer-term structure, which is a necessity for the presence of 'earworms' in music [6]. There are two very different types of music generation systems: those that generate symbolic music and those that generate raw audio. Some attempts have been made to generate music directly as raw audio [7], [8]. This remains a challenge, however, due to the disproportionate amount of audio samples in the overall musical structure. Hence, much of the existing music generation research focuses on the symbolic domain, where music is represented as a series of musical events in sequence. Popular data encoding schemes for symbolic music include the piano-roll representation [9], [10], the tonnetz representation [11], word embeddings [12], [13], and event-based representations [14]. In this work, we will focus on the symbolic approach; more specifically, we will generate melodies from their respective chord accompaniments using a novel event-based representation. Compared to existing event-based representations, we explicitly encode bar event information in our representation. This allows our model to understand musical complexities such as meter [15].

Composing melodies from a given set of chords is a task faced by many musicians in the real world, e.g. in pop music composition or jazz improvisation. We are thus motivated to design our model to receive a chord sequence as conditioning input from the users and to generate melodies based on these provided chords. In this way, the users have a certain level of control over the generated melody through the manipulation of the input chord sequence.

Since music data is sequential, researchers have recently focused on developing recurrent neural networks (RNN) and their variants (e.g., long short-term memory (LSTM) / gated recurrent units (GRU)) for music generation due to their memory mechanism [5]. Still, current computer generated music often lacks long-term structure, an essential quality of polished, complete musical pieces with recurring themes [4], [16], [17]. Even when researchers speculate that the generated music may include long-term structure, very rarely is a quantitative measure employed to verify this [11]. By injecting a hierarchical structure between the RNN layers of our proposed model, we allow the model to capture musical patterns on different time scales. In this work, we propose CM-HRNN, a conditional melody generation model based on a hierarchical recurrent neural network. In Sections III and IV, we describe the proposed event representation and the CM-HRNN architecture in detail. We then thoroughly analyze the music generated by CM-HRNN in terms of musical quality and the presence of repeated patterns in Section V. The latter is done by calculating the compression ratio measure [11]. Finally, these results are confirmed in a listening study. In the next section, we first give a brief overview of the related work.

This work is funded by Singapore Ministry of Education Grant no. MOE2018-T2-2-161.

II. RELATED WORK
We will start by giving an overview of existing music generation systems with a focus on long-term structure. Then we will dive into hierarchical models, which have been shown to be effective in capturing long-term dependencies in multiple fields such as natural language processing [18], audio generation [7], [19], and music generation [20], [21].
A. Music generation systems with long-term structure
The problem of generating music with long-term structure has received limited attention. The affective music generation system MorpheuS [22] enforces repeated patterns and structure in music by considering music generation as a combinatorial optimization problem. MorpheuS uses repeated patterns from existing templates as hard constraints during generation. Looking towards machine learning methods, [16] use a Markov model that learns the statistical properties of a musical corpus, but combine this with a variable neighborhood search optimization algorithm to generate music, while hard constraining a larger structure. Similarly, [23] train a convolutional restricted Boltzmann machine (C-RBM), but use simulated annealing as a sampling technique. This allows them to include structural constraints.

Looking at 'pure' neural networks, which do not leverage (often) slower optimization techniques, we find [11], who use a novel tonnetz representation to train an LSTM that is better able to generate polyphonic music with repeated patterns than a similar network with a more traditional piano-roll representation. Two recent RNN-based systems, LookbackRNN and AttentionRNN [1], generate monophonic melodies with long-term structure in an autoregressive way. Both models incorporate a "repetitive label" in the data representation, a way to represent repetition of notes in the neighbouring two bars. LookbackRNN [1] contains a two-layer LSTM network with residual connections between the different time steps, which allows the model to feed note events from the previous 1 to 2 bars to the current time step. The residual connection is able to offset the impact of vanishing gradients and allows the model to generate more repetitive patterns. Similar to LookbackRNN, AttentionRNN [1] also inputs previous event information into the current generation step; however, it applies a learnable attention mask to the previous n events, which decides how much attention to put on each previous event. By explicitly feeding the current generation step the neighbouring events, the model is able to learn when recent musical repetition occurs. Each generation step, however, may not be aware of "the bigger picture".

Another model is MusicVAE [21], which generates (smooth) transitions between two musical fragments with overall long-term structure. In their variational autoencoder, the input music data is first encoded by a bidirectional (Bi-)RNN encoder which generates a latent vector z. The latent vector z is then fed as an initial state to an RNN which generates a series of "conductors", which in turn "conduct" the hierarchical RNN decoder to generate music sequences. Since each "conductor" conditions the generation of multiple steps, the generated music is shown to contain more long-term structure compared to other existing recurrent variational autoencoders (VAEs). Even though the problem domain is different, we can still gain inspiration from the conditioning "conductors" and the hierarchical decoders, which enhance the overall long-term structure in music.

Hierarchical RNN (HRNN) [20] aims to generate monophonic melodies with long-term structure. It first generates bar profiles and beat profiles, which indicate the rhythmic pattern within a bar and a beat, using vanilla LSTMs.
Beat profile generation is directly conditioned by the latent outputs from the upper RNN layer, which generates the bar profile. These two profiles then jointly condition the pitch generation. By combining the generated pitch and rhythmic patterns, the generated monophonic melodies were qualitatively evaluated through multiple listening tests and were shown to outperform the outputs generated by LookbackRNN [1]. The core design of the architecture is to first generate the rhythmic patterns and to condition the pitch generation with both coarse and fine rhythmic patterns afterwards.

In this paper, we propose an architecture similar to HRNN but with different design motivations. We believe that rhythmic and pitch features are equally important and that hierarchical structures should exist in both of these domains. Hence, instead of conditioning the pitch generation with the generated rhythmic patterns from vanilla LSTMs, we apply the hierarchical RNN structure to both rhythmic and pitch latent spaces simultaneously. Compared to HRNN, our proposed CM-HRNN is able to condition the melody generation on provided chords. Moreover, the musical quality of HRNN-generated music was only evaluated through a subjective listening test. In contrast, we evaluate our proposed CM-HRNN with extensive analytical measures as well as a user study.

B. Hierarchical architectures for long-term dependencies
Drawing inspiration from other fields in which long-term dependencies of sequential data have been modelled by RNNs, we find that multiple hierarchical architectures have been proposed for this challenge [18], [19], [24]–[26]. These models try to control the weight update rate within the RNN cells. The RNN weight matrices are separated into different chunks that are updated at different rates: high update rates capture short-term dependencies, whereas low update rates capture long-term dependencies.

These architectures can also be applied to the domain of audio generation. For instance, SampleRNN [7] is able to synthesise realistic sounding audio with a hierarchical RNN structure. Instead of manipulating the RNN weight update rate to capture the temporal and long-term structure of sequences of audio samples, it uses multiple stacks of RNNs, with upper tier RNN layers operating on larger groups of audio samples per step and lower tier RNN layers operating on fewer grouped samples per step. Inspired by SampleRNN, we group our data (i.e., music events) at different resolutions and process these groups of data separately to obtain the coarse-to-fine hierarchical features of the data.

III. EVENT-BASED REPRESENTATION WITH UNDERSTANDING OF METER
We propose a novel data encoding scheme based on [27], which uses two combined one-hot encoded sub-vectors to represent the pitch and duration of music events. Additionally, we extend their representation with another three sub-vectors. As a result, each music event vector consists of the following sub-vectors:
• Pitch: The MIDI standard defines 128 pitches. We add two additional elements to this sub-vector: one to indicate a rest, and one to indicate if a note is sustained or not (i.e., a tied note). This results in a 130-dimensional sub-vector.
• Duration: A 16-dimensional vector which represents duration. Possible values range from a 16th note to a whole note in increments of a 16th note (i.e. the smallest quantization unit).
• Current and Next Chord: There are 12 possible chord roots (from C to B). For each root we set 4 possible chord types (i.e. major, minor, diminished, and dominant 7th), plus the rest symbol. This results in two 49-dimensional vectors. Since the chord information is used to guide the melody generation, both the current chord as well as the next chord are provided as input to the network at each given time.
• Bar: A two-dimensional vector indicating whether or not the event falls on the start of a bar.

An example of these sub-vectors can be found in Figure 1. All sub-vectors are concatenated together; the inclusion of multiple types of data, or viewpoints, such as bar information, is inspired by [28]. An example illustrating the proposed lead sheet encoding is shown in Figure 2. Since bar lines are indicated in the proposed notation, we can calculate the accumulated time information for each event within the bar, as shown in Table I, which is then fed to our proposed model in the bottom generation tier. Explicitly feeding the model the accumulated time information is shown to be effective in terms of duration and bar event prediction in Section V.

Fig. 1: Example of the sub-vectors for the starting C note.
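To make the layout of these sub-vectors concrete, the snippet below packs one event into the proposed 130 + 16 + 49 + 49 + 2 = 246-dimensional multi-hot vector. This is a minimal sketch in Python/NumPy; the helper names and the chord-type ordering are our own illustrative assumptions, not taken from the paper or its code base.

```python
import numpy as np

PITCH_DIM = 130      # 128 MIDI pitches + rest + tie (sustained note)
DUR_DIM = 16         # durations in 16th-note multiples, 1..16
CHORD_DIM = 49       # 12 roots x 4 types + rest
BAR_DIM = 2          # start of bar: yes / no

REST_IDX, TIE_IDX = 128, 129                 # extra pitch symbols
CHORD_TYPES = ["maj", "min", "dim", "dom7"]  # assumed ordering

def chord_index(root, ctype):
    """Map a chord (root 0-11, type string) to 0..47; 48 = rest / no chord."""
    if root is None:
        return CHORD_DIM - 1
    return root * len(CHORD_TYPES) + CHORD_TYPES.index(ctype)

def encode_event(pitch, dur_16ths, cur_chord, next_chord, bar_start):
    """Build one multi-hot event vector as described in Section III."""
    v = np.zeros(PITCH_DIM + DUR_DIM + 2 * CHORD_DIM + BAR_DIM, dtype=np.float32)
    v[pitch] = 1.0                                          # pitch / rest / tie
    v[PITCH_DIM + dur_16ths - 1] = 1.0                      # duration (1..16 sixteenths)
    v[PITCH_DIM + DUR_DIM + chord_index(*cur_chord)] = 1.0  # current chord
    v[PITCH_DIM + DUR_DIM + CHORD_DIM + chord_index(*next_chord)] = 1.0  # next chord
    v[-2 + (0 if bar_start else 1)] = 1.0                   # [start, not-start]
    return v

# Example: a quarter-note C4 (MIDI 60) at the start of a bar over Am -> D7
event = encode_event(pitch=60, dur_16ths=4,
                     cur_chord=(9, "min"), next_chord=(2, "dom7"),
                     bar_start=True)
```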
Each color-coded block denotes a one-hot vector. For illustration purposes, only the event symbol is listed in the color-coded block. Readers can refer to the sub-vector with the same colour in Figure 1 for the corresponding one-hot vector representation.
Fig. 2: Proposed event-based representation for the first three bars of "Autumn Leaves".

TABLE I: Accumulated time information for the first three bars of "Autumn Leaves".

Event number  | 1   | 2  | 3  | 4  | 5   | 6   | 7  | 8  | 9
Start of bar  | yes | no | no | no | yes | yes | no | no | no
Note duration | 1   | 1  | 1  | 1  | 4   | 1   | 1  | 1  | 1
acc_t         |     |    |    |    |     |     |    |    |

Duration is expressed in units of one quarter note here. "acc_t" is the accumulated time.
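The accumulated time information can be derived directly from the duration and bar sub-vectors. Below is a minimal sketch, assuming durations in quarter notes as in Table I; the exact convention (time elapsed before each event, resetting at every bar line) is our assumption, since the acc_t values themselves are not reproduced here.

```python
def accumulated_time(durations, bar_starts):
    """durations: per-event note lengths (quarter notes);
    bar_starts: per-event booleans marking the start of a bar.
    Returns the time elapsed within the current bar before each event."""
    acc, elapsed = [], 0
    for dur, is_bar_start in zip(durations, bar_starts):
        if is_bar_start:
            elapsed = 0            # reset at every bar line
        acc.append(elapsed)
        elapsed += dur
    return acc

# First three bars of "Autumn Leaves" as in Table I
durations = [1, 1, 1, 1, 4, 1, 1, 1, 1]
bar_starts = [True, False, False, False, True, True, False, False, False]
print(accumulated_time(durations, bar_starts))  # [0, 1, 2, 3, 0, 0, 1, 2, 3]
```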
There are several advantages to this representation. Firstly, pitch repetition and duration repetition are mutually inclusive (i.e. they can co-exist or occur separately) [29]. For example, a melodic motif may be repeated several times in a jazz piece with slight rhythmic variations. In other cases, repetitive patterns may exist in the duration vector space but not in the pitch vector space. Hence, using multi-hot encoded data enables RNNs to learn different high-level features and patterns separately. Secondly, this representation can encode data efficiently. Compared to the piano-roll representation, our proposed encoding can easily represent two consecutive notes with the same pitch, whereas one may not be able to differentiate between a long note and two repeated notes in the piano-roll representation [10] unless additional onset information is added to capture the repeated onsets, which could potentially increase the computational cost. Moreover, given that the smallest duration value used in the quantization step is a 16th note in this research, a piano-roll representation would need 32 event vectors to represent two tied whole notes, whereas our proposed representation only needs 2 event vectors: [whole note with a duration of 4 beats, tied note with a duration of 4 beats]. Thirdly, it allows us to incorporate an indication of bar events, which is crucial to musical composition and allows the system to learn the relative positioning within bars. Last but not least, the proposed data representation can easily be extended to include other features (e.g., velocity). This is a major difference between our work and [27].
IV. PROPOSED MODEL: CM-HRNN
A. Model Architecture
Our proposed CM-HRNN architecture consists of multiple hierarchical tiers (see Figure 3). All tiers except for the bottom tier use LSTM cells to process grouped event data. Upper tiers process larger groups of event data per step to capture the long-term dependencies. The latent outputs from the upper tiers, which represent a coarser latent representation of the sequential data, directly condition the lower tiers' generation after proper upsampling. The bottom tier, after receiving fine-to-coarse conditioning vectors from the upper tiers, proceeds to predict future music events. Inspired by [7], the CM-HRNN's bottom tier uses a conv1d operation to process overlapping sliding windows of tier-1 events (tier 1 indicates the bottom tier) and generate the next event. To convert the latent outputs from these sliding windows to music events, we include a predictive network in the bottom tier (see Figure 4). In this paper, we focus on two variants of the proposed architecture: a 2-tier CM-HRNN and a 3-tier CM-HRNN. In theory, we could incorporate as many tiers as desired, but to avoid overly increasing the model's complexity, we restrict the number of tiers to a maximum of three. The architecture of the proposed 3-tier CM-HRNN is shown in Figure 3. By removing the top tier of the 3-tier CM-HRNN, we obtain the architecture of the 2-tier CM-HRNN.

In the 3-tier CM-HRNN, residual connections exist between the top tier and the bottom tier to alleviate the effect of the vanishing gradient. This also allows the bottom generation tier to receive the latent conditioning outputs from all of the upper tiers directly. Additionally, we also feed the accumulated time information to the predictive network in the bottom generation tier. The accumulated time vector has the same representation as the duration vector (see Section III, Table I). Given that the position of each bar line is indicated in our proposed data representation, the model is able to quickly learn the relationship between the accumulated time information, the duration vector, and the bar event vector, which makes the model more aware of relative positioning information.

In the upper tiers of the model, events are grouped into non-overlapping event frames, where each frame contains FS_k events. In the bottom tier, events are grouped into overlapping event frames with a stride of 1 (i.e., sliding windows). Here, k indicates the tier number and FS_k represents the frame size of tier k. These event frames are processed by their respective tier in order. The relationship between each FS_k and its tier is as follows:

FS_{k=1} = FS_{k=2}    (1)
FS_{k+1} mod FS_k = 0    (2)
FS_{k+1} > FS_k, if k > 1    (3)

We group events into event frames to form the input for the different tiers. Here, e_t indicates the t-th event. All e_t's within one square bracket are grouped into one event frame.

frames_k = [e_1 ... e_{FS_k}], [e_{FS_k + 1} ... e_{2·FS_k}], ...,  if k ≠ 1
frames_k = [e_1 ... e_{FS_k}], [e_2 ... e_{FS_k + 1}], ...,         if k = 1    (4)

We define the input for each tier's processing unit (2-layer LSTM or conv1d), i_k, as follows:

i_k = frames_k,                          if k = top tier
i_k = W_f · frames_k + W_o · ô_{k+1},    otherwise    (5)
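To make the frame grouping of Eq. (4) concrete, the following sketch groups a matrix of events into non-overlapping frames for the upper tiers and into stride-1 sliding windows for the bottom tier. The NumPy layout and the truncation of a trailing partial frame are our assumptions.

```python
import numpy as np

def make_frames(events, frame_size, bottom_tier=False):
    """Group a (T, D) event matrix into frames of `frame_size` events.
    Upper tiers (bottom_tier=False): non-overlapping frames.
    Bottom tier (bottom_tier=True): overlapping frames with stride 1."""
    T = len(events)
    if bottom_tier:
        return np.stack([events[t:t + frame_size]
                         for t in range(T - frame_size + 1)])
    n = T // frame_size                      # drop a trailing partial frame
    return events[:n * frame_size].reshape(n, frame_size, -1)

# Toy example with 8 events of dimension 246, FS_3 = 4, FS_2 = FS_1 = 2
events = np.random.rand(8, 246).astype(np.float32)
top_frames    = make_frames(events, 4)                    # shape (2, 4, 246)
mid_frames    = make_frames(events, 2)                    # shape (4, 2, 246)
bottom_frames = make_frames(events, 2, bottom_tier=True)  # shape (7, 2, 246)
```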
Different tiers receive input at different rates; hence, the outputs from the upper tier LSTMs need to be upsampled before they are used as a condition in the lower tier generation. In the formulas below, h indicates the hidden states of the LSTM; o_k^t indicates the intermediate outputs of tier k at the t-th time step; ô_k^t represents the upsampled LSTM outputs; acc_t is the accumulated time information at time step t; ⊕ denotes vector concatenation; and all W are trainable weights. Bias terms and repeated layers are omitted from the equations for simplicity. Residual connections between tiers are implemented as per Eq. 8.

o_k^t, h_k^t = LSTM(i_k^t, h_k^{t-1}),    if k ≠ 1    (6)
ô_k^t = W_upsample · o_k^t,               if k ≠ 1    (7)
o_1^t = (W_i · i_1^t + Σ_{k=2}^{k_top} ô_k^t) ⊕ acc_t    (8)

To predict the next event at the (t+1)-th time step, we input the intermediate output o_1^t to a predictive network as per Fig. 4. p^t, d^t, and b^t represent the pitch, duration, and bar sub-vectors of an event, respectively. Eq. 9 to Eq. 14 represent the predictive network, whereby each FC can consist of multiple layers (as per Fig. 4) and has ReLU activation. For simplicity, we do not include activation information in the equations. We then apply three softmax functions to obtain the probability distribution for each of these three sub-vectors.

p^t = FC_p(o_1^t)    (9)
d^t = FC_d(o_1^t)    (10)
b^t = FC_b(d^t)    (11)
\hat{pitch}^t ∼ Softmax(p^t)    (12)
\hat{duration}^t ∼ Softmax(d^t)    (13)
\hat{bar}^t ∼ Softmax(b^t)    (14)

We calculate the cross-entropy loss for each of the three vectors (pitch, duration, and bar), and sum them to form the final training objective. Different weights α_1, α_2, and α_3 are applied to the pitch loss, duration loss, and bar loss, respectively:

Loss = α_1 · CE(\hat{pitch}^t, pitch^t) + α_2 · CE(\hat{duration}^t, duration^t) + α_3 · CE(\hat{bar}^t, bar^t)    (15)

Fig. 3: Our proposed CM-HRNN architecture.

Fig. 4: Details of the predictive network within CM-HRNN. FC(pitch) represents 2 fully connected layers (130 nodes each). FC(duration) represents 2 fully connected layers (16 nodes each). Finally, FC(bar) is a 1-layer fully connected network. All FCs have ReLU activation.
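As an illustration of Eq. (9)-(15), here is a condensed PyTorch-style sketch of the predictive network and the weighted loss. The framework choice, class names, and every value not stated explicitly in Section IV-B (e.g., the input dimension and the placeholder loss weights) are our assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveNetwork(nn.Module):
    """Maps the bottom-tier output o_1^t (concatenated with acc_t) to pitch,
    duration, and bar logits, following Eq. (9)-(14) and Fig. 4."""
    def __init__(self, in_dim, pitch_dim=130, dur_dim=16, bar_dim=2):
        super().__init__()
        self.fc_pitch = nn.Sequential(nn.Linear(in_dim, 130), nn.ReLU(),
                                      nn.Linear(130, pitch_dim), nn.ReLU())
        self.fc_dur   = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(),
                                      nn.Linear(16, dur_dim), nn.ReLU())
        # The bar head takes the duration logits as its input, as in Eq. (11)
        self.fc_bar   = nn.Linear(dur_dim, bar_dim)

    def forward(self, o_t):
        p = self.fc_pitch(o_t)   # pitch logits
        d = self.fc_dur(o_t)     # duration logits
        b = self.fc_bar(d)       # bar logits, conditioned on duration
        return p, d, b

def total_loss(p, d, b, pitch_target, dur_target, bar_target,
               alphas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three cross-entropy terms, Eq. (15).
    The alpha values here are placeholders, not the paper's settings."""
    a1, a2, a3 = alphas
    return (a1 * F.cross_entropy(p, pitch_target)
            + a2 * F.cross_entropy(d, dur_target)
            + a3 * F.cross_entropy(b, bar_target))

# Toy usage: batch of 8 bottom-tier outputs (256-dim o_1^t ⊕ 16-dim acc_t)
net = PredictiveNetwork(in_dim=272)
p, d, b = net(torch.randn(8, 272))
loss = total_loss(p, d, b,
                  torch.randint(0, 130, (8,)),
                  torch.randint(0, 16, (8,)),
                  torch.randint(0, 2, (8,)))
```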
B. Implementation details

We have implemented several models, which will be evaluated in Section V. All models utilise a two-layer LSTM with 256 nodes as the memory unit. The sliding window in the bottom tier is implemented as a conv1d operation with a stride of 1. The filter for the convolution has a filter size of FS_1 (which varies per experiment) and an output dimension of 256. In the predictive network, we use 2 fully connected (FC) layers (130 nodes each) for pitch prediction, 2 FC layers (16 nodes each) for duration prediction, and 1 FC layer (2 nodes) for bar prediction. All FC layers are equipped with ReLU activation. We set α_1 = 0. , α_2 = 0. , and α_3 = 0. for the loss function. To generate music for our experiments, we provide the first 16 pitch events and all chord events of all songs in the test set as the model input. The sampling temperatures for the pitch, duration, and bar output distributions are set to 0.7, 0.2, and 0.1, respectively. The duration sampling temperature is set low whereas the pitch sampling temperature is set high, because we want to keep the timing correct while giving more freedom to the pitch generation.
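For completeness, a sketch of the temperature-controlled sampling described above: standard softmax sampling with a temperature divisor. How exactly the original code base implements this step is our assumption.

```python
import numpy as np

def sample_with_temperature(logits, temperature):
    """Sample a class index from unnormalised logits after temperature scaling.
    Lower temperature -> sharper, more deterministic sampling."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Dummy logits standing in for the predictive network's outputs
pitch_logits, duration_logits, bar_logits = (np.random.randn(130),
                                             np.random.randn(16),
                                             np.random.randn(2))

# Temperatures from Section IV-B: pitch 0.7, duration 0.2, bar 0.1
pitch_idx = sample_with_temperature(pitch_logits, 0.7)     # freer pitch choice
dur_idx   = sample_with_temperature(duration_logits, 0.2)  # keep timing tight
bar_idx   = sample_with_temperature(bar_logits, 0.1)
```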
V. EXPERIMENTS

A. Experimental setup
We set up several experiments to determine the optimal network architecture, to assess the model's ability to generate high-quality music with structure, and to compare it with a state-of-the-art system, AttentionRNN.

In a first experiment, we test whether the accumulated time information (acc_t) helps the model to generate bar events and thus understand the meter. Hence, we have implemented 2 sets of models: with and without the accumulated time information (acc_t) fed as input. Next, we test the influence of the residual connections and the frame size FS_2 in the 3-tier CM-HRNN on the model's ability to include repeated patterns in the generated music, as well as to keep track of the musical meter. Finally, the best 2- and 3-tier CM-HRNNs were compared to AttentionRNN, both analytically and through a listening test. While it is not always easy to compare different music generation systems, given that they often have slightly different functional tasks or input representations, a comparison with AttentionRNN was facilitated by their use of multi-hot encoded input data, which is similar to our data representation. Below, we first describe our dataset and the pre-processing method, followed by a description of the evaluation metrics and the listening test setup.
All training data (XML format) was parsed from Theorytab [30], an open-source website that provides lead sheets. Each XML file contains a label with the song genre, song sections, note events, and chord events with absolute timing. During the data encoding stage, we changed the absolute timing into delta timing due to the nature of the LSTM cells. For simplicity, we have chosen to include only monophonic melodies with a time signature of 4/4, along with their chords. All songs were transposed to C. We have merged all available sections (i.e., intro, verse, pre-chorus, chorus, and outro) of the same song to make the training sequences long enough. We matched each note event with its corresponding chord and labeled the note event sequences with the bar label. In the end, we obtained 5,507 training musical fragments, which were then randomly split into training, validation, and test sets with ratios of 0.8, 0.1, and 0.1, respectively.
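A minimal sketch of two of these pre-processing steps, transposing a lead sheet to C and performing the 80/10/10 split; music21 and the key-detection approach are our choices, not necessarily the tooling used by the authors.

```python
import random
from music21 import converter, interval, pitch

def transpose_to_c(xml_path):
    """Parse a lead sheet and transpose it so that its detected tonic is C.
    music21 is an assumed library choice; the paper does not specify tooling."""
    score = converter.parse(xml_path)
    detected_key = score.analyze('key')
    shift = interval.Interval(detected_key.tonic, pitch.Pitch('C'))
    return score.transpose(shift)

def split_dataset(fragments, seed=0):
    """Random 80/10/10 train/validation/test split, as in Section V-B."""
    random.Random(seed).shuffle(fragments)
    n = len(fragments)
    return (fragments[:int(0.8 * n)],
            fragments[int(0.8 * n):int(0.9 * n)],
            fragments[int(0.9 * n):])
```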
C. Analytical evaluation measures

1) Compression ratio to measure long-term structure:
We use COSIATEC [31] as a proxy to evaluate the long-term structure contained in the generated music. COSIATEC utilizes a geometric approach, much like zip-file compression, to detect repeated patterns in symbolic music data. The resulting compression ratio indicates how much smaller a musical file can be made by representing it with pattern vectors and their corresponding translation vectors. Hence, this compression ratio reflects the number of repeated patterns that the generated music contains. In the experiments, we only calculate the compression ratio of the generated melodies.
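Conceptually, the compression ratio is the number of notes in a piece divided by the size of its compressed encoding (pattern points plus translation vectors). The sketch below assumes the repeated patterns have already been discovered by COSIATEC and uses a simplified version of Meredith's encoding-size formula.

```python
def compression_ratio(num_notes, tecs):
    """Approximate COSIATEC-style compression ratio.
    `tecs` is a list of (pattern_size, num_translators) pairs for the
    translational equivalence classes found in the piece; each TEC is
    encoded by its pattern plus its non-identity translation vectors.
    This is a simplification of the exact definition used by COSIATEC."""
    compressed_size = sum(p + t - 1 for p, t in tecs)
    return num_notes / compressed_size

# Hypothetical example: 64 notes covered by two repeated patterns
print(compression_ratio(64, [(8, 4), (6, 3)]))  # ≈ 3.37
```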
2) Tension measures:
We use the model for tonal tension by [32], based on the spiral array [33], to measure the amount of tension in the generated music. This model offers three measures for each time frame of music (or cloud): cloud diameter, tensile strain, and cloud momentum. Cloud diameter indicates the tonal dissonance within a cloud, tensile strain measures the tonal distance to the key of the song, and cloud momentum calculates the tonality movement between different clouds. Even though a limited amount of dissonant notes could sound more musically interesting, these could also adversely affect the music quality. By calculating these tension measures we get a direct impression of whether the generated music is tonally consistent.
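As an illustration of one of these measures, the sketch below computes a cloud-diameter-like value by mapping pitches onto Chew's spiral array and taking the largest pairwise distance within a cloud. The spiral parameters and the simplified pitch-class-to-fifths mapping (which ignores enharmonic spelling) are our assumptions; the calibrated tension model of [32] involves additional details omitted here.

```python
import itertools
import numpy as np

# Spiral array radius and per-fifth height; these calibration values are assumptions
R, H = 1.0, (2 / 15) ** 0.5

def spiral_position(fifths_index):
    """Position of a pitch on the spiral array: one quarter turn per perfect fifth."""
    k = fifths_index
    return np.array([R * np.sin(k * np.pi / 2), R * np.cos(k * np.pi / 2), k * H])

# Simplified pitch-class -> index along the line of fifths (wraps to one cycle,
# so enharmonic spelling and distant flats/sharps are not handled exactly)
FIFTHS_INDEX = {pc: (pc * 7) % 12 for pc in range(12)}

def cloud_diameter(midi_pitches):
    """Largest pairwise spiral-array distance among the notes of one 'cloud'."""
    points = [spiral_position(FIFTHS_INDEX[p % 12]) for p in midi_pitches]
    return max((np.linalg.norm(a - b)
                for a, b in itertools.combinations(points, 2)), default=0.0)

print(cloud_diameter([60, 64, 67]))  # C major triad: relatively small diameter
print(cloud_diameter([60, 61, 66]))  # C, Db, F#: larger, more dissonant cloud
```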
D. Listening test setup
An online listening test was conducted to evaluate the proposed model subjectively. Each participant was asked to listen to 12 musical fragments ranging from 13 to 44 seconds, generated by 3 models: 2-tier CM-HRNN, 3-tier CM-HRNN, and AttentionRNN (with an attention size of 32). These snippets of music were placed in random order and the lead sheets were shown during playback. All participants were asked to rate the following questions on a scale from 1 (very poor) to 5 (very good):

1) overall perception of musical quality;
2) coherence of the music;
3) consonance between chord and melody;
4) naturalness of the generated melody in terms of pitch;
5) naturalness of the generated melody in terms of duration.

VI. RESULTS
A. Ability to learn correct bar timing and effectiveness of the accumulated time information
To evaluate the influence of adding the accumulated time information in the bottom generation tier on the predicted duration of each note, as well as on the correct bar placement, we have trained 4 variants of our model (see Table II): a model with 2 tiers and one with 3 tiers, each with and without the accumulated time information added as input. Intuitively, if the model can accurately predict when a new bar starts, and thus partly understands meter, we hope that it is better able to generate meaningful rhythm. The validation loss for models with and without accumulated time information is shown in Figure 5. It is obvious that by adding the accumulated time information, the models' validation loss, both when predicting the bar as well as the duration event, is lower than for the models without the accumulated time information.

Fig. 5: Evaluation of the validation loss of different models, with and without added accumulated time information: (a) bar event validation loss; (b) duration event validation loss. Model configurations are shown in Table II.

Additionally, we calculate the ratio of successfully predicted bar events among all predicted bars (successful bar ratio), i.e., the number of bars with the correct number of beats divided by the total number of bars, from the results generated by each model. We also compare the compression ratios of the pieces generated by each model. The results for each model configuration are shown in Table II. When comparing the successful bar ratios between models with and without added accumulated time information, we see that the former are almost always able to accurately predict when a new bar should start. In other words, they are able to partly understand musical meter. Also, models with accumulated time information achieve a much higher compression ratio compared to those without the accumulated time information, which indicates that the former can generate music with more repeated patterns.

TABLE II: Results of models with and without accumulated time information.
Model           | FS_2 | FS_3 | acc_t | SBR   | CPR
2-tier CM-HRNN  | 16   | n.a. | no    | 91.0% | 1.61
2-tier CM-HRNN  | 16   | n.a. | yes   |       |

acc_t: accum. time information; SBR: successful bar ratio; CPR: compression ratio
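The successful bar ratio reported in Tables II-IV can be computed directly from the generated duration and bar events. A minimal sketch, assuming 4/4 time with durations in sixteenth notes; this is our interpretation of the measure, not the authors' script.

```python
def successful_bar_ratio(durations_16ths, bar_starts, beats_per_bar=16):
    """Fraction of generated bars whose note durations sum to exactly one
    full bar (16 sixteenths in 4/4). Bars are delimited by the bar events."""
    bar_lengths, current = [], 0
    for dur, is_bar_start in zip(durations_16ths, bar_starts):
        if is_bar_start and current > 0:
            bar_lengths.append(current)   # close the previous bar
            current = 0
        current += dur
    if current > 0:
        bar_lengths.append(current)       # close the final bar
    correct = sum(1 for length in bar_lengths if length == beats_per_bar)
    return correct / len(bar_lengths) if bar_lengths else 0.0

# Two correct 4/4 bars followed by one over-full bar -> 2/3
print(successful_bar_ratio([4] * 4 + [16] + [4] * 5,
                           [True] + [False] * 3 + [True, True] + [False] * 4))
```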
F S in 3-tier CM-HRNN To validate the effectiveness of adding residual connectionsin the 3-tier CM-HRNN, we have implemented six modelvariants, some with and without residual connections. To findthe optimal
F S in a 3-tier model setting, we froze F S to be 16 and experimented with different F S . The modelconfigurations and results are shown in Table III. From theresults, given the same F S , it is clear that the residualconnection can increase the overall compression ratio of thegenerated melodies. This is an indication that the generatedmusic might contain more repeated themes and might havea larger overall structure [11]. Meanwhile, the successfulbar ratio remains high for all 3-tier models regardless of F S . From these results, we choose the 3-tier model with F S = 2 , F S = 16 , and with residual connection as ourbest performing model.TABLE III: Comparing different F S in a 2-tier model. F S F S Residual connections SBR CPR2 16 yes 100.0%
SBR: successful bar ratio; CPR: compression ratio
C. Comparison with AttentionRNN
We compare our best performing model with the state-of-the-art model AttentionRNN [1]. This comparison was facilitated by the fact that their input representation is also multi-hot encoded, even though their original representation includes repetitive labels which indicate whether the note event was repeated 1 or 2 bars ago. Secondly, our problem domain is identical: modeling long-term coherence in music generation. Even though other research may share some similarities in terms of model architecture [20], [21], it targets a different problem domain or uses a different data representation, thus making comparison hard.

We implemented two AttentionRNN models with the same LSTM setting: 2 layers (each with 256 nodes). One model has an attention lookback size of 16 and the other has an attention lookback size of 32. In addition to the compression ratio and successful bar ratio, we also compare the tension of the generated music from all these models. We invite the reader to listen to some generated pieces online.

A comparison between the calculated measures for pieces generated by both our proposed CM-HRNN and AttentionRNN is shown in Table IV. From these results, we see that music generated by CM-HRNN has a much higher compression ratio compared to AttentionRNN. The tension values indicate that music generated by AttentionRNN sounds more tense and dissonant, with less clear movements in tonality.

TABLE IV: Analytical measures for both CM-HRNN and AttentionRNN.

Model             | SBR    | CPR  | CD   | TS | CM
CM-HRNN (3t)      | 100.0% |      |      |    |
AttentionRNN (16) | 88.2%  | 1.58 | 2.22 |    |

Meter: SBR; Structure: CPR; Tension: CD, TS, CM. SBR: successful bar ratio; CPR: compression ratio; CD: cloud diameter; TS: tensile strain; CM: cloud momentum
D. Listening test
A total of 41 participants took part in the listening test. The resulting ratings in Table V show that our proposed CM-HRNN outperforms AttentionRNN in terms of overall enjoyment rating and long-term coherence, which again shows the effectiveness of our proposed model in capturing the long-term structure of music data.

TABLE V: Listening test rating results on a scale from 1 to 5.

Model             | Overall Rating | Coherence | Consonance | Pitch naturalness | Duration naturalness
CM-HRNN (3t)      |                |           |            |                   |
AttentionRNN (32) | 2.80           | 2.85      | 2.95       | 2.98              | 2.79
VII. CONCLUSION
We propose a novel conditional hierarchical RNN network, CM-HRNN, to generate melodies conditioned on chords. In addition to using a novel, effective event-based representation that explicitly encodes bar information, CM-HRNN generates musically sound melodies that contain long-term structure. The source code of the CM-HRNN implementation is available online at https://github.com/guozixunnicolas/CM-HRNN. In extensive experiments, both using calculated features as well as a listening test, we show that pieces generated by CM-HRNN have greater tonal stability and more repeated patterns than those generated by AttentionRNN.

REFERENCES

[1] E. Waite, "Generating long-term structure in songs and stories," 2016. [Online]. Available: https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn
[2] D. Herremans and K. Sörensen, "Composing fifth species counterpoint music with a variable neighborhood search algorithm," Expert Systems with Applications, vol. 40, no. 16, pp. 6427–6437, 2013.
[3] L. A. Hiller Jr and L. M. Isaacson, "Musical composition with a high-speed digital computer," in Audio Engineering Society Convention 9, 1957.
[4] D. Herremans, C.-H. Chuan, and E. Chew, "A functional taxonomy of music generation systems," ACM Computing Surveys (CSUR), vol. 50, no. 5, pp. 1–30, 2017.
[5] J.-P. Briot, G. Hadjeres, and F.-D. Pachet, Deep Learning Techniques for Music Generation. Springer, 2020.
[6] J. A. Burgoyne, D. Bountouridis, J. van Balen, and H. Honing, "Hooked: a game for discovering what makes music catchy," in Proc. of the 14th Int. Society for Music Information Retrieval Conf., ISMIR, Curitiba, Brazil, November 4-8, 2013, pp. 245–250.
[7] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," in Proc. of the 5th Int. Conf. on Learning Representations, ICLR 2017.
[8] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, September 2016, p. 125.
[9] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," in Proc. of the 18th Int. Society for Music Information Retrieval Conf., ISMIR, Suzhou, China, October 23-27, 2017, pp. 324–331.
[10] D. Eck and J. Schmidhuber, "A first look at music composition using LSTM recurrent neural networks," Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale, vol. 103, p. 48, 2002.
[11] C.-H. Chuan and D. Herremans, "Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation," in Proc. of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 2159–2166.
[12] C.-Z. A. Huang, D. Duvenaud, and K. Z. Gajos, "ChordRipple: Recommending chords to help novice composers go beyond the ordinary," in Proc. 21st Int. Conf. on Intelligent User Interfaces, 2016, pp. 241–250.
[13] C.-H. Chuan, K. Agres, and D. Herremans, "From context to concept: exploring semantic relationships in music with word2vec," Neural Computing and Applications, vol. 32, no. 4, pp. 1023–1036, 2020.
[14] S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan, "This time with feeling: Learning expressive musical performance," Neural Computing and Applications, vol. 32, no. 4, pp. 955–967, 2020.
[15] W. B. De Haas and A. Volk, "Meter detection in symbolic music using inner metric analysis," in Proc. of the 17th Int. Society for Music Information Retrieval Conf., ISMIR, New York City, United States, August 7-11, 2016, pp. 441–447.
[16] D. Herremans, S. Weisser, K. Sörensen, and D. Conklin, "Generating structured music for bagana using quality metrics based on Markov models," Expert Systems with Applications, vol. 42, no. 21, pp. 7424–7435, 2015.
[17] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, "Music transformer," in Proc. of the 7th Int. Conf. on Learning Representations, ICLR 2019.
[18] J. Chung, S. Ahn, and Y. Bengio, "Hierarchical multiscale recurrent neural networks," in Proc. of the 5th Int. Conf. on Learning Representations, ICLR 2017.
[19] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, "A clockwork RNN," in Proc. of the 31st Int. Conf. on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, vol. 32, pp. 1863–1871.
[20] J. Wu, C. Hu, Y. Wang, X. Hu, and J. Zhu, "A hierarchical recurrent neural network for symbolic melody generation," IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2749–2757, 2019.
[21] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, "A hierarchical latent vector model for learning long-term structure in music," in Proc. of the 35th Int. Conf. on Machine Learning, ICML, Stockholm, Sweden, July 10-15, 2018, vol. 80, pp. 4361–4370.
[22] D. Herremans and E. Chew, "MorpheuS: generating structured music with constrained patterns and tension," IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 510–523, 2017.
[23] S. Lattner, M. Grachten, G. Widmer et al., "Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints," Journal of Creative Music Systems, vol. 2, p. 1, 2018.
[24] J. Schmidhuber, "Learning complex, extended sequences using the principle of history compression," Neural Computation, vol. 4, no. 2, pp. 234–242, 1992.
[25] S. Hihi and Y. Bengio, "Hierarchical recurrent neural networks for long-term dependencies," Advances in Neural Information Processing Systems, vol. 8, pp. 493–499, 1995.
[26] M. Huzaifah and L. Wyse, "MTCRNN: A multi-scale RNN for directed audio texture synthesis," arXiv preprint arXiv:2011.12596, 2020.
[27] F. Colombo, S. P. Muscinelli, A. Seeholzer, J. Brea, and W. Gerstner, "Algorithmic composition of melodies with deep recurrent neural networks," 2016.
[28] D. Conklin and I. H. Witten, "Multiple viewpoint systems for music prediction," Journal of New Music Research, vol. 24, no. 1, pp. 51–73, 1995.
[29] G. Medeot, S. Cherla, K. Kosta, M. McVicar, S. Abdallah, M. Selvi, E. Newton-Rex, and K. Webster, "StructureNet: Inducing structure in generated melodies," in Proc. of the 19th Int. Society for Music Information Retrieval Conf., ISMIR, Paris, France, September 23-27, 2018.
[30] Theorytab. [Online]. Available: https://www.hooktheory.com/theorytab
[31] D. Meredith, K. Lemström, and G. A. Wiggins, "Algorithms for discovering repeated patterns in multidimensional representations of polyphonic music," Journal of New Music Research, vol. 31, no. 4, pp. 321–345, 2002.
[32] D. Herremans, E. Chew et al., "Tension ribbons: Quantifying and visualising tonal tension," in Proc. of Int. Conf. on Technologies for Music Notation and Representation (TENOR), 2016, vol. 2, pp. 8–18.
[33] E. Chew, "The spiral array: An algorithm for determining key boundaries," in Music and Artificial Intelligence (ICMAI). Springer, 2002.