Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling
Hao Hao Tan, Dorien Herremans
Singapore University of Technology and Design
{haohao_tan, dorien_herremans}@sutd.edu.sg
ABSTRACT
High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) of human-annotated labels. In this paper, we present a framework that can learn high-level feature representations with a limited amount of data, by first modelling their corresponding quantifiable low-level attributes. We refer to our proposed framework as Music FaderNets, inspired by the fact that low-level attributes can be continuously manipulated by separate "sliding faders" through feature disentanglement and latent regularization techniques. High-level features are then inferred from the low-level representations through semi-supervised clustering using Gaussian Mixture Variational Autoencoders (GM-VAEs). Using arousal as an example of a high-level feature, we show that the "faders" of our model are disentangled and change linearly w.r.t. the modelled low-level attributes of the generated output music. Furthermore, we demonstrate that the model successfully learns the intrinsic relationship between arousal and its corresponding low-level attributes (rhythm and note density), with only 1% of the training set being labelled. Finally, using the learnt high-level feature representations, we explore the application of our framework in style transfer tasks across different arousal states. The effectiveness of this approach is verified through a subjective listening test.
1. INTRODUCTION
We consider low-level musical attributes as attributes that are relatively straightforward to quantify, extract and calculate from music, such as rhythm, pitch, harmony, etc. On the other hand, high-level musical attributes refer to semantic descriptors or qualities of music that are relatively abstract, such as emotion, style, genre, etc. Due to the abstractness and subjectivity of these high-level musical qualities, obtaining labels for them typically
requires human annotation. However, training conditional models on top of these human-annotated labels using supervised learning might result in sub-par performance: firstly, obtaining such labels can be costly, hence the amount of labels collected might be insufficient to train a model that generalizes well [1]; secondly, the annotated labels could have high variance among raters due to the subjectivity of these musical qualities [2, 3].

Instead of inferring high-level features directly from the music sample, we propose to use low-level features as a "bridge" between the music and the high-level features. The relationship between a sample and its low-level features can be learnt relatively easily, as the labels are easier to obtain. In addition, we learn the relationship between the low-level features and the high-level features in a data-driven manner. In this paper, we show that the latter works well even with a limited amount of labelled data. Our work relies heavily on the concept that each high-level feature is intrinsically related to a set of low-level attributes. By tweaking the level of each low-level attribute in a constrained manner, we can achieve a desired change in the high-level feature. This idea is heavily exploited in rule-based systems [4–6]; however, rule-based systems are often not robust enough, as their capabilities are constrained by a fixed set of predefined rules handcrafted by the authors. Hence, we propose an alternative path, which is to learn these implicit relationships with semi-supervised learning techniques.

To achieve the goals stated above, we intend to build a framework which fulfills two objectives:

• Firstly, the model should be able to control multiple low-level attributes of the music sample in a continuous manner, as if controlled by sliding knobs on a console (also known as faders). Each knob should be independent from the others, and should only control the single feature it is assigned to.

• Secondly, the model should be able to learn the relationship between the levels of the sliding knobs controlling the low-level features and the selected high-level feature. This is analogous to learning a preset of the sliding knobs on a console.

We named our model "Music FaderNets", with reference to the musical "faders" and "presets" described above. Achieving the first objective requires representation learning and feature disentanglement techniques. This motivates us to use latent variable models [7], as we can learn separate latent spaces for each low-level feature to obtain disentangled controllability. Achieving the second objective requires the latent space to have a hierarchical structure, such that high-level information can be inferred from low-level representations. This is achieved by incorporating Gaussian Mixture VAEs [8] in our model.
2. RELATED WORK

2.1 Controllable Music Generation
The application of deep learning techniques for music generation has been advancing rapidly [9–13]; however, embedding control and interactivity in these systems remains a critical challenge [10]. Variants of conditional generative models (such as CGAN [14] and CVAE [15]) are used to allow control during generation, and have attained much success, mainly in the image domain. Fader Networks [16] is one of the main inspirations of this work (hence the name Music FaderNets), in which users can modify different visual features of an image using "sliding faders". However, their approach is built upon a CVAE with an additional adversarial component, which is very different from our approach. Recently, controllable music generation has gained much research interest, both on modelling low-level [17–20] and high-level features [21, 22]. Specifically, [18] and [19] each proposed a novel latent regularization method to encode attributes along specific latent dimensions, which inspired the "sliding knob" application in this work.
2.2 Feature Disentanglement

Disentangled representation learning has been widely used across both the visual [23–26] and speech domains [1, 27, 28] to learn disjoint subsets of attributes. Such techniques have also been applied to music in several recent works, both in the audio [29–31] and symbolic domains [32–34]. The discriminator component in our model draws inspiration from both the explicit conditioning component in the EC²-VAE model [33] and the extraction component in the Ext-Res model [34]. We find that most of the work on disentanglement in symbolic music focuses on low-level features, and is done on monophonic music.

This research distinguishes itself from other related work through the following novel contributions:

• We combine latent regularization techniques with disentangled representation learning to build a framework that can control various continuous low-level musical attribute values using "faders", and apply the framework to polyphonic music modelling.

• We show that it is possible to infer high-level features from low-level latent feature representations, even under a weakly supervised scenario. This opens up possibilities to learn good representations for abstract, high-level musical qualities even under data scarcity conditions. We further demonstrate that the learnt representations can be used for controllable generation based on high-level features.
3. PROPOSED FRAMEWORK

3.1 Gaussian Mixture Variational Autoencoders
VAEs [35] combine the power of both latent variable models and deep generative models, hence they provide both representation learning and generation capabilities. Given observations $X$ and latent variables $z$, the VAE learns a graphical model $z \rightarrow X$ by maximizing the evidence lower bound (ELBO) of the marginal likelihood $p(X)$ as below:

$$\mathcal{L}(p, q; X) = \mathbb{E}_{q(z|X)}[\log p(X|z)] - D_{\mathrm{KL}}(q(z|X) \,\|\, p(z))$$

where $q(z|X)$ and $p(z)$ represent the learnt posterior and prior distribution respectively. In vanilla VAEs, $p(z)$ is an isotropic, unimodal Gaussian. Gaussian Mixture VAEs (GM-VAEs) [8] extend the prior to a mixture of $K$ Gaussian components, which corresponds to learning a graphical model with an extra hierarchy of dependency $c \rightarrow z \rightarrow X$. The newly introduced categorical variable $c \in \mathcal{C}$, where $|\mathcal{C}| = K$, is a discrete representation of the observations. Hence, a new distribution $q(c|X)$ is introduced to infer the class of each observation, which enables semi-supervised and unsupervised clustering applications. Following [8], the ELBO of a GM-VAE is derived as:

$$\mathcal{L}(p, q; X) = \mathbb{E}_{q(z|X)}[\log p(X|z)] - \sum_{k=1}^{K} q(c_k|X)\, D_{\mathrm{KL}}(q(z|X) \,\|\, p(z|c_k)) - D_{\mathrm{KL}}(q(c|X) \,\|\, p(c))$$

The original KL loss term from the vanilla VAE is modified into two new terms: (i) the KL divergence between the approximate posterior $q(z|X)$ and the conditional prior $p(z|c_k)$, marginalized over all Gaussian components; (ii) the KL divergence between the cluster inference distribution $q(c|X)$ and the categorical prior $p(c)$.

Figure 1 shows the model formulation of our proposed Music FaderNets. The input $X$ is a sequence of performance tokens converted from MIDI following [12, 13]. Assume that we want to model a high-level feature with $K$ discrete states, which is related to a set of $N$ low-level features. We denote the latent variables learnt for each low-level feature as $z_{1...N}$; the labels for each low-level feature as $y_{1...N}$; and the class inferred from each latent variable as $c_{1...N}$. The joint probability of $X$, $z_{1...N}$ and $c_{1...N}$ is written as:

$$p(X, z_{1...N}, c_{1...N}) = p(X | z_{1...N}) \prod_{i=1}^{N} p(z_i | c_i) \prod_{i=1}^{N} p(c_i)$$

We assume that each categorical prior $p(c_i)$, $i \in [1, N]$, is uniformly distributed, and that the conditional distributions $p(z_i | c_i) = \mathcal{N}(\mu_{c_i}, \mathrm{diag}(\sigma_{c_i}))$ are diagonal-covariance Gaussians with learnable means and constant variances. For each low-level attribute, we learn an approximate posterior $q(z_i | X)$, parameterized by an encoder neural network, which samples a latent code $z_i$ representing the $i$-th low-level feature.

Figure 1. Music FaderNets model architecture.

The latent codes $z_{1...N}$ are then passed through the remaining three components: (1) Discriminator: to ensure that $z_i$ incorporates information about the assigned low-level feature, it is passed through a discriminator represented by a function $d(z_i)$ to reconstruct the low-level feature label $y_i$; (2) Reconstruction: all latent codes are fed into a global decoder network which parameterizes the conditional probability $p(X | z_{1...N})$ to reconstruct the input $X$; (3) Cluster Inference: this component parameterizes the cluster inference probability $q(c | X)$, with $c$ representing the selected high-level feature.
It can be approximated by $q(c|X) \approx \mathbb{E}_{q(z|X)}\, p(c|z)$ [36], where the cluster state is predicted from each latent code $z_i$ instead of from $X$.

To incorporate the "sliding knob" concept, we need to map the change in value of an arbitrary dimension of $z_i$ (denoted as $z^d_i$, shown in Figure 1 as the darkened dimension) linearly to the change in value of the low-level feature label $y_i$. After comparing previous methods on conditioning and regularization [15, 16, 18, 19], we choose to adopt [19], which applies a latent regularization loss term written as:

$$\mathcal{L}_{\mathrm{reg}}(z^d_i, y_i) = \mathrm{MSE}(\tanh(D_{z^d_i}), \mathrm{sign}(D_{y_i}))$$

where $D_{z^d_i}$ and $D_{y_i}$ denote the distance matrices of the values $z^d_i$ and $y_i$ within a training batch, respectively. We provide a detailed comparison study across each proposed method in Section 4.2. Hence, if we define:

$$\mathcal{L}^i_{\phi}(p, q; X) = \begin{cases} \sum_{k=1}^{K} q(c_{i,k}|X)\, D_{\mathrm{KL}}(q(z_i|X) \,\|\, p(z_i|c_{i,k})) + D_{\mathrm{KL}}(q(c_i|X) \,\|\, p(c_i)), & \text{if unsupervised} \\ D_{\mathrm{KL}}(q(z_i|X) \,\|\, p(z_i|c_i)), & \text{if supervised} \end{cases} \tag{1}$$

then the entire training objective can be derived as:

$$\mathcal{L}(p, q; X) = \mathbb{E}_{q(z_1|X) \cdots q(z_N|X)}[\log p(X | z_1, z_2, \ldots, z_N)] - \beta \sum_{i=1}^{N} \mathcal{L}^i_{\phi}(p, q; X) + \sum_{i=1}^{N} \mathcal{L}_{\mathrm{reg}}(z^d_i, y_i) + \mathbb{E}_{q(z_1|X) \cdots q(z_N|X)}[\log p(y_1|z_1) \cdots p(y_N|z_N)] \tag{2}$$

where $\beta$ is the KL weight hyperparameter [24]. The first term in Eq. (2) represents the reconstruction loss. The second, KL loss term (derived from the ELBO of the GM-VAE) corresponds to the cluster inference component, which allows both supervised and unsupervised training settings, depending on the availability of the label $c$. If we omit the cluster inference component, the model conforms to a vanilla VAE, replacing this term with the KL loss term of the VAE. The third term is the latent regularization loss applied during the encoding process. The last term is the reconstruction loss of the low-level feature labels, which corresponds to the discriminator component. All encoders and decoders are implemented with gated recurrent units (GRUs), and teacher forcing is used to train all decoders.
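To make the regularization term concrete, below is a minimal PyTorch sketch of $\mathcal{L}_{\mathrm{reg}}$ (our own illustration based on the formula above; the function name and batch layout are assumptions, not taken from the authors' released code):

```python
import torch
import torch.nn.functional as F

def latent_reg_loss(z_d: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Latent regularization of [19] as used in Eq. (2).

    z_d: shape (B,), values of the regularized dimension z_i^d for a batch.
    y:   shape (B,), corresponding low-level attribute values y_i
         (e.g. rhythm density). Encourages pairwise distances along z_d to
         follow the sign pattern of pairwise attribute distances, so the
         attribute changes monotonically as the "fader" z_d slides.
    """
    # Signed pairwise distance matrices within the batch: D[a, b] = x[a] - x[b].
    D_z = z_d.unsqueeze(1) - z_d.unsqueeze(0)  # (B, B)
    D_y = y.unsqueeze(1) - y.unsqueeze(0)      # (B, B)
    return F.mse_loss(torch.tanh(D_z), torch.sign(D_y))
```

In the full objective of Eq. (2), one such term would be computed per low-level feature and added alongside the reconstruction, KL, and discriminator losses.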
4. EXPERIMENTAL SETUP
In this work, we chose arousal (which refers to the energy level conveyed by the song [37]) as the high-level feature to be modelled. In order to select relevant low-level features, we refer to musicology papers such as [6, 38, 39], which suggest that arousal is related to features including rhythm density, note density, key, dynamics, tempo, etc. Among these low-level features, we focus on modelling the score-level features in this work (i.e., rhythm, note and key).
We use two polyphonic piano music datasets for training: the Yamaha Piano-e-Competition dataset [12], and the VGMIDI dataset [3], which contains piano arrangements of 95 video game soundtracks in MIDI, annotated with valence and arousal values in the range of -1 to 1. The arousal labels are used to guide the cluster inference component of our GM-VAE model using semi-supervised learning. We extract every 4-beat segment from each music sample, with a beat resolution of 4 (quarter-note granularity). Each segment is encoded into event-based tokens following [12], with a maximum sequence length of 100. This results in a total of 103,934 and 1,013 sequences from the Piano-e-Competition and VGMIDI datasets respectively, which are split into train/validation/test sets with a ratio of 80/10/10.

Inspired by [33], we represent each rhythm label, $y_{\text{rhythm}}$, as a sequence of 16 one-hot vectors with 3 dimensions, denoting an onset for any pitch, a holding state, or a rest. The rhythm density value is calculated as the number of onsets in the sequence divided by the total sequence length. Each note label, $y_{\text{note}}$, is represented by a sequence of 16 one-hot vectors with 16 dimensions, each dimension denoting the number of notes being played or held at that time step (we assume a minimum polyphony of 0 and a maximum of 15). The note density value is the average number of notes being played or held per time step. For key, we use the key analysis tool from music21 [40] to extract the estimated global key of each 4-beat segment. The key is represented using a 24-dimensional one-hot vector, accounting for major and minor modes. In this work, we directly concatenate the key vector as a conditioning signal with $z_{\text{rhythm}}$ and $z_{\text{note}}$ as input to the global decoder for reconstruction. To represent arousal, we split the arousal ratings into two clusters ($K = 2$): a high arousal cluster for positive labels, and a low arousal cluster for negative labels. We remove labels annotated within the range [-0.1, 0.1] so as to reduce ambiguity in the annotations.

The hyperparameters are tuned according to the results on the validation set using grid search. The mean vectors of $p(c|z)$ are all randomly initialized with Xavier initialization, whereas the variance vectors are kept fixed with value $e^{-}$. We observe that the following annealing strategy for $\beta$ leads to the best balance between reconstruction and controllability: $\beta$ is set to 0 for the first 1,000 training steps, and is slowly annealed up to 0.2 over the next 10,000 training steps. We set the batch size to 128, all hidden sizes to 512, and the encoded $z$ dimension to 128. The Adam optimizer is used with a learning rate of −.

The proposed Music FaderNets model should meet two requirements: (i) each "fader" independently controls one low-level musical feature without affecting other features (disentanglement), and (ii) the "faders" should change linearly with the controlled attribute of the generated output (linearity). For disentanglement, we follow the definition proposed in [41], which decomposes the concept of disentanglement into generator consistency and generator restrictiveness. Using rhythm density as an example:

• Consistency on rhythm density means that for the same value of $z^d_{\text{rhythm}}$, the value of the output's rhythm density should be consistent.

• Restrictiveness on rhythm density means that changing the value of $z^d_{\text{rhythm}}$ does not affect attributes other than rhythm density (in our case, note density).
• Linearity on rhythm density means that the value of rhythm density is directly proportional to the value of $z^d_{\text{rhythm}}$, which is analogous to a sliding fader.

We evaluate all three of these properties in our experiments. For evaluating linearity, [19] proposed a slightly modified version of the interpretability metric by [42], which includes the following steps: (1) encode each sample in the test set, and obtain the rhythm latent code and the dimension $z^d$ which has the maximum mutual information with regard to the attribute; (2) learn a linear regressor to predict the input attribute values based on $z^d$. The linearity score is then the coefficient of determination ($R^2$) of the linear regressor. However, this method evaluates only the encoder and not the decoder. As we want the sliding knobs to directly impact the output, we argue that the relationship between $z^d$ and the output attributes is more important. Hence, we propose to "slide" the values of the regularized dimension $z^d$ within a given range and decode them into reconstructed outputs. Then, instead of predicting the input attributes given the encoded $z^d$, the linear regressor learns to predict the corresponding output attributes given the "slid" values of $z^d$.

Figure 2. Workflow of obtaining evaluation metrics for "faders" controlling rhythm density.

We demonstrate a single workflow to calculate the consistency, restrictiveness and linearity scores of a given model based on the low-level features (we use rhythm density as the example low-level feature in the discussion below), as depicted in Figure 2. After obtaining the rhythm density latent code for all samples in the training set and finding the minimum and maximum values of $z^d_{\text{rhythm}}$, we "slide" for $T = 8$ steps by calculating $\min(z^d_{\text{rhythm}}) + \frac{t}{T}(\max(z^d_{\text{rhythm}}) - \min(z^d_{\text{rhythm}}))$, with $t \in [1, T]$. This results in a list of values denoted as $[z^d_{\text{rhythm}}]_{1...T}$. Then, we conduct the following steps:

1. Randomly select $M = 100$ samples from the test set, and encode each sample into $z_{\text{rhythm}}$ and $z_{\text{note}}$;
2. Alter the $d$-th element of $z_{\text{rhythm}}$ using the values in $[z^d_{\text{rhythm}}]_{1...T}$, obtaining $[\hat{z}_{\text{rhythm}}]_{m,1...T}$ for each sample $m$;
3. Decode each new rhythm density latent code together with the unchanged note density latent code $z_{\text{note}}$ to get $\hat{X}_{m,1...T}$;
4. Calculate the rhythm density $r_{m,1...T}$ and note density $n_{m,1...T}$ of each reconstructed output;
5. Pair up the new rhythm density latent code with the resulting rhythm density of the output as $T$ training data points $p_m = \{([z^d_{\text{rhythm}}]_t, r_{m,t}) \mid t \in [1, T]\}$ for a linear regressor.

The final evaluation scores are then calculated as follows:

$$\text{Consistency score} = 1 - \frac{1}{T}\sum_{t=1}^{T}\sigma(r_{1...M,t}) \tag{3}$$

$$\text{Restrictiveness score} = 1 - \frac{1}{M}\sum_{m=1}^{M}\sigma(n_{m,1...T}) \tag{4}$$

$$\text{Linearity score} = R^2(\mathcal{M}(p_{1...M})) \tag{5}$$

where $\sigma(\cdot)$ denotes the standard deviation and $\mathcal{M}$ denotes the linear regressor model.

Table 1. Experimental results (conducted on the Yamaha dataset test split) on the controllability of low-level features (rhythm density and note density) using disentangled latent variables. Bold marks the best performing model.
In other words, consistency calculates the average standard deviation across all output rhythm density values given the same $z^d_{\text{rhythm}}$, whereas restrictiveness calculates the average standard deviation across all output note density values given a changing $z^d_{\text{rhythm}}$. In a perfectly disentangled and linear model, the consistency, restrictiveness and linearity scores should all equal 1; higher scores indicate better performance.
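To illustrate Eqs. (3)–(5), the sketch below computes the three scores from decoded outputs (a hypothetical helper under our assumptions: `r` and `n` are M × T arrays of output rhythm and note densities collected in steps 1–4 above, and `z_vals` holds the T slid values of $z^d_{\text{rhythm}}$; all names are ours):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def consistency_score(r: np.ndarray) -> float:
    # Eq. (3): mean std of output rhythm density across samples,
    # taken per slid value of z_d (along the sample axis).
    return 1.0 - np.std(r, axis=0).mean()

def restrictiveness_score(n: np.ndarray) -> float:
    # Eq. (4): mean std of output note density across the slid values,
    # taken per sample (along the T axis).
    return 1.0 - np.std(n, axis=1).mean()

def linearity_score(z_vals: np.ndarray, r: np.ndarray) -> float:
    # Eq. (5): R^2 of a linear regressor predicting the output rhythm
    # density from the slid z_d values, pooled over all M samples.
    M, T = r.shape
    X = np.tile(z_vals, M).reshape(-1, 1)  # (M*T, 1) slid z_d values
    y = r.reshape(-1)                      # matching output densities
    reg = LinearRegression().fit(X, y)
    return reg.score(X, y)                 # coefficient of determination
```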
5. EXPERIMENTS AND RESULTS
We compare the evaluation scores of our proposed model, using both a vanilla VAE (omitting the cluster inference component) and a GM-VAE, with several models proposed in related work on controllable synthesis: CVAE [15], Fader Networks [16], GLSR [18] and Pati et al. [19]. We repeat the above steps for 10 runs for each model and report the mean and standard deviation of each score. Table 1 shows the evaluation results. Overall, our proposed models achieve good all-round performance on every metric compared to the other models; especially in terms of linearity, models that use [19]'s regularization method largely outperform the others. Our model shares similar results with [19]; however, compared to their work, we encode a multi-dimensional, regularized latent space instead of a single dimension value for each low-level feature, thus allowing more flexibility. Our model can also be used for "generation via analogy" as mentioned in EC²-VAE [33], by mix-matching $z_{\text{rhythm}}$ from one sample with $z_{\text{note}}$ from another. Moreover, the feature latent vectors can be used to infer interpretable and semantically meaningful clusters.

Figure 3 visualizes the rhythm and note density latent spaces learnt by the GM-VAE using t-SNE dimensionality reduction. We observe that both spaces successfully learn a Gaussian-mixture space with two well-separated components, which correspond to the high and low arousal clusters, even though the model was trained with only around 1% of labelled data. We also find that the regularized $z^d$ values capture the overall trend of the actual rhythm and note density values. Interestingly, the model learns the implicit relationship between high/low arousal and the corresponding levels of rhythm/note density. From Figure 3, we observe that the high arousal cluster corresponds to higher rhythm density and lower note density, whereas the low arousal cluster corresponds to lower rhythm density and higher note density. This is reasonable, as music segments with high arousal often consist of fast running notes and arpeggios, played one note at a time, whereas music segments with low arousal often exhibit a chordal texture with more sustained notes and relatively less melodic activity.

Figure 3. Visualization of the rhythm (top) and note (bottom) density latent spaces in the GM-VAE. Each column is colored in terms of: (left) original density values, (middle) regularized $z^d$ values, (right) arousal cluster labels (0 refers to low arousal and 1 refers to high arousal).

To further inspect the importance of using low-level features, we train a separate GM-VAE model with only one encoder (without the discriminator component), which encodes only a single latent vector for each segment. The model is trained to infer the arousal label from this single latent vector, similarly in a semi-supervised manner, and the hyperparameters are kept the same. From Figure 4, we can observe that the latent space learnt without using low-level features is not well-segregated into two separate components, suggesting that the right choice of low-level features helps the learning of a more discriminative and disentangled feature latent space.

The major advantage demonstrated by the results above is that, by carefully choosing low-level features supported by domain knowledge, semi-supervised (or weakly supervised) training can be leveraged to learn interpretable representations that capture implicit relationships between high-level and low-level features, overcoming the
difficulties mentioned in the introduction. This is an important insight for learning representations of abstract musical qualities under label scarcity conditions in the future.

Figure 4. Arousal cluster visualization of the GM-VAE with (left) and without (right) using low-level features.

Figure 5. Examples of arousal transfer on music samples.
Utilizing the learnt high-level feature representations enables the application of feature style transfer. Following [29], given the means of each Gaussian component, $\mu_{\text{arousal}=0}$ and $\mu_{\text{arousal}=1}$, the "shifting vector" from high arousal to low arousal is $s_{\text{low\_shift}} = \mu_{\text{arousal}=0} - \mu_{\text{arousal}=1}$, and vice versa. To shift a music segment from high to low arousal, we modify the latent codes by $z'_{\text{rhythm}} = z_{\text{rhythm}} + s_{\text{low\_shift}}$ and $z'_{\text{note}} = z_{\text{note}} + s_{\text{low\_shift}}$. Both new latent codes $z'_{\text{rhythm}}$ and $z'_{\text{note}}$ are fed into the global decoder for reconstruction. For cases where $c_{\text{rhythm}} \neq c_{\text{note}}$, we choose to perform shifting only on the latent codes which do not lie within the target arousal cluster. Figure 5 shows several examples of arousal shift performed on given music segments. We can observe that the shift is clearly accompanied by the desired changes in rhythm density and note density, as mentioned in Section 5.1. More examples are available online at https://music-fadernets.github.io/.

We also conducted a subjective listening test to evaluate the quality of the arousal shift performed by Music FaderNets. We randomly chose 20 music segments from our dataset, and performed a low-to-high arousal shift on 10 segments and a high-to-low arousal shift on the other 10. Each subject listened to the original sample and then the transformed sample, and was asked (1) whether the arousal level changed after the transformation, and (2) how well the transformed sample sounds in terms of rhythm, melody, harmony and naturalness, each on a Likert scale of 1 to 5.

Figure 6. Subjective listening test results. Left: heat map of annotated arousal level change against actual arousal level change. Right: bar plot of opinion scores for each musical quality, with 95% confidence intervals.

A total of 48 subjects participated in the survey. We found that 81.45% of the responses agreed with the actual direction of the arousal level change applied by the model. This shows that our model is capable of shifting the arousal level of a piece to a desired state. From the heat map shown in Figure 6, we observe that shifting from high to low arousal has a higher rate of agreement (92.5%) than shifting from low to high arousal (70.41%). Meanwhile, the mean opinion scores for rhythm, melody, harmony and naturalness were 3.53, 3.39, 3.41 and 3.33 respectively, showing that the quality of the generated samples is generally above a moderate level.
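For reference, the arousal shift described at the beginning of this section reduces to adding the difference of learnt component means to each latent code. A minimal sketch, assuming the component means are stored per feature space in dictionaries (all names are hypothetical, not from the released code):

```python
import torch

def shift_arousal(z_rhythm, z_note, mu_rhythm, mu_note, target: int):
    """Shift latent codes towards the target arousal cluster.

    target: 0 for low arousal, 1 for high arousal.
    mu_rhythm / mu_note: dicts mapping cluster id -> learnt Gaussian
    component mean of the corresponding latent space.
    """
    source = 1 - target
    # Shifting vector per feature space, e.g. s_low_shift = mu[0] - mu[1].
    z_rhythm = z_rhythm + (mu_rhythm[target] - mu_rhythm[source])
    z_note = z_note + (mu_note[target] - mu_note[source])
    # Both shifted codes are then fed to the global decoder; per the paper,
    # a code that already lies in the target cluster is left unshifted.
    return z_rhythm, z_note
```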
6. CONCLUSION AND FUTURE WORK
We propose a novel framework called Music FaderNets, which can generate new variations of music samples by controlling the levels ("sliding knobs") of low-level attributes, trained with latent regularization and feature disentanglement techniques. We also show that the framework is capable of inferring high-level feature representations ("presets", e.g. arousal) on top of latent low-level representations by utilizing the GM-VAE framework. Finally, we demonstrate the application of the learnt high-level feature representations to perform arousal transfer, which was confirmed in a user experiment. The key advantage of this framework is that it can learn interpretable mixture components that reveal the intrinsic relationship between low-level and high-level features using semi-supervised learning, so that abstract musical qualities can be quantified in a more concrete manner with a limited amount of labels.

While the strength of the arousal transfer is gradually increased, we find that the identity of the original piece also gradually shifts. A recent work on text generation using VAEs [43] observed a similar trait and attributed its cause to the "latent vacancy" problem through topological analysis. A possible solution is to adopt the Constrained-Posterior VAE [43], which we aim to explore in future work. Future work will also focus on applying the framework to other sets of abstract musical qualities (such as valence [37], tension [44], etc.), and on extending the framework to model multi-track music of longer duration to produce more complete music. Source code is available at: https://github.com/gudgud96/music-fader-nets.

7. ACKNOWLEDGEMENTS
We would like to thank the anonymous reviewers for their constructive reviews. We also thank Yin-Jyun Luo for the insightful discussions on GM-VAEs. This work is supported by MOE Tier 2 grant no. MOE2018-T2-2-161 and SRG ISTD 2017 129. The subjective listening test was approved by the Institutional Review Board under SUTD-IRB 20-315. We would also like to thank the volunteers who took the subjective listening test.
8. REFERENCES

[1] R. Habib, S. Mariooryad, M. Shannon, E. Battenberg, R. Skerry-Ryan, D. Stanton, D. Kao, and T. Bagby, "Semi-supervised generative modeling for controllable speech synthesis," in International Conference on Learning Representations, 2020.
[2] A. Aljanaki, Y.-H. Yang, and M. Soleymani, "Emotion in music task at MediaEval 2015," in MediaEval, 2015.
[3] L. N. Ferreira and J. Whitehead, "Learning to generate music with sentiment," in Proc. of the International Society for Music Information Retrieval Conference, 2019.
[4] R. Bresin and A. Friberg, "Emotional coloring of computer-controlled music performances," Computer Music Journal, vol. 24, no. 4, pp. 44–63, 2000.
[5] S. R. Livingstone, R. Muhlberger, A. R. Brown, and W. F. Thompson, "Changing musical emotion: A computational rule system for modifying score and performance," Computer Music Journal, vol. 34, no. 1, pp. 41–64, 2010.
[6] S. K. Ehrlich, K. R. Agres, C. Guan, and G. Cheng, "A closed-loop, music-based brain-computer interface for emotion mediation," PLoS ONE, vol. 14, no. 3, 2019.
[7] Y. Kim, S. Wiseman, and A. M. Rush, "A tutorial on deep latent variable models of natural language," arXiv preprint arXiv:1812.06834, 2018.
[8] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, "Variational deep embedding: An unsupervised and generative approach to clustering," arXiv preprint arXiv:1611.05148, 2016.
[9] D. Herremans and C.-H. Chuan, "The emergence of deep learning: new opportunities for music and audio technologies," Neural Computing and Applications, vol. 32, pp. 913–914, 2020.
[10] J.-P. Briot, G. Hadjeres, and F. Pachet, Deep Learning Techniques for Music Generation. Springer, 2019, vol. 10.
[11] D. Herremans, C.-H. Chuan, and E. Chew, "A functional taxonomy of music generation systems," ACM Computing Surveys (CSUR), vol. 50, no. 5, pp. 1–30, 2017.
[12] S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan, "This time with feeling: Learning expressive musical performance," Neural Computing and Applications, pp. 1–13, 2018.
[13] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, "Music transformer: Generating music with long-term structure," in International Conference on Learning Representations, 2019.
[14] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[15] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," in Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[16] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato, "Fader networks: Manipulating images by sliding attributes," in Advances in Neural Information Processing Systems, 2017, pp. 5967–5976.
[17] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, "A hierarchical latent vector model for learning long-term structure in music," in International Conference on Machine Learning, 2018.
[18] G. Hadjeres, F. Nielsen, and F. Pachet, "GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures," in IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2017, pp. 1–7.
[19] A. Pati and A. Lerch, "Latent space regularization for explicit control of musical attributes," in ICML Machine Learning for Music Discovery Workshop (ML4MD), Extended Abstract, Long Beach, CA, USA, 2019.
[20] J. Engel, M. Hoffman, and A. Roberts, "Latent constraints: Learning to generate conditionally from unconditional generative models," in International Conference on Learning Representations, 2017.
[21] S. Dai, Z. Zhang, and G. G. Xia, "Music style transfer: A position paper," in Proc. of the International Workshop on Musical Metacreation, 2018.
[22] K. Choi, C. Hawthorne, I. Simon, M. Dinculescu, and J. Engel, "Encoding musical style with transformer autoencoders," in International Conference on Machine Learning, 2020.
[23] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
[24] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," in International Conference on Learning Representations, 2017.
[25] H. Kim and A. Mnih, "Disentangling by factorising," in International Conference on Machine Learning, 2018.
[26] L. Yingzhen and S. Mandt, "Disentangled sequential autoencoder," in International Conference on Machine Learning, 2018, pp. 5670–5679.
[27] W.-N. Hsu, Y. Zhang, and J. Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in Advances in Neural Information Processing Systems, 2017, pp. 1878–1889.
[28] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in International Conference on Machine Learning, 2018.
[29] Y.-J. Luo, K. Agres, and D. Herremans, "Learning disentangled representations of timbre and pitch for musical instrument sounds using Gaussian mixture variational autoencoders," in Proc. of the International Society for Music Information Retrieval Conference, 2019.
[30] Y.-N. Hung, Y.-A. Chen, and Y.-H. Yang, "Learning disentangled representations for timbre and pitch in music audio," arXiv preprint arXiv:1811.03271, 2018.
[31] Y.-N. Hung, I. Chiang, Y.-A. Chen, Y.-H. Yang et al., "Musical composition style transfer via disentangled timbre representations," in International Joint Conference on Artificial Intelligence, 2019.
[32] G. Brunner, A. Konrad, Y. Wang, and R. Wattenhofer, "MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer," in Proc. of the International Society for Music Information Retrieval Conference, 2018.
[33] R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang, and G. Xia, "Deep music analogy via latent representation disentanglement," in Proc. of the International Society for Music Information Retrieval Conference, 2019.
[34] T. Akama, "Controlling symbolic music generation based on concept learning from domain knowledge," in Proc. of the International Society for Music Information Retrieval Conference, 2019.
[35] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in International Conference on Learning Representations (ICLR), 2014.
[36] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen et al., "Hierarchical generative modeling for controllable speech synthesis," in International Conference on Learning Representations, 2019.
[37] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
[38] A. Gabrielsson and E. Lindström, "The influence of musical structure on emotional expression," 2001.
[39] P. Gomez and B. Danuser, "Relationships between musical structure and psychophysiological measures of emotion," Emotion, vol. 7, no. 2, p. 377, 2007.
[40] M. S. Cuthbert and C. Ariza, "music21: A toolkit for computer-aided musicology and symbolic music data," in Proc. of the International Society for Music Information Retrieval Conference, 2010.
[41] R. Shu, Y. Chen, A. Kumar, S. Ermon, and B. Poole, "Weakly supervised disentanglement with guarantees," in International Conference on Learning Representations, 2020.
[42] T. Adel, Z. Ghahramani, and A. Weller, "Discovering interpretable representations for both deep generative and discriminative models," in International Conference on Machine Learning, 2018, pp. 50–59.
[43] P. Xu, J. C. K. Cheung, and Y. Cao, "On variational learning of controllable representations for text without supervision," in International Conference on Machine Learning, 2020.
[44] D. Herremans and E. Chew, "MorpheuS: Generating structured music with constrained patterns and tension," IEEE Transactions on Affective Computing, 2019.