Conditional Hybrid GAN for Sequence Generation
Yi Yu
National Institute of Informatics, Tokyo, Japan [email protected]
Abhishek Srivastava ∗ MIDAS, IIIT-Delhi, India [email protected]
Rajiv Ratn Shah
MIDAS, IIIT-Delhi, India [email protected]
Abstract
Conditional sequence generation aims to instruct the generation procedure by conditioning the model on additional context information, which is a self-supervised learning problem (a form of unsupervised learning whose supervision signal comes from the data itself). Unfortunately, current state-of-the-art generative models have limitations in sequence generation with multiple attributes. In this paper, we propose a novel conditional hybrid GAN (C-Hybrid-GAN) to address this issue. Discrete sequences with triplet attributes are generated separately while conditioned on the same context. Most importantly, a relational reasoning technique is exploited to model not only the dependency inside each attribute sequence during the training of the generator but also the consistency among the attribute sequences during the training of the discriminator. To avoid the non-differentiability problem encountered in GANs during discrete data generation, we exploit the Gumbel-Softmax technique to approximate the distribution of discrete-valued sequences. Through evaluating the task of generating melody (associated with note, duration, and rest) from lyrics, we demonstrate that the proposed C-Hybrid-GAN outperforms existing methods in context-conditioned discrete-valued sequence generation.
∗ Abhishek Srivastava was involved in this work from April to May 2020 during his internship at the National Institute of Informatics, Tokyo, Japan.

Preprint. Under review.

Conditional sequence generation has been a challenging research task in the field of artificial intelligence, falling under conditional discrete sequence generation. Its aim is to develop generative models that can automatically predict a sequence from given context information in a way similar to human creativity. An earlier study [1] showed the feasibility of exploiting a conditional LSTM-GAN for sequence generation with multiple attributes. Although this state-of-the-art model demonstrated meaningful results compared with traditional maximum likelihood estimation (MLE), it fails to accurately model the discrete attributes. On the one hand, the continuous-valued sequence output by the generator of the GAN is not in accordance with the discrete-valued attributes. On the other hand, due to quantization error, the generated attributes could be mapped to improper discrete-valued music attributes, which has a negative impact on sequence generation.

To overcome these disadvantages, in this work the Gumbel-Softmax is exploited to approximate the distribution of discrete-valued sequences. On this basis, a novel conditional hybrid generative adversarial network (C-Hybrid-GAN) is proposed to generate discrete-valued sequences, where three discrete attribute sequences are generated separately conditioned on the same context. We evaluate our generation model on the paired melody-lyrics sequences of [1]. In particular, a relational reasoning technique is exploited to model the dependency inside each music attribute sequence during the training of the generator, as well as the consistency among the three music attribute sequences during the training of the discriminator.
Through extensive experiments, we conclude that the proposed C-Hybrid-GAN outperforms existing lyrics-to-melody generation methods and is capable of generating more natural and plausible melodies.

Conditional sequence generation remains a challenging task, aiming to imitate real sequences conditioned on a specific context input. In this work, we focus on exploiting GANs, with two significant contributions: i) a relational reasoning technique is applied to model both the dependency inside each attribute sequence and the consistency among the three attribute sequences during training; ii) the Gumbel-Softmax technique is utilized to approximate the discrete-valued distribution of attributes.

Generative adversarial networks (GANs) were originally developed to generate continuous data and have been applied successfully to conditional sequence generation, such as lyrics-to-melody [1], text-to-video [2], and dialogue [3] generation. However, GANs are limited in generating discrete sequences due to the non-differentiability of the discrete-valued outputs of the generator. To overcome this disadvantage, existing works follow two research lines: i) policy gradients based on reinforcement learning, and ii) continuous approximation of the discrete distribution.

i) Policy gradients based on reinforcement learning. In MaskGAN [4], the authors propose an actor-critic GAN architecture that uses reinforcement learning to train the generator, where an in-filling technique may alleviate mode collapse. RankGAN [5] uses ranking scores as rewards to learn the generator, which is optimized through the policy gradient method. LeakGAN [6] introduces a mechanism for providing richer information from the discriminator to the generator by exploiting hierarchical reinforcement learning.
SeqGAN [7] models the generator as a stochastic policy in reinforcement learning, which avoids the generator differentiation problem by directly performing policy-gradient updates.

ii) Continuous approximation of the discrete distribution. Instead of applying the standard GAN objective, FM-GAN [8] matches the latent feature distributions of real and synthetic sentences using the feature-mover's distance. In TextGAN [9], the authors utilize a kernelized discrepancy metric over the high-dimensional latent feature distributions of real and synthetic sentences, with the aim of mitigating mode collapse. In ARAE [10], the authors utilize an adversarial autoencoder to transform the discrete data into a continuous latent space for GAN training. In GAN for sequences of discrete elements [11] and RelGAN [12], Gumbel-Softmax approaches are proposed to approximate the discrete-valued distribution with a continuous-valued one.
We propose an end-to-end deep generative model for generating sequences conditioned on context. The proposed C-Hybrid-GAN model is trained by considering the alignment relationship between the attributes of the sequences and their corresponding context. The design contains i) a generator with three independent relational-memory-based conditional sub-networks, and ii) a discriminator based on a single relational memory for long-distance dependency modeling and for providing informative updates to the generator. The Gumbel-Softmax relaxation technique is exploited to train the GAN for generating discrete-valued sequences. In particular, a hybrid structure is used in the adversarial training stage, containing three independent branches for the attributes in the generator and one branch over the concatenated attributes in the discriminator. Relational memory is employed to model not only the dependency inside each attribute sequence during the training of the generator but also the consistency among the three attribute sequences during the training of the discriminator.
The relational memory core (RMC), a relational reasoning technique proposed by Santoro et al. [13], is composed of a fixed set of memory slots and employs multi-head dot product attention (MHDPA), also known as self-attention [14], between the memory slots to enable interaction between them and facilitate long-term dependency modeling. Santoro et al. [13] show empirically that the RMC is better suited than the LSTM for tasks such as language modeling that benefit from relational reasoning across sequential information.

Formally, suppose $M_t$ represents the memory of the RMC module and $x_t$ represents the input at time $t$. Let $H$ be the number of attention heads. For each head $h$, $M_t$ is used to construct queries $Q_t^{(h)} = M_t W_q^{(h)}$, and its combination with $x_t$ is used to construct keys $K_t^{(h)} = [M_t; x_t] W_k^{(h)}$ and values $V_t^{(h)} = [M_t; x_t] W_v^{(h)}$, where $[;]$ denotes row-wise concatenation and $W_q^{(h)}, W_k^{(h)}, W_v^{(h)}$ are weights. An attention weight is computed from $Q_t^{(h)}$ and $K_t^{(h)}$, and $\tilde{M}_{t+1}$ is computed as the product of the attention weight and the values:

$$\tilde{M}_{t+1} = [\tilde{M}_{t+1}^{(1)} : \cdots : \tilde{M}_{t+1}^{(H)}], \qquad \tilde{M}_{t+1}^{(h)} = \mathrm{softmax}\Big(\frac{Q_t^{(h)} (K_t^{(h)})^{T}}{\sqrt{d_k}}\Big) V_t^{(h)} \qquad (1)$$

where $d_k$ is the column dimension of the key $K_t^{(h)}$ and $[:]$ denotes column-wise concatenation. Then, the memory $M_{t+1}$ is updated and the output $o_t$ is computed from $\tilde{M}_{t+1}$ and $M_t$ by

$$M_{t+1} = f_\theta(\tilde{M}_{t+1}, M_t), \qquad o_t = g_\theta(\tilde{M}_{t+1}, M_t) \qquad (2)$$

where $f_\theta$ and $g_\theta$ are parameterized functions consisting of skip connections, a multi-layer perceptron, and gated operations.

The role of the generator network is to generate a sequence given context information. The generator network is composed of three independent relational-memory-based conditional sub-networks.
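Before detailing the sub-networks, the MHDPA step of Eq. (1) can be illustrated in a few lines of NumPy. This is a minimal sketch: the slot counts and weight shapes below are illustrative and are not the paper's configuration, and the gating/MLP of Eq. (2) is omitted.

```python
import numpy as np

def mhdpa(M, x, Wq, Wk, Wv):
    """Multi-head dot-product attention over memory slots (Eq. 1).
    M: (slots, d) memory M_t; x: (1, d) input x_t.
    Wq, Wk, Wv: per-head weight matrices of shape (d, d_k) each.
    Returns the memory proposal M~_{t+1} of shape (slots, H * d_k)."""
    Mx = np.concatenate([M, x], axis=0)            # [M_t; x_t], row-wise concat
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        Q = M @ wq                                 # queries come from memory only
        K = Mx @ wk                                # keys use memory and input
        V = Mx @ wv                                # values use memory and input
        s = Q @ K.T / np.sqrt(K.shape[1])          # scaled dot-product scores
        a = np.exp(s - s.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)          # row-wise softmax
        heads.append(a @ V)                        # per-head attended values
    return np.concatenate(heads, axis=1)           # column-wise concat of heads
```

The gated functions of Eq. (2) would then mix this proposal with $M_t$ to produce the next memory $M_{t+1}$ and the output $o_t$.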
Each sub-network is responsible for generating a sequence of one particular attribute conditioned on the context; for melody generation from lyrics, this is either a pitch sequence $\hat{y}^p_{1,\cdots,T}$, a duration sequence $\hat{y}^d_{1,\cdots,T}$, or a rest sequence $\hat{y}^r_{1,\cdots,T}$. The key component of each sub-network is the RMC module.

Figure 1: Architecture of the conditional hybrid GAN.

We take melody generation from lyrics (Fig. 1) to explain our generation process, which can easily be extended to other scenarios of sequence generation with multiple attributes. Taking the pitch attribute as an example (a similar process applies to the other two music attributes, duration and rest): at each time step $t$, the input to the sub-network is the one-hot encoded representation of the pitch attribute generated during the previous time step, $\hat{y}^p_{t-1} \in \mathbb{R}^{100}$, and the embedded lyrics syllable $x_t$. During the forward pass of the sub-network, $\hat{y}^p_{t-1}$ is passed through a linear layer to obtain a dense representation of the pitch attribute. This dense representation is then concatenated with $x_t$ and passed through a fully connected (FC) layer with ReLU activation. The output of the FC layer and the RMC memory $M_{t-1}$ are then passed through the RMC layer. The RMC output is passed through a linear layer to obtain the output logits $o_t \in \mathbb{R}^{100}$. The Gumbel-Softmax operation is performed on $o_t$ to obtain the one-hot approximation of the pitch attribute $\hat{y}^p_t \in \mathbb{R}^{100}$; $\hat{y}^p_0 \sim \mathrm{Uniform}(0,1)$ is used for the initial time step. Since sequences of length $T = 20$ are utilized in our model, we repeat this process for $T$ steps and generate the pitch sequence $\hat{y}^p = [\hat{y}^p_1, \hat{y}^p_2, \cdots, \hat{y}^p_T]$, where $\hat{y}^p_t \in \mathbb{R}^{100}$, $1 \le t \le T$.
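The single-timestep forward pass just described can be sketched as follows. This is a simplified sketch only: a plain tanh recurrent cell stands in for the RMC layer, and all weight names in `p` are hypothetical.

```python
import numpy as np

def pitch_step(y_prev, x_t, mem, p, rng, beta=2.0):
    """One generator timestep for the pitch sub-network (sketch).
    y_prev: (V,) one-hot pitch from the previous step.
    x_t:    (E,) embedded lyrics syllable.
    mem:    (Dm,) recurrent state standing in for the RMC memory.
    p:      dict of hypothetical weight matrices."""
    e = p["W_emb"] @ y_prev                                    # dense pitch embedding
    h = np.maximum(0.0, p["W_fc"] @ np.concatenate([e, x_t]))  # FC + ReLU
    mem = np.tanh(p["W_rec"] @ np.concatenate([h, mem]))       # stand-in for RMC update
    o = p["W_out"] @ mem                                       # output logits o_t
    g = -np.log(-np.log(rng.uniform(size=o.shape)))            # Gumbel noise g_t
    z = beta * (o + g)
    y = np.exp(z - z.max())
    y /= y.sum()                                               # Gumbel-Softmax sample
    return y, mem
```

Repeating this step $T$ times, feeding each output back in as `y_prev`, yields one attribute sequence.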
The other two sub-networks follow the same procedure to generate a duration sequence $\hat{y}^d = [\hat{y}^d_1, \hat{y}^d_2, \cdots, \hat{y}^d_T]$ and a rest sequence $\hat{y}^r = [\hat{y}^r_1, \hat{y}^r_2, \cdots, \hat{y}^r_T]$.

In the generator network, the embedding dimensions of pitch, duration, and rest are set to 32, 16, and 8, respectively. In the pitch sub-network, the fully connected layer following the embedding layer uses ReLU activation with 64 units; the subsequent RMC layer uses a single memory slot with the head size set to 64, the number of heads set to 2, and the number of blocks set to 2. In the duration sub-network, the fully connected layer uses ReLU activation with 32 units; the RMC layer uses a single memory slot with the head size set to 32, the number of heads set to 2, and the number of blocks set to 2. In the rest sub-network, the fully connected layer uses ReLU activation with 16 units; the RMC layer uses a single memory slot with the head size set to 16, the number of heads set to 2, and the number of blocks set to 2.

Training GANs for the generation of discrete data faces a non-differentiability problem due to the discrete-valued output of the generator. The gradient of the generator loss, $\partial \mathrm{loss}_G / \partial \theta_G$, cannot be back-propagated to the generator via the discriminator, and hence the generator parameters $\theta_G$ cannot be updated. To overcome this issue, we apply the Gumbel-Softmax relaxation technique. Using the generator sub-network responsible for the pitch attribute as an example, we explain the non-differentiability issue in more detail.
In our data, the number of distinct MIDI numbers is 100. At time step $t$, denote the output logits obtained from the generator sub-network by $o_t \in \mathbb{R}^{100}$; then we can obtain the next one-hot encoded pitch attribute $y^p_{t+1}$ by sampling:

$$y^p_{t+1} \sim \mathrm{softmax}(o_t). \qquad (3)$$

Here, $\mathrm{softmax}(o_t)$ represents the multinomial distribution over the set of all possible MIDI numbers. Because the sampling operation in (3) is not differentiable, it implies the presence of a step function at the output of the sub-network. Since the derivative of a step function is 0, $\partial \mathrm{loss}_G / \partial \theta^p_G = 0$; this is the non-differentiability issue mitigated by applying the Gumbel-Softmax relaxation. The Gumbel-Softmax relaxation defines a continuous distribution over the simplex that can approximate samples from a categorical distribution [15][16]. Applying the Gumbel-Softmax, we can reparameterize the sampling in (3) as

$$\hat{y}^p_{t+1} = \sigma(\beta(o_t + g_t)) \qquad (4)$$

where $\sigma$ is the softmax function, $\beta > 0$ is a tunable parameter called the inverse temperature, and $g^{(i)}_t$ is drawn from the i.i.d. standard Gumbel distribution, i.e. $g^{(i)}_t = -\log(-\log U^{(i)}_t)$ with $U^{(i)}_t \sim \mathrm{Uniform}(0,1)$. As $\hat{y}^p_{t+1}$ in (4) is differentiable w.r.t. $o_t$, we can use it instead of $y^p_{t+1}$ as the input to the discriminator.

The discriminator is a relational-memory-based network. Its role is to distinguish between the generated sequence and the real sequence conditioned on the context. We continue with melody generation from lyrics to explain the discriminator with its single relational network. At each time step $t$, the input to the discriminator network is the one-hot encoded representation of each music attribute (either real or generated), i.e.
the pitch attribute $y^p_t$, the duration attribute $y^d_t$, and the rest attribute $y^r_t$, together with the embedded representation of the lyrics syllable $x_t$. Initially, during the discriminator forward pass, each music attribute $y^p_t, y^d_t, y^r_t$ is independently passed through a linear layer to obtain a dense representation. The dense representations are concatenated together with $x_t$ to form a syllable-conditioned triplet of music attributes $\{y^p_t, y^d_t, y^r_t\}$. We then pass this triplet through a dense layer with ReLU activation. The outputs of the dense layer and the RMC memory $M_{t-1}$ are passed through the RMC layer. The RMC output is passed through a linear layer with a single unit to obtain the output logit $o_t \in \mathbb{R}$.

Since the length of the sequences is $T = 20$, we repeat this process for $T$ steps and generate a sequence of output logits $o = [o_1, o_2, \cdots, o_T]$. We then take the mean of $o$ and use it for the loss computation. Let $o$ and $\hat{o}$ represent the output logits obtained when real and generated music attributes conditioned on lyrics are passed through the discriminator, respectively. Then, the discriminator loss is given by

$$\mathrm{loss}_D = -\log\,\mathrm{sigmoid}\Big(\frac{1}{T}\sum_{t=1}^{T} o_t - \frac{1}{T}\sum_{t=1}^{T} \hat{o}_t\Big) \qquad (5)$$

Here, we employ the relativistic standard GAN (RSGAN) [17] loss function. Intuitively, the loss in (5) directly estimates the average probability that a real melody is more realistic than a generated melody. We simply set the generator loss as $\mathrm{loss}_G = -\mathrm{loss}_D$.

In the discriminator network, the embedding dimensions of pitch, duration, and rest are set to 32, 16, and 8, respectively. The fully connected layer following the embedding layer uses ReLU activation with 64 units. The RMC layer following the fully connected layer contains a single memory slot with the head size, the number of heads, and the number of blocks set to 64, 2, and 2, respectively.
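The relativistic loss in Eq. (5) is straightforward to state in code. The sketch below operates directly on per-step logit arrays and is not tied to any particular framework.

```python
import numpy as np

def rsgan_losses(o_real, o_fake):
    """Relativistic standard GAN losses (Eq. 5).
    o_real, o_fake: (T,) per-step discriminator logits for a real and a
    generated sequence; their means over the T steps are compared."""
    diff = o_real.mean() - o_fake.mean()
    loss_d = -np.log(1.0 / (1.0 + np.exp(-diff)))  # -log sigmoid(mean_real - mean_fake)
    loss_g = -loss_d                               # generator loss, as in the paper
    return loss_d, loss_g
```

When the discriminator scores real sequences well above generated ones, `loss_d` approaches 0; when it cannot tell them apart, `loss_d` equals log 2.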
In this section, we discuss the experimental setup and results to demonstrate the feasibility of the proposed C-Hybrid-GAN. To evaluate the proposed architecture, we use Self-BLEU [18] to measure generated sample diversity and maximum mean discrepancy (MMD) [19] to measure generated sample quality. The effect of lyrics conditioning is also investigated. The melody-lyrics aligned dataset used in [1] is utilized in our experiments; it contains 13,251 sequences, each consisting of 20 syllables aligned with the triplet of music attributes $\{y^p_t, y^d_t, y^r_t\}$. The dataset is split into training, validation, and testing sets with the ratio 8:1:1. Conditional hybrid MLE (C-Hybrid-MLE) and conditional LSTM-GAN (C-LSTM-GAN) [1] are compared with the proposed C-Hybrid-GAN.

We use the Adam optimizer and perform gradient clipping if the norm of the gradients exceeds 5. Initially, the generator network is pre-trained with the MLE objective for 40 epochs with a learning rate of 1e-2. We then perform adversarial training for 120 epochs with a learning rate of 1e-2 for both the generator and the discriminator. Each step of adversarial training is composed of a single discriminator step and a single generator step. The batch size is set to 512, and a maximum inverse temperature $\beta_{max} = 1000$ is used during adversarial training.

We use the Self-BLEU [18] score as a means to measure the diversity of melodies generated by the proposed model. The Self-BLEU score ranges between 0 and 1, with a smaller value implying higher sample diversity and hence a lower chance of mode collapse in the GAN model. Intuitively, the Self-BLEU score measures how similar a generated melody sample is to the rest of the generated melody samples. For our proposed model, to compute the Self-BLEU score we first combine the pitch, duration, and rest sequences generated by the generator sub-networks to form a sequence of music attributes, i.e. a melody.
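The two-stage training schedule above can be summarized in a short skeleton. The `generator`/`discriminator` objects and their method names are hypothetical stand-ins, and the exponential annealing of the inverse temperature toward $\beta_{max}$ is an assumed schedule in the style of RelGAN [12], not one stated here.

```python
def train(generator, discriminator, batches,
          pretrain_epochs=40, adv_epochs=120, beta_max=1000.0):
    """Skeleton of the schedule: MLE pretraining, then adversarial
    training with one discriminator step and one generator step per
    batch. All method names are hypothetical."""
    for _ in range(pretrain_epochs):
        for batch in batches:
            generator.mle_step(batch, lr=1e-2)              # MLE pretraining
    for epoch in range(adv_epochs):
        # assumed annealing: beta grows from ~1 toward beta_max
        beta = beta_max ** ((epoch + 1) / adv_epochs)
        for batch in batches:
            discriminator.step(batch, generator, lr=1e-2)   # one D step
            generator.adv_step(batch, discriminator,        # one G step
                               beta=beta, lr=1e-2)
```

The single alternating D/G step per batch mirrors the adversarial procedure described in the text.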
As an example, assume the sequences of pitches, durations, and rests generated by the corresponding sub-networks are $\hat{p} = [\hat{p}_1, \hat{p}_2, \cdots, \hat{p}_T]$, $\hat{d} = [\hat{d}_1, \hat{d}_2, \cdots, \hat{d}_T]$, and $\hat{r} = [\hat{r}_1, \hat{r}_2, \cdots, \hat{r}_T]$, respectively. Then we can represent a melody as $\hat{n} = [\hat{p}_1\hat{d}_1\hat{r}_1, \hat{p}_2\hat{d}_2\hat{r}_2, \cdots, \hat{p}_T\hat{d}_T\hat{r}_T]$.

To compute the Self-BLEU score, we regard one generated melody as the hypothesis and the rest of the generated melodies as the references. We calculate the BLEU score for every generated melody and define the average BLEU score as the value of the Self-BLEU metric.

Figure 2: Training curves of Self-BLEU scores on the testing dataset.

The results of Self-BLEU are shown in Fig. 2. During adversarial training, the Self-BLEU values of our C-Hybrid-GAN architecture peak around 45 epochs, decrease until 100 epochs, and then stabilize. The results indicate that the diversity of generated melody samples improves as Self-BLEU decreases and remains unchanged from 100 epochs to 150 epochs.

The quality of generated melodies is investigated using an MMD [19] unbiased estimator; a smaller MMD value indicates better performance. As shown in Fig. 3, at each epoch, the generator outputs a sequence of pitches, a sequence of durations, and a sequence of rests. Using these generated sequences together with the corresponding real sequences, we compute MMD values for pitch, duration, and rest, respectively. The sum of these three values gives the overall MMD on the testing set. During adversarial training, the sample quality of the proposed C-Hybrid-GAN model, as measured by MMD, first improves with a quick decrease of the MMD value until 50 epochs and then stabilizes, remaining unchanged until 150 epochs.
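For reference, an unbiased squared-MMD estimator can be written as below. The RBF kernel and its bandwidth `gamma` are assumptions for illustration, since the kernel choice is not restated here.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix between the row vectors of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def mmd2_unbiased(X, Y, gamma=0.1):
    """Unbiased estimate of squared MMD between samples X (m, T) and
    Y (n, T); smaller values mean the two distributions are closer."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    np.fill_diagonal(Kxx, 0.0)      # unbiased: drop the i == j terms
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (m * (m - 1))
            + Kyy.sum() / (n * (n - 1))
            - 2.0 * Kxy.sum() / (m * n))
```

Computing this separately for the pitch, duration, and rest sequences and summing the three values yields the overall MMD described above.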
The trend of the MMD value of each of pitch, duration, and rest is consistent with those of the other two and with the overall MMD trend. The results demonstrate that the overall quality of the generated melodies is high, as indicated by the low MMD value.
To show that the generated melodies are efficiently conditioned by the lyrics, we follow the evaluation method proposed in [1], where the effect on the generated note duration and rest duration is studied. The average note duration distance between generated sequences and ground-truth sequences is calculated in Fig. 4(a); the average rest duration distance between generated sequences and ground-truth sequences is calculated in Fig. 4(b). The subscripts rs, rn, and rns respectively denote "random songs", "random notes", and "random notes + songs". In this experiment, $d$ is a real value, which is compared to the distributions of the random variables $d_{rs}$, $d_{rn}$, and $d_{rns}$, with $N$ the number of songs in the testing set and $T = 20$. The three distributions are estimated using 10,000 samples for each random variable.

Figure 3: Training curves of MMD scores on the testing dataset.

Figure 4: Boxplots of the distributions of $d_{rs}$, $d_{rn}$, and $d_{rns}$. (a) Note duration attribute; (b) rest duration attribute. The measured $d$ is highlighted in red in each boxplot.

As the results in Fig. 4 show, in each case $d$ is statistically lower than the mean value, indicating that the generator learns a useful correlation between syllable embeddings and note/rest durations. For a detailed description of the lyrics-conditioning evaluation method, refer to [1].

To study whether C-Hybrid-GAN can generate sequences that resemble the same distribution as the training samples, a quantitative evaluation is performed against existing state-of-the-art approaches following the quantitative measurements in [1], for example, 2-MIDI-number repetitions, 3-MIDI-number repetitions, MIDI number span, the number of unique MIDI numbers, the number of notes without rest, the average rest value within a song, and song length.
More detailed descriptions of these measurements can be found in [1].

Table 1: Metrics evaluation of attributes.
Metric                          Ground Truth   C-LSTM-GAN   C-Hybrid-GAN   C-Hybrid-MLE
2-MIDI numbers repetitions               7.4          7.7            6.6            6.3
3-MIDI numbers repetitions               3.8          2.9            2.8            2.4
MIDI numbers span                       10.8          7.7           12.0           13.5
Number of unique MIDI numbers            5.9          5.1            6.0            6.2
Average rest value within song           0.8          0.6            0.7            1.1
Number of notes without rest            15.6         16.7           15.8           13.2
Song length                             43.3         39.2           43.2           51.9
Sequence generation from context has been an interesting research topic in the area of artificial intelligence. The goal is to design generative models that can automatically infer a sequence from given context in a way similar to humans. However, current state-of-the-art generative models are incapable of generating discrete-valued sequences with multiple attributes when given context.

In this paper, we propose a novel conditional hybrid generative adversarial network for generating sequences from context. Three independent discrete-valued sequences containing different attributes are exploited to learn context-conditioned sequence generation. In particular, a relational reasoning method is employed to learn the dependency inside each independent attribute sequence during the training stage of the generator, as well as the consistency among all independent attribute sequences during the training stage of the discriminator. To avoid the non-differentiability problem in GANs for discrete data generation, we exploit the Gumbel-Softmax to approximate the distribution of discrete-valued sequences. Through extensive experiments on melody generation from lyrics, including the diversity and quality of generated samples, the effect of lyrics-based context conditioning, and comparison with existing works, we show that the proposed C-Hybrid-GAN outperforms existing cutting-edge methods in context-conditioned sequence generation with multiple attributes.