Anyone GAN Sing
Shreeviknesh Sankaran, Sukavanan Nanjundan, G. Paavai Anand
Department of Computer Science and Engineering, SRM Institute of Science and Technology, Chennai
[email protected], [email protected], [email protected]
Abstract—The problem of audio synthesis has been increasingly solved using deep neural networks. With the introduction of Generative Adversarial Networks (GAN), another efficient and effective path has opened up to solve this problem. In this paper, we present a method to synthesize the singing voice of a person using a Convolutional Long Short-Term Memory (ConvLSTM) based GAN optimized with the Wasserstein loss function. Our work is inspired by WGANSing by Chandna et al. Our model takes consecutive frame-wise linguistic and frequency features as input, along with singer identity, and outputs vocoder features. We train the model on a dataset of 48 English songs sung and spoken by 12 non-professional singers. For inference, sequential blocks are concatenated using an overlap-add procedure. We evaluate the model using the Mel-Cepstral Distance metric and a subjective listening test with 18 participants.
Index Terms—Generative Adversarial Networks, Wasserstein-GAN, Convolutional-LSTM, Singing Voice Synthesis.
I. INTRODUCTION
The problem of singing voice synthesis is similar to that of Text-to-Speech (TTS) synthesis, but the former is much more complicated than the latter. The complexity mainly arises from trying to mimic the extensive range of pitches and phonemes involved in singing. TTS synthesis is primarily controlled by the words or syllables from the text. On the other hand, singing voice synthesis is controlled by a score component in addition to the syllables from the lyrics of the song. The score component determines the pitch and timing of the syllables from the lyrics; in other words, the score defines the flow of a song.

There are several models (Chandna et al., 2019 [1], Blaauw et al., 2019 [2], Hono et al., 2019 [3], Kaewtip et al., 2019 [4], Lee et al., 2019 [5] and Tamaru et al., 2019 [6]) that have successfully demonstrated the ability to synthesize the singing voices of different test subjects. Our model is inspired by WGANSing: A Multi-Voice Singing Voice Synthesizer Based on Wasserstein-GAN by Chandna et al.

Generative Adversarial Networks (GANs) have had immense success in modeling the distribution of highly complex data and have produced state-of-the-art results in image generation [7], [8]. GANs have also been used for TTS synthesis and other audio generation problems [9], [10]. However, the number of adaptations of GANs in the audio domain is much smaller than in the computer vision domain.

The singing voice can be considered a sequence, as there is a sequential flow of notes throughout a song. A song can be constructed only if there is some connection between any two notes throughout the song. Notes thrown around haphazardly, without any real flow or connection between them, cannot be considered "legitimate" songs, although some people may find them attractive. This connection between notes can be treated as a sequence, and thus the problem of singing voice synthesis can be approached using sequence prediction techniques such as Long Short-Term Memory (LSTM). In this paper, we propose a Convolutional-LSTM (ConvLSTM) based GAN, with an architecture inspired by Chandna et al., to synthesize the singing voice of a person. The choice of LSTMs stems from the fact that they can model and learn long-range dependencies efficiently [11].

Therefore, we present a block-wise generative model trained using the Wasserstein-GAN framework for singing voice synthesis. The block-wise nature, combined with the convolutional network component, enables the model to identify temporal dependencies, much like the inter-pixel dependencies identified by GANs on image datasets.

II. GAN AND WASSERSTEIN-GAN

GANs belong to the generative class of deep learning frameworks. Since their inception, they have been widely used in the computer vision domain to generate synthetic images and videos that are indistinguishable from real samples [12]-[15]. A GAN consists of two networks (adversaries), a generator and a discriminator, which are trained simultaneously. The discriminator is trained to distinguish between synthesized data and real data, whereas the generator is trained to fool the discriminator by synthesizing data that resembles real data. Training a GAN can be formulated as a minimax game [16]: the discriminator tries to maximize its reward, while the generator tries to minimize the discriminator's reward or, in other words, tries to maximize the discriminator's loss. The loss function for a GAN is shown in Eq. (1).
\[ L_{GAN} = \min_{G} \max_{D} \; \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_{z}(z)}[\log(1 - D(G(z)))] \tag{1} \]

where G denotes the generator, D denotes the discriminator, x is a sample from the real distribution and z is the input to the generator, which may be noise or some conditional input drawn from the distribution P_z.

As pointed out by Arjovsky et al., while GANs have been efficient in generating images and videos, the above minimax loss function can cause the GAN to get stuck in the early stages of training, when the job of the discriminator is easy. Further problems, such as vanishing gradients, mode collapse and instability, also arise. To overcome such difficulties, the Wasserstein-GAN (WGAN) can be used [17]. WGAN uses the Earth-Mover distance, given in Eq. (2), to measure the divergence between the real and generated distributions. Moreover, WGANs use a critic instead of a discriminator. The critic does not classify inputs as real or fake; instead, it approximates a distance score between two given distributions (here, the real distribution and the generated distribution).

\[ W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[ \lVert x - y \rVert \right] \tag{2} \]

The critic can be trained to optimality and does not saturate, converging to a linear function. The loss function for WGAN is shown in Eq. (3).

\[ L_{WGAN} = \min_{G} \max_{D} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{z \sim P_z}[D(G(z))] \tag{3} \]

The loss functions for both the critic and the generator become deceptively simple. The critic tries to maximize Eq. (4), i.e., it tries to maximize the difference between its output for real data and its output for synthesized data. The generator tries to maximize Eq. (5), i.e., it tries to maximize the critic's output for fake or synthesized data.

\[ L_C = D(x) - D(G(z)) \tag{4} \]

\[ L_G = D(G(z)) \tag{5} \]

where D(x) represents the critic's output for a real instance, G(z) represents the generator's output given noise or some other conditional input z, and D(G(z)) represents the critic's output for a fake instance.

We use an extension of the GAN model called the Conditional GAN (CGAN), which takes an additional conditional vector as input [18]. Adding this conditional vector controls the output and guides the generator in modeling a probability distribution controlled by the vector. The framework is shown in Fig. 5 in Sec. V. We use the same training algorithm as in the GAN paper by Goodfellow et al.

III. LSTM AND CONVOLUTIONAL-LSTM

Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) architecture that has been used extensively for various applications in Natural Language Processing (NLP), such as speech recognition and semantic parsing [19]. LSTMs are capable of learning order dependence and long-range dependencies in sequence prediction problems [11]. An LSTM unit is composed of a cell, an input gate, an output gate and a forget gate, as shown in Fig. 1. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.
Fig. 1. A basic LSTM cell.
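To make the gating concrete, the following is a minimal NumPy sketch of a single LSTM step; the stacked weight layout and toy dimensions are illustrative assumptions, not taken from our model.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters of the
    input, forget, candidate and output gates (4 * hidden rows)."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b              # stacked pre-activations
    i = sigmoid(z[0:hidden])                # input gate
    f = sigmoid(z[hidden:2 * hidden])       # forget gate
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate cell state
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate
    c = f * c_prev + i * g                  # cell remembers across time
    h = o * np.tanh(c)                      # gated output
    return h, c

# toy usage with random parameters
rng = np.random.default_rng(0)
x, h, c = rng.normal(size=8), np.zeros(16), np.zeros(16)
W, U, b = rng.normal(size=(64, 8)), rng.normal(size=(64, 16)), np.zeros(64)
h, c = lstm_step(x, h, c, W, U, b)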
On the other hand, a Convolutional Neural Network (CNN) is a deep learning architecture that is predominantly used in computer vision applications [20]. A CNN is a regularized version of the multilayer perceptron that can efficiently extract features and learn from them. A CNN has two parts: convolution layers, and a fully connected neural network that uses the output of the convolutions to make predictions. An example CNN is shown in Fig. 2.
Fig. 2. A simple CNN architecture.
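As an illustration of this two-part structure, a toy CNN with a convolutional feature extractor followed by a fully connected prediction head can be sketched as follows; all layer sizes here are arbitrary placeholders, not the configuration of any network in this paper.

import torch
import torch.nn as nn

# A toy CNN: convolutional feature extractor + fully connected head.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                         # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                         # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),               # fully connected prediction head
)
print(cnn(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])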
Convolutional Long Short-Term Memory (ConvLSTM) is an LSTM cell with a convolution operation embedded inside it, as shown in Fig. 3. It is an LSTM architecture designed explicitly for sequence prediction problems with spatial data, such as images or videos [21].

To take advantage of the abilities of both LSTM and CNN, we use ConvLSTM. ConvLSTM networks are capable of learning long-range dependencies and extracting important features from data, both of which are required for singing voice synthesis.

Fig. 3. A ConvLSTM cell.
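A minimal ConvLSTM cell can be sketched by replacing the matrix multiplications of a standard LSTM with convolutions; the class name, channel counts and kernel size below are illustrative assumptions, not our exact configuration.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gate transforms are convolutions, so the hidden
    and cell states keep the spatial layout of the input."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # one convolution produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h_prev, c_prev):
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, g, o = torch.chunk(z, 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# toy usage: batch of 2, 4 input channels, 8 hidden channels, 16x16 grid
cell = ConvLSTMCell(4, 8)
x = torch.randn(2, 4, 16, 16)
h = c = torch.zeros(2, 8, 16, 16)
h, c = cell(x, h, c)

Because the gates are convolutional, the states retain the layout of the input, which lets the cell learn local feature patterns together with temporal dependencies.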
IV. RELATED WORK
The Neural Parametric Singing Synthesizer (NPSS) by Blaauw et al. is a modified version of WaveNet [22] that uses an autoregressive architecture. The model's features are produced by a parametric vocoder that separates the influence of pitch and timbre. This separation allows the model to be trained on comparatively small datasets while producing high-quality results that are comparable to, and sometimes better than, state-of-the-art concatenative methods.

Hono et al. present two methods for singing voice synthesis: one based on a GAN architecture and the other on a conditional GAN architecture. Their approach models inter-frame dependencies, as opposed to the inter-block dependencies modeled by our model. This difference helps our model produce more robust results than the model presented by Hono et al.

WGANSing by Chandna et al., the inspiration for our model, presents a multi-singer singing voice synthesizer. It uses an encoder-decoder schema for the generator and an encoder schema for the discriminator network. The model produces results comparable to those of state-of-the-art models (NPSS in this case). The authors note that synthesis quality could be improved by using the previously predicted block of features as a condition for the current block; we incorporate this idea by using a ConvLSTM-based GAN architecture for singing voice synthesis.
V. PROPOSED SYSTEM
We adopt the same architecture used in the WGANSing paper, which is similar to DCGAN. The main reason for this choice was to establish a baseline and make comparisons between models easier. One main difference between the WGANSing architecture and ours is that our generator network uses ConvLSTM layers instead of CNN layers.

For the generator network, we use an encoder-decoder architecture consisting of 5 ConvLSTM layers each, as shown in Fig. 4. The whole network is similar to the U-Net architecture used for biomedical image segmentation [23]. For the discriminator, we use the encoder block of the generator network alone; it asserts the presence of dependencies within a block.
Fig. 4. The encoder-decoder architecture used in the generator network.
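The following is a compressed sketch of such an encoder-decoder with U-Net-style skip connections. For brevity it uses plain strided and fractionally-strided convolutions where our generator uses ConvLSTM layers, and the depths and channel counts are illustrative, not those of our model.

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """U-Net-like generator skeleton: 5 downsampling encoder stages,
    5 upsampling decoder stages, with skip connections between them."""
    def __init__(self, ch=(1, 16, 32, 64, 128, 256)):
        super().__init__()
        self.enc = nn.ModuleList(
            nn.Conv2d(ch[i], ch[i + 1], 4, stride=2, padding=1)
            for i in range(5))
        self.dec = nn.ModuleList(
            nn.ConvTranspose2d(ch[5 - i] * (1 if i == 0 else 2),
                               ch[4 - i], 4, stride=2, padding=1)
            for i in range(5))

    def forward(self, x):
        skips = []
        for layer in self.enc:                  # encoder halves resolution
            x = torch.relu(layer(x))
            skips.append(x)
        for i, layer in enumerate(self.dec):    # decoder doubles it back
            if i > 0:                           # concatenate matching skip
                x = torch.cat([x, skips[4 - i]], dim=1)
            x = layer(x)
            if i < 4:
                x = torch.relu(x)
        return x

g = EncoderDecoder()
print(g(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 1, 32, 32])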
In the encoder block, we use fractionally-strided convolutions instead of deterministic pooling functions. For example, if a 6x6 pixel image is convolved with a 3x3 kernel at a stride of 3, the resulting image is 2x2 in resolution. The fractionally-strided convolution reverses this process: it first restores the spatial resolution and then performs the convolution. While it is not a mathematical inverse, the operation is still useful in specific encoding mechanisms, and using it increases the model's expressiveness.

Furthermore, the encoder-decoder schema leads to conditional dependence between the features of the generator output within a predicted block. This implies implicit dependence between the features of a single block but not between blocks. Therefore, for inference, we use an overlap-add of consecutive blocks of output features, as sketched below.
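A minimal sketch of this overlap-add step follows; the block length, hop size and the choice of a Hann cross-fade window are illustrative assumptions.

import numpy as np

def overlap_add(blocks, hop):
    """Cross-fade consecutive feature blocks of shape (block_len, n_feats)
    that were predicted with a stride of `hop` frames."""
    block_len, n_feats = blocks[0].shape
    out_len = hop * (len(blocks) - 1) + block_len
    out = np.zeros((out_len, n_feats))
    norm = np.zeros((out_len, 1))
    window = np.hanning(block_len)[:, None]    # tapered cross-fade window
    for k, block in enumerate(blocks):
        start = k * hop
        out[start:start + block_len] += window * block
        norm[start:start + block_len] += window
    return out / np.maximum(norm, 1e-8)        # renormalize the overlaps

# toy usage: 10 blocks of 64 frames with 50% overlap
blocks = [np.random.randn(64, 20) for _ in range(10)]
features = overlap_add(blocks, hop=32)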
Fig. 5. The framework for the proposed model. A conditional vector is inputto the generator. The critic is trained to identify a real sample.
As shown in Fig. 5, the generator network takes as input a conditional vector consisting of random noise, singer identity, phonemes encoded as a one-hot vector, and the f0 contour. Using this conditional GAN, the singing voice of a person is generated.
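One conditional WGAN update following Eqs. (3)-(5) can be sketched as below. The placeholder networks, feature sizes, conditioning layout and the weight clipping from the original WGAN paper [17] are illustrative assumptions, not a faithful reproduction of our implementation.

import torch
import torch.nn as nn

feat_dim, cond_dim = 64, 100  # vocoder features; noise+identity+phonemes+f0
G = nn.Sequential(nn.Linear(cond_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
C = nn.Sequential(nn.Linear(feat_dim + cond_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_c = torch.optim.RMSprop(C.parameters(), lr=5e-5)

def train_step(real, cond):
    # critic: maximize D(x) - D(G(z)) (Eq. 4), so minimize its negation
    fake = G(cond).detach()
    loss_c = -(C(torch.cat([real, cond], 1)).mean()
               - C(torch.cat([fake, cond], 1)).mean())
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    for p in C.parameters():              # weight clipping keeps the critic
        p.data.clamp_(-0.01, 0.01)        # roughly Lipschitz, as in [17]
    # generator: maximize D(G(z)) (Eq. 5), so minimize its negation
    loss_g = -C(torch.cat([G(cond), cond], 1)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# toy usage with random data standing in for real vocoder features
train_step(torch.randn(8, feat_dim), torch.randn(8, cond_dim))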
VI. DATASET
The dataset used is the NUS-48E Sung and Spoken Lyrics Corpus developed at the Sound and Music Computing Laboratory at the National University of Singapore [24]. The corpus is a 169-minute collection of recordings of the sung and spoken lyrics of 48 (20 unique) English songs by 12 non-professional singers, together with a complete set of transcriptions and manual phone-level duration annotations for all recordings of sung lyrics, comprising a total of 25,474 phone instances.

The corpus consists of 12 folders, one for each subject. Each folder contains "sing" and "read" subfolders, which hold 4 sung and corresponding spoken .wav files along with their time-aligned phone-level annotations in .txt files. The .wav and .txt files are converted into .hdf5 (hierarchical data format) files to make them easily accessible in Python. These files contain the phonemes and features of each corresponding .wav and .txt file, so the features can be used as inputs to the model.
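The conversion can be sketched roughly as follows; the file paths, dataset names and the assumed three-column annotation layout (start time, end time, phone) are hypothetical placeholders.

import h5py
import numpy as np
from scipy.io import wavfile

def convert_to_hdf5(wav_path, txt_path, out_path):
    """Bundle one recording and its phone-level annotation into an .hdf5
    file so both can be loaded together during training."""
    rate, audio = wavfile.read(wav_path)
    # assumed annotation line layout: <start_time> <end_time> <phone>
    annots = [line.split() for line in open(txt_path) if line.strip()]
    starts = np.array([float(a[0]) for a in annots])
    ends = np.array([float(a[1]) for a in annots])
    phones = np.array([a[2] for a in annots], dtype="S8")
    with h5py.File(out_path, "w") as f:
        f.create_dataset("audio", data=audio)
        f.attrs["sample_rate"] = rate
        f.create_dataset("phone_start", data=starts)
        f.create_dataset("phone_end", data=ends)
        f.create_dataset("phones", data=phones)

# hypothetical usage for one sung file of one singer
convert_to_hdf5("ADIZ/sing/01.wav", "ADIZ/sing/01.txt", "ADIZ_sing_01.hdf5")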
VII. EVALUATION METHODOLOGY
For objective evaluation, we use the Mel-Cepstral Distance metric [25], as shown in Eq. (6); the results are presented in Tab. I. For subjective evaluation, we asked participants to listen to the songs generated by both models and rate them on Audio Quality, Intelligibility and Overall Score. We compared our model to the WGANSing model, with both trained on the same dataset and for the same number of epochs (750).
\[ \mathrm{MCD} = \frac{10\sqrt{2}}{T \ln 10} \sum_{t=1}^{T} \sqrt{ \sum_{i} \left( C_{t,i} - \hat{C}_{t,i} \right)^{2} } \tag{6} \]

where T is the number of frames and C_{t,i} and \hat{C}_{t,i} are the i-th mel-cepstral coefficients of the reference and synthesized signals at frame t.

We chose a total of 6 songs for the listeners: two songs of each gender without any voice change, two songs of each gender with a voice change within the same gender, and two songs of each gender with a voice change across genders, i.e., a male-voiced song synthesized with a female voice and vice versa.
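For reference, Eq. (6) can be computed per utterance as in the sketch below, assuming two time-aligned matrices of mel-cepstral coefficients; the toy sizes are placeholders.

import numpy as np

def mel_cepstral_distance(C, C_hat):
    """Mean Mel-Cepstral Distance in dB between two time-aligned
    mel-cepstral sequences of shape (T, n_coeffs), as in Eq. (6)."""
    diff = C - C_hat                              # (T, n_coeffs)
    per_frame = np.sqrt((diff ** 2).sum(axis=1))  # Euclidean distance per frame
    return (10.0 * np.sqrt(2.0) / np.log(10.0)) * per_frame.mean()

# toy usage with random coefficients standing in for real features
rng = np.random.default_rng(0)
C, C_hat = rng.normal(size=(200, 24)), rng.normal(size=(200, 24))
print(f"MCD = {mel_cepstral_distance(C, C_hat):.4f} dB")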
VIII. RESULTS

Our study had a total of 18 participants, all of whom were non-native English speakers aged 20-22. The results of the study are shown in Figs. 6, 7 and 8.
Fig. 6. Subjective test results for Audio Quality.
Fig. 7. Subjective test results for Intelligibility.
Fig. 8. Subjective test results for Overall Score.
From the above figures, it can be observed that our model performs slightly better than the WGANSing model on all three attributes. This result is further corroborated by the objective measure presented in Tab. I.

From Figs. 6 and 7, it is also observed that while both models scored similarly on intelligibility, our model performs better than WGANSing in terms of audio quality. This can be mainly attributed to the fact that the songs generated by WGANSing had considerable noise between words and during pauses, whereas no such noise was heard in the songs generated by our model.

It is also observed that the model's performance without voice change was better than its performance with voice change, and performance declined further when the voice change was between different genders. Yet, even with voice change, our model performed better than the WGANSing model.
TABLE I
MCD RESULTS

Song      ConvLSTM model   Base model
MPUR 03   18.7567 dB       21.0440 dB
SAMF 13   14.3638 dB       14.6363 dB
IX. CONCLUSION