Analysing Deep Learning-Based Spectral Envelope Prediction Methods for Singing Synthesis
Frederik Bous
UMR STMS - IRCAM, CNRS, Sorbonne University
Paris, [email protected]
Axel Roebel
UMR STMS - IRCAM, CNRS, Sorbonne University
Paris, [email protected]
Abstract: We conduct an investigation on various hyper-parameters regarding neural networks used to generate spectral envelopes for singing synthesis. Two perceptive tests are performed, where the first compares two models directly and the other ranks models with a mean opinion score. With these tests we show that, when learning to predict spectral envelopes, 2d-convolutions are superior to the previously proposed 1d-convolutions, and that predicting multiple frames in an iterated fashion during training is superior to injecting noise into the input data. An experimental investigation of whether learning to predict a probability distribution rather than single samples is beneficial was also performed, but turned out to be inconclusive. A network architecture is proposed that incorporates the improvements which we found to be useful, and we show in our experiments that this network produces better results than other state-of-the-art methods.
Index Terms: Singing synthesis, spectral envelopes, deep learning

I. INTRODUCTION
Singing synthesis is concerned with generating audio that sounds like a human singing voice from a musical description such as MIDI or sheet music. Compared with other musical instruments, the human voice has one of the greatest varieties of possible sounds, and the human ear is trained to distinguish the smallest differences in human voices. The human voice is among the first things a human learns and remains very present in our everyday life; therefore almost everyone is a born expert in perceiving human voice. Compared with acoustic instruments, singing not only incorporates melody and articulation, but also text. Compared with speech, singing requires special treatment of the fundamental frequency f0 as well as of timing, which must be aligned to match melody and rhythm respectively. However, due to its similarity to speech synthesis, more precisely text-to-speech (TTS), many methods from TTS may also be applied to singing synthesis. For years, concatenative methods [1], [2] dominated both fields [3]–[6]. While these techniques yield fairly decent results, they are inflexible, and the underlying parametric speech models usually treat all parameters independently, which poses difficulties for the coherency of the parameters. Today, however, fast computation on GPUs and large databases allow treating all parameters at once in a single neural network model, and such models have already been successfully applied to text-to-speech in the past years: The system of [7] uses recurrent neural networks to model the statistical properties needed for its concatenative synthesis. WaveNet [8] goes further and models the raw audio, rather than concatenating existing audio or using a vocoder. Shortly after that, end-to-end systems like Tacotron [9] and Deep Voice [10] were developed, which create raw audio from input on phoneme level or even character level. The authors of [11] used the architecture of [8] to learn input data for a parametric singing synthesizer.

While WaveNet processes data that is inherently one-dimensional (i.e., raw audio), spectral envelopes are generated in [11]. There the input data is thus multidimensional, since the authors use parameters to represent the spectral envelopes. This changes the nature of the data, and former strong points of WaveNet may lose importance whereas some weaknesses may have a more significant impact. This has motivated our investigation into alternative network topologies and training strategies, which finally has led to an improved synthesis model.

We found that, contradicting the assumptions in [11], 2d-convolutions yield better perceived audio while reducing the required number of trainable parameters. We also observe that learning by predicting multiple frames successively is superior to learning with additive noise at the input. A clear benefit from predicting parametric distributions rather than samples explicitly could not be found. As a result we propose our own network for predicting spectral envelopes.

The paper is structured as follows: we first introduce our network in Section II and discuss its differences to existing systems in Section III. The experimental setup is explained in Section IV and we present the results from our perceptive tests in Section V.

II. PROPOSED NETWORK ARCHITECTURE
We aim to build a system for composers and professionals who wish to use synthetic singing voice in their compositions and applications. For us it is thus very important to keep a lot of flexibility in the system. While making the application easy by automating obvious decisions, there should be as much ability to tweak all kinds of parameters as possible. Therefore end-to-end systems like Tacotron 2 [9], where only the raw text is used as input, raw audio comes out as output, and all other properties are only implicitly included in the model, if at all, do not fit our needs.

Fig. 1. Schematic layout of the network. Blue stacks denote stacks of layers, where 3 × 4 means three stacks of four layers each and 1 × 6 means one stack with six layers. In each stack the dilation rate is doubled in each layer, starting with a dilation rate of 1. The block z^-1 denotes a delay of one time step. Concatenation is done in the feature dimension.

The role of the fundamental frequency f0 is very different in singing synthesis as compared to speech synthesis. In speech, the f0-curve follows only few constraints but needs to be coherent with the other parameters. Learning it implicitly makes sense for end-to-end text-to-speech applications as it does not carry much information, but coherence with the other parameters is important. In singing, the f0-curve is the parameter responsible for carrying the melody, but it also carries musical style and emotion [12]. It is therefore important to model it explicitly, which can be achieved with, e.g., B-splines [13], so that it can still be tweaked by hand to fit the needs of the particular application.

While systems like WaveNet [8] operate on raw audio, such architectures require very large datasets, which are currently not available for singing voice. This is on the one hand due to less funding and on the other hand because recording proper singing requires even more work, as professional singers cannot sing as long in one session as a professional speaker could speak.

In our application we use a vocoder model for singing voice synthesis. We use an improved version of the SVLN vocoder [14], [15], which is used to create singing voice from the modified parametric representation of the singing signal stored in a singing voice database. In this context we aim to use a neural network to provide spectral envelopes that fit the local context (phoneme, f0, loudness) to the SVLN vocoder, which is then responsible for generating the corresponding source signal.
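To make the idea of an explicitly editable f0-curve concrete, the following sketch evaluates a cubic B-spline f0-curve with scipy; the control points, knot layout and note duration are invented for illustration and are not taken from [13].

```python
import numpy as np
from scipy.interpolate import BSpline

# Hypothetical example: a cubic B-spline f0-curve over a 2-second note.
# Control points (in Hz) and knots are invented for illustration only.
degree = 3
control_f0 = np.array([220.0, 228.0, 222.0, 225.0, 223.0, 210.0])   # Hz
n_ctrl = len(control_f0)
# Clamped knot vector so the curve starts/ends at the first/last control point.
knots = np.concatenate(([0.0] * degree,
                        np.linspace(0.0, 2.0, n_ctrl - degree + 1),
                        [2.0] * degree))
f0_curve = BSpline(knots, control_f0, degree)

t = np.linspace(0.0, 2.0, 400)   # time axis in seconds
f0 = f0_curve(t)                 # editable f0 trajectory in Hz
```

Shifting a single control point locally bends the pitch contour without touching the rest of the curve, which is exactly the kind of manual tweaking the system is meant to allow.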
A. Input data

Training data has been obtained from our own dataset of singing voice, which was originally created for a concatenative singing synthesizer. It consists of about 90 minutes of singing from a tenor voice. From this database we extract the spectral envelopes as well as the phoneme sequences, f0-curves and loudness-curves as control parameters.

Phonemes were aligned with [16] and manually adjusted. Spectral envelopes are extracted from the audio with an improved version of the true envelope method [17]. The loudness curve is extracted using a very simple implementation of the loudness model of Glasberg et al. [18]; the f0-curve is extracted with the pitch estimator of [19]. All data is given at the same frame rate (a fixed step size).

The spectral envelopes are represented by 60 log Mel-frequency spectral coefficients [20], such that we can treat the spectral envelope sequence as a 2d spectrogram with Mel-spaced frequency bins. To obtain 60 bins while keeping a good resolution in the lower bins, we consider only the lower part of the frequency range.
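As a rough illustration of this representation, the sketch below converts linear-frequency spectral envelopes into 60 log-Mel bins with a librosa Mel filterbank; the sample rate, FFT size and upper frequency limit are placeholder values, since the exact figures are not restated here.

```python
import numpy as np
import librosa

# Assumed analysis parameters; only the 60 mel-spaced bins and the use of
# a reduced upper frequency limit come from the text.
sr, n_fft, n_mels, fmax = 44100, 2048, 60, 8000

# `envelopes` would hold linear-amplitude spectral envelopes from the
# true-envelope estimator, shape (n_frames, n_fft // 2 + 1).
envelopes = np.abs(np.random.randn(100, n_fft // 2 + 1))   # placeholder data

mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmax=fmax)
mel_env = envelopes @ mel_fb.T                     # (n_frames, 60) mel-spaced bins
log_mel_env = np.log(np.maximum(mel_env, 1e-8))    # log amplitudes used as network targets
```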
B. Spectral Envelope Generation

The spectral envelopes are generated by a recursive convolutional neural network. The network predicts one spectral envelope at a time by using the previous spectral envelopes as well as a window of the phoneme, f0 and loudness values of previous, current and next time steps. We thus let the network see a large context of control parameters from both future and past and thereby allow it to create its own encoding.

The architecture is inspired by [9] and [10]. These systems use a neural network to create Mel-spectra, which are then converted to raw audio by either the Griffin-Lim algorithm or a vocoder. However, since we model the f0-curve separately and do not encode it in the output, we can use a much simpler model.

Since all input parameters are given at the same rate, there is no need for attention. Only an encoding network for the control parameters, a pre-net for the previous spectral envelopes and a frame reconstruction network remain. In all parts we use blocks of dilated convolutions with exponentially growing dilation rate [8], but in addition to dilated convolutions in the time direction (as used in WaveNet and [11], which we shall call time-dilated convolutions) we also use dilated convolutions in the frequency direction (frequency-dilated convolutions).

We can summarise the architecture as follows (cf. Fig. 1; a simplified code sketch follows this list):

• Input the envelopes from the last n_e time steps and n_i phoneme, f0 and loudness values (where n_i is different for each parameter) from a window of previous, current and next time steps around the current time step. The phonemes are mapped to 60 frequency bins by an embedding layer; f0 and loudness values are mapped to 60 frequencies by outer products.

• For each input parameter, use a stack of n_i time-dilated convolution layers with a kernel size of 2 in the time direction and a dilation rate of 2^l in layer l (starting with l = 0). No zero padding is done here. The convolution for the envelopes is causal, the other convolutions are symmetric around the current time step.

• After the time-convolutions, the time dimension is reduced to one, while the frequency dimension remains 60 for each input parameter. We concatenate all outputs from the time-convolutions along the feature dimension.

• The new frame is generated from the concatenation by several stacks of dilated convolutions in the frequency direction with DenseNet skip-connections and bottleneck layers [21]. We use three stacks of four layers and use zero padding to keep the frequency dimension.

• The final output is produced by a (1 × 1) convolution with one filter, and the result is added to the previous frame. We thus only learn the difference from the previous frame to the next frame.
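The following PyTorch sketch illustrates the overall data flow of this architecture in a strongly simplified form. Channel widths, kernel widths along frequency, the number of phoneme classes and the merging of the control curves are invented, and the DenseNet-style bottlenecks are replaced by plain frequency-dilated convolutions to keep the example short; it is not the exact model.

```python
import torch
import torch.nn as nn

N_MELS, N_PHONEMES, EMB_DIM = 60, 40, 60   # 40 phoneme classes is an assumption

class TimeStack(nn.Module):
    """Stack of time-dilated 2d convolutions (dilation 2**l in layer l).
    No padding in time, so 16 input frames collapse to a single frame."""
    def __init__(self, n_layers, channels):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=(2, 3),
                      dilation=(2 ** l, 1), padding=(0, 1))
            for l in range(n_layers)])

    def forward(self, x):                      # x: (batch, channels, time, freq)
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

class FreqStack(nn.Module):
    """Stack of frequency-dilated convolutions, zero-padded so that the
    60 frequency bins are preserved (stand-in for the DenseNet stacks)."""
    def __init__(self, n_layers, channels):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=(1, 3),
                      dilation=(1, 2 ** l), padding=(0, 2 ** l))
            for l in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

class EnvelopePredictor(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.phon_emb = nn.Embedding(N_PHONEMES, EMB_DIM)
        self.lift = nn.Conv2d(1, channels, kernel_size=1)    # shared 1x1 input lift
        self.env_stack = TimeStack(4, channels)              # causal context of 16 frames
        self.phon_stack = TimeStack(4, channels)
        self.ctrl_stack = TimeStack(4, channels)
        self.freq_stack = FreqStack(4, 3 * channels)
        self.out = nn.Conv2d(3 * channels, 1, kernel_size=1)
        self.freq_vec = nn.Parameter(torch.linspace(0.0, 1.0, N_MELS))

    def forward(self, prev_envs, phonemes, f0, loudness):
        # prev_envs: (batch, 16, 60) previous log-mel envelope frames
        # phonemes:  (batch, 16) integer phoneme ids around the current step
        # f0, loudness: (batch, 16) control curves in the same window
        env = self.lift(prev_envs.unsqueeze(1))                   # (b, c, 16, 60)
        phon = self.lift(self.phon_emb(phonemes).unsqueeze(1))    # (b, c, 16, 60)
        # f0 and loudness lifted to 60 bins by an outer product with a
        # learned frequency vector (toy version of the outer products above).
        ctrl = (f0 + loudness).unsqueeze(-1) * self.freq_vec      # (b, 16, 60)
        ctrl = self.lift(ctrl.unsqueeze(1))                       # (b, c, 16, 60)
        h = torch.cat([self.env_stack(env),
                       self.phon_stack(phon),
                       self.ctrl_stack(ctrl)], dim=1)             # (b, 3c, 1, 60)
        delta = self.out(self.freq_stack(h)).squeeze(1).squeeze(1)  # (b, 60)
        return prev_envs[:, -1, :] + delta     # residual: previous frame + difference
```

With 16 previous frames, the four time-dilated layers (kernel height 2, dilations 1, 2, 4, 8) collapse the time axis to a single frame, which is then refined along the frequency axis and added to the previous frame as a residual. A single forward pass, e.g. `EnvelopePredictor()(torch.randn(2, 16, 60), torch.randint(0, 40, (2, 16)), torch.rand(2, 16), torch.rand(2, 16))`, returns a batch of two predicted 60-bin frames.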
C. Training

The number of layers in the stacks of time-dilated convolutions is chosen separately for the spectral envelopes, for the phonemes, and for the f0- and loudness-curves (which share the same value).

We train the model using the Adam optimizer [22] with β1 = 0.9 and β2 = 0.999, just like in the original paper, but with a small initial learning rate and a slight learning-rate decay per update (batch). We feed the network with minibatches consisting of 16 samples, each chosen from random locations.

The loss is obtained as a simple mean squared error (MSE) of the log amplitudes of the individual frequency bins. Other error functions like the mean absolute error or Sobolev norms (sums of L^p norms of the signal and of its derivatives) were also considered, but we found that the results did not differ significantly.
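A minimal sketch of this training configuration follows; only the Adam betas, the batch size of 16 and the log-amplitude MSE come from the text, while the learning rate and the stand-in model are placeholders.

```python
import torch

# Sketch of the loss and optimiser configuration described above.
model = torch.nn.Linear(60, 60)        # stand-in for the envelope predictor
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
batch_size = 16                        # samples per minibatch, drawn from random locations

def frame_loss(predicted_log_env, target_log_env):
    # Mean squared error of the log amplitudes of the individual bins.
    return torch.mean((predicted_log_env - target_log_env) ** 2)
```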
III. DIFFERENCES WITH EXISTING MODELS

A. 2d vs. 1d Convolutions
The authors of [11] claim that "the translation invariance that 2d convolutions offer is an undesirable property for the frequency dimension". Although we indeed do not expect to see every formant in each frequency bin with equal probability, formants can be found at different frequency locations. To be able to reduce the representation of the formants, we need to be able to shift the filters in both time and frequency.

To support our claim, we build a 2d version of the WaveNet-style network from [11] and compare it to the original version to show that it does in fact yield better audio.

The 2d version of [11] replaces the dilated 1d convolutions with dilation rates of 2^l by dilated 2d convolutions whose dilation rate along the time axis is likewise 2^l. We reduce the number of filters dramatically, so that we now have fewer trainable parameters (about one third) as compared to the original model, but still more features per time step.
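The parameter saving can be illustrated with a toy comparison; the channel counts below are invented and do not reproduce the layer sizes of [11], they only show why sharing 2d filters across frequency needs fewer weights than a 1d convolution over a 60-dimensional feature vector.

```python
import torch.nn as nn

n_bins, ch_1d, ch_2d = 60, 100, 20   # invented sizes for illustration

# 1d view: each frame is a 60-dimensional feature vector, convolved in time.
conv1d = nn.Conv1d(in_channels=n_bins, out_channels=ch_1d,
                   kernel_size=2, dilation=4)

# 2d view: the same data seen as a (time, frequency) image with few channels;
# the filters are shared across all frequency positions.
conv2d = nn.Conv2d(in_channels=ch_2d, out_channels=ch_2d,
                   kernel_size=(2, 3), dilation=(4, 1), padding=(0, 1))

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(conv1d))   # 60 * 100 * 2 + 100   = 12100 weights
print(n_params(conv2d))   # 20 * 20 * 2 * 3 + 20 =  2420 weights
```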
B. Predicting Distributions

It is common practice in prediction tasks to learn to predict distributions rather than samples. Distributions allow modelling data that is uncertain or noisy. In the case of WaveNet [8] the system models a time series that is a mix of a periodic signal and coloured noise. The coloured noise cannot be modelled by a deterministic system, and therefore predicting a distribution and sampling from it is necessary.

Reference [11] uses the WaveNet architecture to generate not audio, but spectral envelopes. Their system predicts the parameters of a constrained Gaussian mixture to generate an independent parametric probability distribution for each frequency bin of the spectral envelope. However, there are some very important differences between raw audio and spectral envelopes: raw audio (as modelled by WaveNet) has only one dimension per time step, while spectral envelopes are modelled (here) with 60 frequency bins. Raw audio is rapidly changing and contains oscillations and coloured noise, while spectral envelopes are not oscillating, slowly changing and not noisy.

Since one time step of spectral envelopes contains 60 frequency bins, it is impossible to model all correlations of all frequency bins. This is typically not necessary, as correlations between frequency bins that are far apart can be assumed to be insignificant. Nevertheless, there are correlations between neighbouring frequency bins that cannot be neglected if the goal is to model the actual probability distribution of the spectral envelopes. Generating independent parametric distributions for each frequency bin F_i (as is done by [11]) must either assume that the frequency bins are independent (which they are not) or in fact yield an approximation of the true distribution by the conditional expectations F̃_i = E(F_i | {F_j : j ≠ i}). This is however the uninteresting part of the distribution. The conditional expectation F̃_i describes the independent noise in each frequency band, while multiple possible positions of formants are not modelled at all.

Since spectral envelopes are not noisy, we believe that it is not necessary at all to predict probability distributions. Our approach generates a spectral envelope directly. This can be seen as generating the most probable sample from the unknown (and unfeasible) distribution of the spectral envelopes.
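To make the contrast concrete, the sketch below puts a per-bin Gaussian negative log-likelihood (a simplified stand-in for the constrained Gaussian mixture of [11], not their exact parameterisation) next to the direct MSE objective used here.

```python
import torch

def per_bin_gaussian_nll(mean, log_std, target):
    # Independent Gaussian per frequency bin: models only per-bin spread,
    # not correlations between neighbouring bins or alternative formant positions.
    return torch.mean(log_std + 0.5 * ((target - mean) / log_std.exp()) ** 2)

def direct_mse(pred, target):
    # Direct prediction, interpreted as the most probable envelope.
    return torch.mean((pred - target) ** 2)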
C. Stability by iterated prediction

One problem with recursive models is stability. During prediction the error accumulates over time, and once the model has strayed too far from the path there is no way to recover, because the system is in a state which it has never seen during training. It is also worth noting that the envelopes do not change much within phonemes, but change more rapidly during a phoneme change.

To learn to make good predictions over a long time, a typical approach is to add noise to the input envelopes, to simulate envelopes that have previously been predicted improperly, or predicted properly but not contained in the training set. However, the noise level needs to be very high and thus reduces the quality of the training data (Reference [11] suggests a noise level given as a percentage of the value range). Instead, we enforce stability by iteratively predicting dozens of frames for each batch and applying the loss function to all predicted envelopes. In this way we force the network to consider the long-term evolution and to recover from prediction errors when they occur.
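Below is a minimal sketch of such an iterated (rollout) training step. The rollout length, context length and the model's call signature are assumptions made for the example; only the idea of feeding predicted frames back in and applying the loss to every predicted frame comes from the text.

```python
import torch

def iterated_training_step(model, optimiser, envelopes, controls,
                           context=16, rollout=32):
    # envelopes: (batch, context + rollout, 60) ground-truth log-mel frames
    # controls:  whatever conditioning the model expects, indexable by time step
    history = envelopes[:, :context, :]            # seed with real frames
    loss = 0.0
    for t in range(rollout):
        pred = model(history, controls[:, t])      # predict one new frame
        target = envelopes[:, context + t, :]
        loss = loss + torch.mean((pred - target) ** 2)
        # feed the *predicted* frame back as input for the next step
        history = torch.cat([history[:, 1:, :], pred.unsqueeze(1)], dim=1)
    loss = loss / rollout
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```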
TABLE I
THE DIFFERENT MODELS THAT WERE TRAINED FOR THE PERCEPTIVE TESTS.

Name   | Architecture    | Conv. | Loss   | Data Augmentation
BB1    | Blaauw & Bonada | 1-d   | CGM(a) | noise
BB2    | Blaauw & Bonada | 2-d   | CGM    | noise
MSE    | Bous & Roebel   | 2-d   | MSE(b) | iterated
CGM    | Bous & Roebel   | 2-d   | CGM    | iterated
iter   | Bous & Roebel   | 2-d   | MSE    | iterated
noise  | Bous & Roebel   | 2-d   | MSE    | noise

(a) constrained Gaussian mixture from [11]   (b) mean squared error

IV. EXPERIMENTAL SETUP
To support our claims from Section III and to show that our network works well, we have conducted two perceptive tests with several different models: a direct comparison and a MOS test.

We train the networks on our singing database [3], consisting of short phrases, additional recordings of various pitches, loudnesses and crescendi, as well as short excerpts from real songs, all from a single tenor voice, totalling about 90 minutes of singing. We split these recordings into training and testing files, where for each model we use the same training and testing files.

To regenerate the spectral envelopes with models that predict a probability distribution, we use the constrained Gaussian mixture from [11] with a generation temperature of τ = 0 to minimise sampling noise.

To obtain raw audio we resynthesize the testing files with the SVLN vocoder [14], [15] by replacing the original spectral envelopes with the regenerated envelopes. We also include a resynthesis with ground truth envelopes by resynthesizing the testing files without replacing the envelopes, thus resulting in a vocoder round trip. This procedure ensures that differences in the audio are exclusively due to differences in the spectral envelopes that are used, and not to the use of the vocoder itself.

To evaluate each of the proposed changes we perform a direct comparison of two models that differ only with respect to the single hyper-parameter subject to testing. Given our three modifications we evaluate

• the use of 2d versus 1d convolutions by means of comparing our reimplementation of [11] and our modification as described in Section III-A,

• the advantage of modelling predictions as probability distributions by means of comparing a model trained with the MSE loss with another trained to maximise the log-likelihood of the distribution of predicted samples,

• iterated training by means of comparing a model that was trained with a single prediction and noise of fixed standard deviation added to the input log-spectrum (this noise level was found to work best among the values that were tested), and another model that was trained recursively, performing several iterated predictions without any noise added to the input.

To identify whether a hyper-parameter is useful for overall quality, participants of our test were given the same phrase from both models and were asked to give a preference score, where positive values indicate a preference for the first model and negative values a preference for the second. The mean opinion score has been measured by asking the participants to rank the given phrases on a scale from 1 to 5, where 1 was the worst and 5 was the best. Each participant was given the same phrase from all five models, but the phrases may differ between participants. The models we used are summarised in Table I. For the MOS test the following models were used: the two models from the 2d/1d comparison (BB1 and BB2), the two models from the iterated vs. training-with-input-noise comparison (iter and noise), and a resynthesis with ground truth envelopes.

The survey was carried out online. We received submissions from participants with various backgrounds; a part of the submissions came from native French speakers.

TABLE II
PERCEPTIVE TEST RESULTS OF THE DIRECT COMPARISON. THE PREFERENCE IS GIVEN TOWARDS THE LEFT MODEL, I.E., A POSITIVE NUMBER IMPLIES A PREFERENCE TOWARDS THE FIRST MODEL. THE p-VALUE IS THE RESULT OF A ONE-SIDED t-TEST.

Comparison     | Preference (French) | p-Value (French) | Preference (all) | p-Value (all)
BB2 vs. BB1    | +1.44               | 0.03%            | +0.94            | 0.00%
CGM vs. MSE    | +0.00               | 50.00%           | +0.32            | 3.…%
iter vs. noise | +0.78               | 1.51%            | +0.48            | 0.…%

TABLE III
PERCEPTIVE TEST RESULTS FOR MEAN OPINION SCORES (MOS) WITH CONFIDENCE INTERVALS.

Model        | MOS (French) | MOS (all)
iter         | … ± 0.38     | 3.… ± …
noise        | … ± 0.33     | 3.… ± …
BB1          | … ± 0.38     | 2.… ± …
BB2          | … ± 0.46     | 3.… ± …
Ground truth | … ± 0.53     | 3.… ± …

V. RESULTS
Preferences of native French speakers are listed separately because the phrases were in French. We can see that native French speakers were more critical (apparent in the MOS test, cf. Table III). This may be because native French speakers could additionally judge the pronunciation, which may still not be entirely convincing (in the feedback it was actually mentioned that the singing voices seem to have a kind of accent).
A. Comparison Test
Table II shows the results from the comparison test. The "Preference" column contains the mean of the preference values that were submitted towards the left model, i.e., in the comparison a vs. b, positive values mean that a was preferred and negative values mean that b was preferred. The "p-Value" column contains the p-value of the one-sided Student t-test, i.e., the probability of observing the given data under the null hypothesis that the right model is better than or equal to the left one.

There is a very clear preference towards the use of 2d convolutions among both native French speakers and all participants in total. The reported p-value of 0.00% actually means that the p-value was below 0.005% and was rounded down. A strong preference was also given towards the iterated training method. No clear preference could be deduced for the choice of loss function: while a slight (and significant) preference was given by all participants in total, no preference was found among native French speakers. Incidentally the preference values add up to zero; however, there were submissions with both negative and positive preferences.
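For reference, the one-sided test can be reproduced with scipy as sketched below; the preference values in the example are invented and are not the survey data.

```python
import numpy as np
from scipy import stats

# One preference value per participant (invented example data).
preferences = np.array([2, 1, 0, 3, -1, 2, 1, 0, 2, 1])

# Null hypothesis: mean preference <= 0 ("the right model is better or equal").
t_stat, p_two_sided = stats.ttest_1samp(preferences, popmean=0.0)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"mean preference {preferences.mean():+.2f}, one-sided p = {p_one_sided:.4f}")
```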
B. Mean Opinion Score Test

Table III shows the results from the MOS test. The given values are the mean of the submitted scores plus/minus half the width of the confidence interval obtained from a two-sided Student t-test.

The preferences are not as clear as in the comparison test. While the relative preferences cannot be statistically confirmed among native French speakers, the preference over the state-of-the-art model BB1 is supported by the scores of all participants.

The confidence intervals are rather large due to the admittedly small number of participants. The lower conclusiveness of the MOS test can be explained by its design: during the MOS test the participants were exposed to the (almost) same recording five times. While they might have heard some differences among the individual versions, they were much more inclined to put them in the same category, because the versions were still very similar, than in the comparison test, where they were explicitly asked to favour one recording over the other.
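The confidence-interval computation can be sketched as below; the scores are invented and the 95% level is an assumption, since the confidence level is not restated here.

```python
import numpy as np
from scipy import stats

scores = np.array([3, 4, 2, 3, 3, 4, 2, 3])   # MOS scores of one model (invented)
mean = scores.mean()
sem = stats.sem(scores)                        # standard error of the mean
half_width = sem * stats.t.ppf(0.975, df=len(scores) - 1)   # half of a two-sided 95% CI
print(f"MOS = {mean:.2f} ± {half_width:.2f}")
```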
VI. CONCLUSIONS
In this paper we introduced a neural network architecture that is able to generate spectral envelopes for singing synthesis using a vocoder model. We showed in perceptive tests that the modifications we made with respect to the state-of-the-art method are useful in improving the perceived result. In particular we showed that 2d convolutions are beneficial for modelling spectral envelopes and that iteratively predicting multiple frames during training is superior to simply injecting noise at the input. An investigation of whether predicting probability distributions rather than single samples is beneficial was also carried out, but no benefit could be found when evaluating among native French speakers.
VII. ACKNOWLEDGMENTS
We would like to thank Merlijn Blaauw for helpful discussions supporting our implementation of [11].
REFERENCES

[1] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, no. 5-6, pp. 453–467, 1990.
[2] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), vol. 1, 1996, pp. 373–376.
[3] L. Ardaillon, "Synthesis and expressive transformation of singing voice," Ph.D. dissertation, EDITE; UPMC-Paris 6 Sorbonne Universités, 2017.
[4] J. Bonada, M. Umbert, and M. Blaauw, "Expressive singing synthesis based on unit selection for the singing synthesis challenge 2016," in INTERSPEECH, 2016, pp. 1230–1234.
[5] H. Kenmochi and H. Ohshita, "Vocaloid - commercial singing synthesizer based on sample concatenation," in Eighth Annual Conference of the International Speech Communication Association, 2007.
[6] X. Gonzalvo, S. Tazari, C.-a. Chan, M. Becker, A. Gutkin, and H. Silen, "Recent advances in Google real-time HMM-driven unit selection synthesizer," in Interspeech, 2016, pp. 2238–2242.
[7] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak, "Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices," arXiv preprint arXiv:1606.06061, 2016.
[8] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in SSW, 2016, p. 125.
[9] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," arXiv preprint arXiv:1712.05884, 2017.
[10] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: 2000-speaker neural text-to-speech," arXiv preprint arXiv:1710.07654, 2017.
[11] M. Blaauw and J. Bonada, "A neural parametric singing synthesizer modeling timbre and expression from natural songs," Applied Sciences, vol. 7, no. 12, p. 1313, 2017.
[12] L. Ardaillon, C. Chabot-Canet, and A. Roebel, "Expressive control of singing voice synthesis using musical contexts and a parametric f0 model," in Interspeech 2016, vol. 2016, 2016, pp. 1250–1254.
[13] L. Ardaillon, G. Degottex, and A. Roebel, "A multi-layer f0 model for singing voice synthesis using a b-spline representation with intuitive controls," in Interspeech 2015, 2015.
[14] G. Degottex, P. Lanchantin, A. Roebel, and X. Rodet, "Mixed source model and its adapted vocal-tract filter estimate for voice transformation and synthesis," Speech Communication, vol. 55, no. 2, pp. 278–294, 2013.
[15] S. Huber and A. Roebel, "On glottal source shape parameter transformation using a novel deterministic and stochastic speech analysis and synthesis system," in Proc. InterSpeech, 2015.
[16] P. Lanchantin, A. C. Morris, X. Rodet, and C. Veaux, "Automatic phoneme segmentation with relaxed textual constraints," in Proc. of The International Conference on Language Resources and Evaluation, 2008.
[17] A. Röbel, F. Villavicencio, and X. Rodet, "On cepstral and all-pole based spectral envelope modeling with unknown model order," Pattern Recognition Letters, Special issue on Advances in Pattern Recognition for Speech and Audio Processing, vol. 28, no. 6, pp. 1343–1350, 2007.
[18] B. R. Glasberg and B. C. Moore, "A model of loudness applicable to time-varying sounds," Journal of the Audio Engineering Society, vol. 50, no. 5, pp. 331–342, 2002.
[19]
[20] Third International Conference on Spoken Language Processing, 1994.
[21] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, vol. 1, no. 2, 2017, p. 3.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.