HpRNet : Incorporating Residual Noise Modeling for Violin in a Variational Parametric Synthesizer
Krishna Subramani, Preeti Rao
Indian Institute of Technology Bombay [email protected]
ABSTRACT
Generative models for audio synthesis have been gaining momentum in the last few years. More recently, parametric representations of the audio signal have been incorporated to facilitate better musical control of the synthesized output. In this work, we investigate a parametric model for violin tones, in particular the generative modeling of the residual bow noise to make for more natural tone quality. To aid in our analysis, we introduce a dataset of Carnatic violin recordings where bow noise is an integral part of the playing style of higher pitched notes in specific gestural contexts. We obtain insights about each of the harmonic and residual components of the signal, as well as their interdependence, via observations on the latent space derived in the course of variational encoding of the spectral envelopes of the sustained sounds.
1. INTRODUCTION
Physical and Spectral Modeling Synthesis are model-driven audio modeling procedures. "Neural Audio Synthesis" changes the game to that of using data-driven learning approaches to audio synthesis. Sarroff et al. [22], Roche et al. [19] and Esling et al. [8] approached generative synthesis through frame-wise spectral autoencoding, along with additions (like architectural variations and regularization) for more controllable synthesis. Instead of directly modeling the spectrum, Engel et al. [7], Wyse et al. [28] and Défossez et al. [4] synthesize audio in the time domain, either autoregressively or with RNNs/LSTMs. With the release of the NSynth dataset [7], researchers were able to approach synthesis with deep(er) generative models, with the desire to obtain flexible control over musical attributes like timbre, pitch and loudness.

Audio can be modelled parametrically in a manner such that perceptually relevant parameters become available for musical control over the synthesized sound. A good demonstration of this is the Harmonic plus Residual (HpR) modeling by Serra et al. [23, 24], depicted in Figure 1. The idea is to decompose a signal into a sum of sinusoids whose frequencies are integer multiples of a fundamental frequency, and a residual.

© Krishna Subramani, Preeti Rao. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

[Figure 1: Harmonic plus Residual model - the HpR block decomposes h(t) + r(t) into h(t) and r(t)]

Consider the audio signal s(t),

    s(t) = \sum_{r=1}^{R} A_r(t) \cos(\theta_r(t)) + r(t) = h(t) + r(t),

where the first term h(t) is the harmonic component, and the second term r(t) is the residual. The residual is in essence that part of the audio signal that cannot be represented by a sum of harmonic partials, with R = number of partials used.
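The decomposition above can be illustrated with a minimal synthesis sketch (the amplitudes, pitch and white-noise residual below are hypothetical stand-ins, not values from the paper):

```python
import numpy as np

def hpr_synthesize(f0, amps, noise_std, sr=44100, dur=0.5):
    """Build s(t) = sum_r A_r cos(2*pi*r*f0*t) + r(t), where the residual
    r(t) is modeled here simply as white noise (illustrative only)."""
    t = np.arange(int(sr * dur)) / sr
    # Harmonic component: partials at integer multiples of f0.
    h = sum(a * np.cos(2 * np.pi * (r + 1) * f0 * t)
            for r, a in enumerate(amps))
    # Residual component: everything the harmonic sum cannot represent.
    r_t = noise_std * np.random.default_rng(0).standard_normal(len(t))
    return h + r_t, h, r_t

s, h, r = hpr_synthesize(f0=440.0, amps=[1.0, 0.5, 0.25], noise_std=0.01)
```

By construction, the signal is exactly the sum of the two components, which is the property the HpR analysis inverts.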
Examples in musical instruments include the breathy sound when playing the flute and the scratchy sound the bow makes when it moves against the violin string during note sustain regions.

The advantage of these parametric models is that they do not require us to model the audio waveform or spectrum directly; rather, we can work in the reduced parametric space. Combine this with the generative modeling capabilities of a neural network, and you can obtain a powerful audio synthesizer, one that can rely on small, simple network architectures, can be trained with less data, and that can potentially generate high quality audio with musically relevant control over it. Engel et al. [6] realized this with their Differentiable Digital Signal Processing (DDSP) pipeline, which used an autoencoder coupled with the HpR model. Subramani et al. [27] also combine the same parametric representation with a variational model for controlled synthesis of violin sounds. Neither of the above explicitly considers the modeling of the residual signal.

The violin is a popular instrument in both Western and Indian music. What makes it a popular choice in Carnatic music (classical music from Southern India) is its ability to produce a continuous pitch variation. This is an important component of the melodic motifs of raga music, which involve changing pitch and dynamics throughout the playing gesture. Consider the task of synthesizing a violin solo for a Carnatic music concert. Let us assume we have a dataset with a number of notes at different pitches and volumes corresponding to different Carnatic ragas. Given this, can we train a system for the synthesis of 'natural sounding music' in the same artist's style given any 'musical score' containing the typical continuous gesture motifs?
[Figure 2: Parametric model for a single frame - HpR analysis of x(t), TAE envelopes giving CC_H (with f0) and CC_R, followed by sinusoidal reconstruction of the harmonic h(t) and IFFT of the residual r(t), added to give x'(t); Overlap-Add yields the waveform]

Where (or rather, how) does parametric modeling come in then? Beauchamp [1] discusses the applicability of the Source-Filter (SF) model to violin audio. Unlike speech, the SF model has not been used widely to model musical instruments because of the possible coupling between the source and filter in instruments [25]. For the violin, however, string vibrations are (largely) independent of the body resonances, thus the independence assumption in the SF model is considered to hold [1, 16]. However, what makes the filter challenging to model is the observation that violin resonances are found to be much sharper (narrower) than those of voice [1]. This might lead to an indirect dependence of the filter on the source fundamental frequency f0 because of the f0-dependent sampling of the filter spectral envelope. Subramani et al. [27] take care of these 'inter-dependencies' between the source and filter by using a conditional variational model which learns the filter conditioned on the source f0. However, they only work with the harmonic component and neglect the residual component of the violin, which is a limitation needing to be addressed.

Fletcher et al. [9, 10] performed a very interesting series of experiments on the perceptually important aspects of violin synthesis. The first study [9] discusses the salient aspects that could differentiate a 'real' violin tone from a 'synthesized' one. One of those that is of interest to us, and which we will explore further, is the residual noise inherent in tone production (the noise produced when drawing the bow across the string). For the lower frequency notes, the fundamental and harmonics mask the noise. However, for the higher frequency notes, they are not able to mask the noise, hence the noise becomes audible.
While this 'noise' helps in discriminating synthetic notes from real ones, the studies claim that it is usually inaudible for notes of lower frequencies, but becomes audible at notes of higher frequencies. An important thing to consider while synthesizing the violin tone is whether this noise is produced independently of the harmonic component, or whether there are some dependencies. Fletcher et al. in their work synthesize this noise by drawing the bow across the bridge without exciting the strings, thus effectively making it independent of the harmonic component. Mathews et al. [16] in their studies propose a theory of 'Resonant Enhancement' of tones, which states that the rich timbre of the violin is essentially due to the string vibrations being filtered at the resonant locations of the violin body. This effectively tells us that if the string vibrations are filtered, then the noise produced by the bowing should also be filtered by the same resonances. Thus, both the harmonic and residual components are produced by the same driving force and cannot be assumed to be independent. Keeping in mind these possible dependencies between the harmonic and residual components for violin audio, we would like to investigate the joint modeling of the harmonic and residual spectral envelopes. We will do this by evaluating the reconstruction of sustained notes of various pitches and volume dynamics using signal reconstruction error. We also present audio examples of the same in the attached supplementary material.
2. PARAMETRIC MODEL
Figure 2 summarizes the parametric representation of violin audio that we employ. It is a source-filter inspired representation that builds on top of the HpR model [2, 3]. All the blocks mentioned operate on spectral frames extracted from the sustain portions of single note recordings by applying energy thresholds.

1. We run the HpR model [24] on each spectral frame.
2. We sub-sample the obtained harmonic and residual spectra. For the harmonic, we only keep the amplitude peaks corresponding to the harmonic locations, and for the residual, we simply downsample the original spectrum to a chosen fixed frequency interval. A residual subsampling rate of 100 Hz is mentioned for a sinusoidal representation of speech in [17]. We use a higher subsampling rate of 430 Hz, mentioned by Serra et al. in SMS-Tools [23, 24].
3. With the sub-sampled spectra, we use the True Amplitude Envelope (TAE) algorithm [12, 20] to obtain a smooth spectral envelope for each of the harmonic and residual components. The spectral envelopes are represented by their cepstral coefficients. For the harmonic cepstral vector (CC_H), the number of coefficients is chosen similarly to the procedure in [27]. The harmonic is also additionally characterized by the fundamental frequency of the frame, f0. For the residual cepstral vector (CC_R), we work with a fixed number of cepstral coefficients.
4. To reconstruct the harmonic portion, the sinusoid amplitudes are sampled from the harmonic locations of the TAE, and a sinusoidal reconstruction is performed. For the residual, we simply perform the inverse FFT of the residual spectrum with random phases. The net reconstruction is the sum of the two.

We use the HpR model as implemented in SMS-Tools [23, 24]. For the TAE algorithm, we use the implementation in [27].
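Steps 3 and 4 can be sketched in numpy. This is a minimal illustration of the iterative TAE idea (cepstral smoothing pushed up onto the spectral peaks, after Imai [12]) and of random-phase residual resynthesis; it is not the implementation used in the paper, and the bin counts and iteration budget are illustrative assumptions:

```python
import numpy as np

def cepstral_smooth(log_mag, n_coeffs):
    """Cepstrally smooth a one-sided log-magnitude spectrum by keeping
    only the first n_coeffs quefrency bins (and their mirror)."""
    full = np.concatenate([log_mag, log_mag[-2:0:-1]])  # even symmetry
    cep = np.fft.fft(full).real                          # real for even input
    lifter = np.zeros_like(cep)
    lifter[:n_coeffs] = 1.0
    if n_coeffs > 1:
        lifter[-(n_coeffs - 1):] = 1.0                   # mirrored quefrencies
    smooth = np.fft.ifft(cep * lifter).real
    return smooth[:len(log_mag)]

def true_amplitude_envelope(log_mag, n_coeffs, n_iters=50):
    """Iterative TAE: repeatedly smooth, then clamp the result from below
    by the original spectrum, so the envelope rides the peaks instead of
    averaging through them."""
    env = log_mag.copy()
    for _ in range(n_iters):
        smooth = cepstral_smooth(env, n_coeffs)
        env = np.maximum(log_mag, smooth)
    return smooth

def residual_frame(env_log_mag, n_fft):
    """Resynthesize one residual frame: envelope magnitude, random phases."""
    mag = 10 ** (env_log_mag / 20)
    phase = np.random.default_rng(0).uniform(-np.pi, np.pi, len(mag))
    return np.fft.irfft(mag * np.exp(1j * phase), n=n_fft)
```

Overlap-adding such frames, as in Figure 2, recovers the time-domain residual.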
3. DATASET
There does not exist a publicly available dataset suitable for synthesis of Carnatic music, especially for the violin. NSynth [7] is a large musical note recording dataset. Good-sounds [21] is a similar dataset consisting of musical notes and scales recorded for different instruments. However, both of these datasets work with MIDI notes and are not that expressive. Keeping in mind our task of expressive synthesis, we would ideally like a dataset which is recorded keeping in mind the Carnatic playing style. We recorded an experienced Carnatic violinist playing a set of scale notes at various loudness levels and playing styles, as detailed in the following tables.
Carnatic Note:  Sa  Ri   Ri   Ga   Ga   Ma   Ma   Pa  Dha   Dha   Ni   Ni
Notation:       Sa  Ri1  Ri2  Ga2  Ga3  Ma1  Ma2  Pa  Dha1  Dha2  Ni2  Ni3

Table 1: Carnatic music notation for the 12 semitones of an octave
Description   Options                Notation
Octave        Lower, Middle, Upper   L, M, U
Loudness      Soft, Loud             So, Lo
Style         Smooth, Attack         Sm, At

Table 2: Recording parameters

For each note and choice of style, there are 2 instances recorded, each approximately 2-3 seconds long. More details on the dataset are available in the attached supplementary material.
4. GENERATIVE MODELS
Variational Autoencoders (VAEs) [14] are our choice of generative models. They can be viewed as an autoencoder with a prior enforced on the latent space [5]. They are trained by maximizing the variational lower bound given by

    L = E_{z ~ Q}[ log P(X|z) ] - \beta D_{KL}( Q(z|X) || P(z) ),

where the first term corresponds (up to sign) to the mean squared error (MSE) between the input and output, and the second term enforces the prior distribution on the latent space. β controls the trade-off [11] between the two terms. A VAE can be thought of as an encoder-decoder pair where the encoder outputs the means and variances for the latent distribution. Using the re-parametrization trick [14], we sample from N(0, I) and transform the sample through the encoder's mean and variance. This 'latent' variable is then passed through the decoder to obtain the network's reconstruction of the input.

Conditional VAEs (CVAEs) [26] work the same way as VAEs; however, they condition the input on an additional conditioning variable. CVAEs were successfully employed in [27] for the synthesis of the harmonic component of the violin; we extend the same approach to modeling the residual signal.

[Figure 3: Network architectures - (a) Independent modeling (INet): CVAE_H on CC_H conditioned on f0, and VAE_R on CC_R; (b) Concatenative modeling (ConcatNet): CVAE_C on concatenated (CC_H, CC_R), conditioned on f0; (c) Modeling sum and difference (JNet): CVAE_S on (CC_H + CC_R) and CVAE_D on (CC_H - CC_R), both conditioned on f0]
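The two terms of the bound and the re-parametrization trick can be written down concretely. A minimal numpy sketch, assuming a diagonal-Gaussian posterior parameterized by a mean and log-variance (the β value below is an arbitrary placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparametrize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var, beta=0.1):
    """Negative variational lower bound: MSE reconstruction term plus
    beta-weighted KL( Q(z|x) || N(0, I) ) for a diagonal Gaussian."""
    mse = np.mean((x - x_recon) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var,
                      axis=-1).mean()
    return mse + beta * kl
```

Note that the KL term vanishes exactly when the posterior matches the prior (mu = 0, log_var = 0), which is how β trades reconstruction accuracy against latent-space regularity.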
5. NETWORK ARCHITECTURE
The inputs to our CVAEs are the harmonic or residual CCs (CC_H, CC_R), along with pitch f0 as a conditional input for the harmonic CCs. We follow the experimental procedure presented in [27] to obtain the optimal values of the hyperparameters: β = 10^− and a latent space dimensionality of 32. The encoder is a linear fully connected neural network with leaky ReLU activations (which allow for stable training and for the output to take negative values). The decoder has the same architecture as the encoder, but with reversed dimensions. We have implemented all the networks in PyTorch [18]. We train our model on a mobile Nvidia GeForce GTX 1070 with batch size 512. We use ADAM [13] as the optimizer with an initial learning rate of 10^−, and run the optimization for 2000 epochs.

[Figure 4: Spectral envelopes and latent space visualizations for the harmonic (top) and residual (bottom) components - a), b) spectral envelopes (magnitude in dB) for Sa (f0 = 328 Hz), Ga2 (f0 = 389 Hz) and Ma2 (f0 = 461 Hz); c), d) latent spaces without f0 conditioning; e), f) with f0 conditioning, with points labeled by note (Sa through Ni3)]
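The encoder-decoder structure just described can be sketched as a forward pass in plain numpy (untrained random weights; the hidden width and f0 normalization are assumptions for illustration, only the latent dimensionality of 32 comes from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

class CVAESketch:
    """Linear fully connected CVAE: conditioning is done by concatenating
    the (normalized) pitch f0 to both the encoder and decoder inputs."""
    def __init__(self, in_dim, cond_dim=1, hidden=100, latent=32):
        def lin(n_in, n_out):
            return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)
        self.enc = lin(in_dim + cond_dim, hidden)
        self.mu_head = lin(hidden, latent)
        self.lv_head = lin(hidden, latent)
        self.dec = lin(latent + cond_dim, hidden)
        self.out = lin(hidden, in_dim)

    def forward(self, x, f0):
        # Encoder: input plus condition -> mean and log-variance.
        h = leaky_relu(np.concatenate([x, f0], -1) @ self.enc[0] + self.enc[1])
        mu = h @ self.mu_head[0] + self.mu_head[1]
        log_var = h @ self.lv_head[0] + self.lv_head[1]
        # Re-parametrized sample, then decoder (condition re-appended).
        z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
        h2 = leaky_relu(np.concatenate([z, f0], -1) @ self.dec[0] + self.dec[1])
        return h2 @ self.out[0] + self.out[1], mu, log_var
```

The actual models are trained in PyTorch as stated above; this sketch only fixes the data flow of the conditioning.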
6. EXPERIMENTS
We investigate the following two aspects in our work:
1. The role of explicit pitch conditioning in the VAE to model the harmonic and residual components.
2. Possible inter-dependencies between the harmonic and residual components.
Figure 3 summarizes the 3 network architectures we investigate. For the first task, we independently model the harmonic and residual components with individual CVAEs as shown in Figure 3a, and use a combination of spectral envelope plots and CVAE latent space visualizations to obtain insights. For the second task, we introduce networks that jointly model the harmonic and residual components, as shown in Figures 3b and 3c.
The traditional SF model from speech processing assumes independence between the source and filter, which is largely true for the vocal apparatus. If that were the case for the violin as well, then we should in principle be able to model the violin by only modeling the spectral envelope for a single f0.

Figure 4 a) shows spectral envelopes for different f0. The shape differs across pitches. However, rather than these variations occurring because of the non-independence of the source and filter, we speculate that they occur because of the narrow resonances in the violin body [1]. Thus, even for a slight change in f0, the relative amplitudes can change quite drastically. This has been noted by Beauchamp in [1] and Fletcher in [10]. The envelopes we plot in Figure 4 a) show exactly this variation across pitches. Thus, by conditioning the envelopes on the pitch, we can expect the network to better reconstruct the spectral envelope.

To further convince ourselves of the need for conditioning, we visualize the latent spaces of our VAE. Figure 3a shows the network we employ. Since our latent space is quite high dimensional (32 in our case), to visualize it we use the t-SNE algorithm [15], which projects high dimensional data onto lower dimensions (2 in our case) and helps in effectively visualizing clusters in the data. Figure 4 c), e) shows the harmonic latent spaces without and with pitch conditioning. If the harmonic spectral envelope were independent of pitch, then we should ideally not see any clustering in the latent space. However, we can see considerable clustering when we do not condition on the pitch. Another interesting thing to observe in the clustering is its structure. For close notes, the clusters are close, and the clusters move away (from right to left) as you progress from Sa to Ni3. The black arrow overlaid on top shows the progression of note clusters from Sa to Ni3.
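The visualization step itself is standard: project the 32-dimensional latent vectors to 2-D with t-SNE and color by note label. A sketch using scikit-learn (the latent vectors here are random stand-ins for the encoder outputs, and the perplexity is an arbitrary choice):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
latents = rng.standard_normal((60, 32))   # stand-in for 32-D encoder means
notes = np.repeat(np.arange(12), 5)       # one label per Carnatic note

# Project the 32-D latent vectors to 2-D for cluster inspection.
emb = TSNE(n_components=2, perplexity=10,
           init="random", random_state=0).fit_transform(latents)
```

Plotting `emb` with one color per entry of `notes` reproduces the kind of cluster maps shown in Figure 4.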
In essence, this plot tells us that the latent space still contains information about the pitch, thus providing additional motivation to condition the envelope on the pitch. On doing this, we can see in the latent space that all the notes are clustered together. Thus, with the pitch as a conditional, the decoder can correctly sample the latent space to obtain the correct harmonic envelope for that pitch.

The residual envelopes depict a different picture, though. Figure 4 b) shows that the residual spectral envelope does not significantly change for different pitches, thus hinting that the residual spectral envelope is indeed not dependent on the pitch (as the SF model suggests). This can be explained by the fact that we have sufficiently sub-sampled the actual residual spectrum to capture variations in the envelope. The residual latent space visualizations in Figure 4 d), f) also re-affirm our conclusion. We do not observe any kind of clustering, either without or with pitch conditioning, thus suggesting that the residual spectral envelopes are indeed independent of the pitch.

One important question still remains. Are the harmonic and residual portions somehow coupled to each other? If this is the case, simply modeling the individual components with independent networks could be sub-optimal. Where do these dependencies arise from? To answer that, we go back to the 'Resonant Enhancement' theory of tones [16], which states that the rich timbre of the violin is essentially due to the string vibrations being filtered at the resonant locations of the violin body. When we bow the string harder to produce a louder tone, the residual component will also be loud, and both will be filtered by the violin body simultaneously, indicating that the harmonic and residual fundamentally depend on the playing style of the note. To check our hypothesis, we show the harmonic and residual spectral envelope variations in Figure 5 for the same note while varying the loudness from soft to loud.
[Figure 5: Harmonic (left) and residual (right) spectral envelopes (magnitude in dB vs. frequency in kHz) for soft and loud renditions of the same note]
The blue lines in the harmonic spectral envelope in Figure 5 represent the magnitude differences for the harmonics. If loudness variation were a simple amplitude scaling, then both the harmonic and residual envelopes should be shifted up (on log plots) and the blue lines should all be the same length. However, as we see, a loudness increase is not just a scaling. It causes certain frequencies to be boosted, others to be suppressed, and also changes the tilt of the spectral envelope. This further strengthens our hypothesis that the harmonic and residual envelopes must be dependent, as they have a common underlying origin in the played style of the note. Figures 3b and 3c show the 2 additional network architectures we try out, besides the independent modeling used in the first experiment.

There could be many different ways to try joint modeling in a neural network. The simplest procedure, however, is to concatenate the inputs and feed them to a single CVAE, as shown in Figure 3b. Since the encoder and decoder are given as input both the harmonic and residual CCs, the reconstruction inherently takes both components into account. The second approach, modeling the sum and difference of the CCs, is less trivial. The intuition behind it comes from current methods that generatively model the magnitude spectrum of the sound [19, 22]. The magnitude spectrum is the sum of the harmonic and residual spectra. Thus, by directly modeling the spectrum, the autoencoder takes care of both of them together. If we could somehow model the difference of the harmonic and residual spectra as well, we could individually obtain the harmonic and residual components. That is exactly what we try to do via our network, as shown in Figure 3c. We have 2 networks, the sum and difference networks.
The sum network, in the process of autoencoding the sum of the harmonic and residual, inherently learns their joint dependencies. The difference network is a 'trick' to extract the individual harmonic and residual components from the sum network. We can obtain the harmonic and residual vectors by simply adding and subtracting the outputs of the sum and difference networks. One might ask why we need the individual components. Keeping in mind the end-goal of being able to synthesize audio, it is useful to have the harmonic and residual components if one is additionally interested in 'modifying' the audio (time stretching, frequency scaling, morphing etc.).

How do we decide which network works better? We plot the reconstruction MSE, computed as the average over all test instance frames given as input to the network (test here refers to the fact that the network has not seen these during training). We work with the sustain portion of the notes in our dataset, and split it into train and test data evenly. To allow the network to learn the potential dependencies between the harmonic and residual components, we train with frames of both loudnesses, soft and loud. Also, we choose notes in the higher octave because Fletcher et al. [9, 10] mention explicitly that the residual plays a more important perceptual role in the higher octaves. Thus, with this joint modeling, we hope to see the residual being reconstructed at a lower MSE.
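The add-and-subtract recovery step is a simple linear combination. A sketch, writing the sum network's output as (CC_H + CC_R)' and the difference network's output as (CC_H - CC_R)':

```python
import numpy as np

def jnet_recover(sum_out, diff_out):
    """Recover the harmonic and residual CC vectors from the reconstructed
    sum (CC_H + CC_R)' and difference (CC_H - CC_R)' of the two networks."""
    cc_h = 0.5 * (sum_out + diff_out)   # ((H+R) + (H-R)) / 2 = H
    cc_r = 0.5 * (sum_out - diff_out)   # ((H+R) - (H-R)) / 2 = R
    return cc_h, cc_r
```

When the two reconstructions are exact, the recovery is exact; in practice each network's reconstruction error mixes into both recovered components.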
[Figure 6: Note-wise reconstruction MSE for the harmonic (left) and residual (right) components, for notes Sa through Ga3, comparing INet, JNet and ConcatNet]

Figure 6 shows the note-wise reconstruction MSE. For the harmonic MSE, the independent modeling proves superior. Interestingly, for the residual MSE, the joint modeling methods result in a lower MSE, thus strengthening our belief in joint modeling of the harmonic and residual components. We have also presented a few audio examples of note reconstructions in the attached supplementary material.
7. CONCLUSION
We introduce HpRNet, a framework combining generative synthesis with parametric modeling of audio. To aid in our analysis, we also introduce a new Carnatic violin dataset, which we plan on making open to the MIR community. We highlight the necessity of pitch conditioning for the harmonic component. We also provide motivation to jointly model the harmonic and residual components instead of modeling them independently. The reconstruction MSE plots only give us a partial picture. To analyze the perceptual aspects of the reconstructed notes, we plan to conduct listening tests in the future, where we present the outputs from our model to experienced Carnatic violinists and ask them to rate how 'good' they think the sound is; this will help us in zeroing in on the perceptually relevant aspects of the audio for synthesis. This work modeled the sustain regions frame-wise; the attack needs to be modeled for a complete representation. Eventually, we hope to apply this work to the synthesis of natural sounding raga motifs or ornaments on the violin, characterized as they are by specific pitch and loudness dynamics. We also hope that our dataset encourages further research in Carnatic music synthesis.
8. REFERENCES

[1] James W. Beauchamp. Comparison of vocal and violin vibrato with relationship to the source/filter model. In Studies in Musical Acoustics and Psychoacoustics, pages 201-221. Springer, 2017.
[2] Marcelo Caetano and Xavier Rodet. A source-filter model for musical instrument sound transformation. Pages 137-140. IEEE, 2012.
[3] Marcelo Caetano and Xavier Rodet. Musical instrument sound morphing guided by perceptually motivated features. IEEE Transactions on Audio, Speech, and Language Processing, 21(8):1666-1675, 2013.
[4] Alexandre Défossez, Neil Zeghidour, Nicolas Usunier, Léon Bottou, and Francis Bach. SING: Symbol-to-instrument neural generator. In Advances in Neural Information Processing Systems, pages 9041-9051, 2018.
[5] Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
[6] Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, and Adam Roberts. DDSP: Differentiable digital signal processing. arXiv preprint arXiv:2001.04643, 2020.
[7] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org, 2017.
[8] Philippe Esling, Adrien Bitton, et al. Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501, 2018.
[9] Harvey Fletcher, E. Donnell Blackham, and O. Norman Geertsen. Quality of violin, viola, 'cello, and bass-viol tones. I. The Journal of the Acoustical Society of America, 37(5):851-863, 1965.
[10] Harvey Fletcher and Larry C. Sanders. Quality of violin vibrato tones. The Journal of the Acoustical Society of America, 41(6):1534-1544, 1967.
[11] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.
[12] S. Imai. Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228, 1979.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[15] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.
[16] Max V. Mathews and J. Kohut. Electronic simulation of violin resonances. The Journal of the Acoustical Society of America, 53(6):1620-1626, 1973.
[17] Robert McAulay and Thomas Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(4):744-754, 1986.
[18] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[19] Fanny Roche, Thomas Hueber, Samuel Limier, and Laurent Girin. Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096, 2018.
[20] Axel Roebel and Xavier Rodet. Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain, September 2005.
[21] Oriol Romani Picas, Hector Parra Rodriguez, Dara Dabiri, Hiroshi Tokuda, Wataru Hariya, Koji Oishi, and Xavier Serra. A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society, 2015.
[22] Andy M. Sarroff and Michael A. Casey. Musical audio synthesis using autoencoding neural nets. In ICMC, 2014.
[23] Xavier Serra. A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition. 1989.
[24] Xavier Serra et al. Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122, 1997.
[25] Wayne Slawson. The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141, 1981.
[26] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491, 2015.
[27] K. Subramani, P. Rao, and A. D'Hooge. VaPar Synth - a variational parametric model for audio synthesis. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 796-800, 2020.
[28] Lonce Wyse. Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808, 2018.