DrumGAN: Synthesis of Drum Sounds With Timbral Feature Conditioning Using Generative Adversarial Networks
Javier Nistal
Sony CSL, Paris, France
Stefan Lattner
Sony CSL, Paris, France
Gaël Richard
LTCI, Télécom Paris, Institut Polytechnique de Paris, France
ABSTRACT
Synthetic creation of drum sounds (e.g., in drum machines) is commonly performed using analog or digital synthesis, allowing a musician to sculpt the desired timbre by modifying various parameters. Typically, such parameters control low-level features of the sound and often have no musical meaning or perceptual correspondence. With the rise of Deep Learning, data-driven processing of audio emerges as an alternative to traditional signal processing. This new paradigm allows controlling the synthesis process through learned high-level features or by conditioning a model on musically relevant information. In this paper, we apply a Generative Adversarial Network to the task of audio synthesis of drum sounds. By conditioning the model on perceptual features computed with a publicly available feature extractor, intuitive control is gained over the generation process. The experiments are carried out on a large collection of kick, snare, and cymbal sounds. We show that, compared to a specific prior work based on a U-Net architecture, our approach considerably improves the quality of the generated drum samples, and that the conditional input indeed shapes the perceptual characteristics of the sounds. Also, we provide audio examples and release the code used in our experiments.
1. INTRODUCTION
Drum machines are electronic musical instruments that create percussion sounds and allow arranging them in patterns over time. The sounds produced by some of these machines are created synthetically using analog or digital signal processing. For example, a simple snare drum can be synthesized by generating noise and shaping its amplitude envelope [1] or, a bass drum, by combining low-frequency harmonic sine waves with dense mid-frequency components [2]. The characteristic sound of this synthesis process contributed to the cult status of electronic drum machines in the '80s. Code: https://github.com/SonyCSLParis/DrumGAN
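For illustration, the following minimal sketch shows the kind of classical drum synthesis referred to above: a snare as noise shaped by an amplitude envelope [1] and a bass drum as a decaying low-frequency sine with a downward pitch sweep [2]. It is not taken from the paper or its codebase, and all parameter values are purely illustrative.

```python
# Illustrative classical drum synthesis (not from the paper): enveloped noise
# for a snare, a decaying swept sine for a kick. Parameter values are made up.
import numpy as np

SR = 16000                 # sample rate also used later in the experiments
t = np.arange(SR) / SR     # one second of time stamps

def snare(decay=30.0):
    # White noise shaped by an exponential amplitude envelope.
    return np.random.randn(SR) * np.exp(-decay * t)

def kick(f_start=120.0, f_end=45.0, decay=8.0):
    # Low-frequency sine whose instantaneous frequency sweeps downwards.
    freq = f_end + (f_start - f_end) * np.exp(-20.0 * t)
    phase = 2 * np.pi * np.cumsum(freq) / SR
    return np.sin(phase) * np.exp(-decay * t)

if __name__ == "__main__":
    from scipy.io import wavfile
    wavfile.write("snare.wav", SR, snare().astype(np.float32))
    wavfile.write("kick.wav", SR, kick().astype(np.float32))
```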
Data-driven processing of audio using Deep Learning (DL) emerged as an alternative to traditional signal processing. This new paradigm allows us to steer the synthesis process by manipulating learned higher-level latent variables, which provide a more intuitive control compared to conventional drum machines and synthesizers. In addition, as DL models can be trained on arbitrary data, comprehensive control over the generation process can be enabled without limiting the sound characteristic to that of a particular synthesis process. For example, Generative Adversarial Networks (GANs) allow controlling drum synthesis through their latent input noise [3], and Variational Autoencoders (VAEs) can be used to create variations of existing sounds by manipulating their position in a learned timbral space [4]. However, an essential issue when learning latent spaces in an unsupervised manner is the missing interpretability of the learned latent dimensions. This can be a disadvantage in music applications, where comprehensible interaction lies at the core of the creative process. Therefore, it is desirable to develop a system which offers expressive and musically meaningful control over its generated output. A way to achieve this, provided that suitable annotations are available, is to feed higher-level conditioning information to the model. The user can then manipulate this conditioning information in the generation process. Along this line, some works on sound synthesis have incorporated pitch-conditioning [5, 6], or categorical semantic tags [7], capturing rather abstract sound characteristics. In the case of drum pattern generation, there are neural-network approaches that can create full drum tracks conditioned on existing musical material [8].

In a recent study [9], a U-Net is applied to neural drum sound synthesis, conditioned on continuous perceptual features describing timbre (e.g., boominess, brightness, depth). These features are computed using the Audio Commons timbre models (https://github.com/AudioCommons/ac-audio-extractor). Compared to prior work, this continuous feature conditioning (instead of using categorical labels) for audio synthesis provides more fine-grained control to a musician. However, this U-Net approach learns a deterministic mapping of the conditioning input information to the synthesized audio. This limits the model's capacity to capture the variance in the data, resulting in a sound quality that does not seem acceptable in a professional music production scenario.

In this paper, we build upon the same idea of conditional generation using continuous perceptual features, but instead of a U-Net, we employ a Progressive Growing Wasserstein GAN (PGAN) [10]. Our contribution is two-fold. First, we employ a PGAN on the task of conditional drum sound synthesis. Second, we use an auxiliary regression loss term in the discriminator as a means to control audio generation based on the conditional features. We are not aware of previous work attempting continuous, sparse conditioning of GANs for musical audio generation. We conduct our experiments on a dataset of a large variety of kick, snare, and cymbal sounds comprising approximately 300k samples. Also, we investigate whether the feature conditioning improves the quality and coherence of the generated audio. For that, we perform an extensive experimental evaluation of our model, both in conditional and unconditional settings.
We evaluate our models by comparing the Inception Score (IS), the Fréchet Audio Distance (FAD), and the Kernel Inception Distance (KID). Furthermore, we evaluate the perceptual feature conditioning by testing if changing the value of a specific input feature yields the expected change of the corresponding feature in the generated output. Audio samples of DrumGAN can be found on the accompanying website (see Section 4). The paper is organized as follows: In Section 2 we review previous work on audio synthesis, and in Section 3 we describe the experiment setup. Results are presented in Section 4, and we conclude in Section 5.
2. PREVIOUS WORK
Deep generative modeling is a topic that has gained a lot of interest during the last years. This has been possible partly due to the growing amount of large-scale datasets of different modalities [5, 11] coupled with groundbreaking research on generative neural networks [10, 12–15]. In addition to the methods listed in the introduction focusing on drum sound generation, a number of other studies have applied deep learning methods to address general audio synthesis. Autoregressive models for raw audio have been very influential in the beginning of this line of research, and still achieve state of the art in different audio synthesis tasks [5, 12, 16, 17]. Approaches using Variational Auto-Encoders [13] allow manipulating the audio in latent spaces learnt i) directly from the audio data [4], ii) by imposing musically meaningful priors over the structure of these spaces [7, 18, 19], or iii) by restricting such latent codes to discrete representations [20]. GANs have been extensively applied to synthesis of speech [21] and domain adaptation [22, 23] tasks. The first of its kind applying adversarial learning to the synthesis of musical audio is WaveGAN [3]. This architecture was shown to synthesize audio from a variety of sound sources, including drums, in an unconditional way. Recent improvements in the quality and training stability of GANs [10, 24, 25] resulted in methods that outperform WaveNet baselines on the task of audio synthesis of musical notes using sparse conditioning labels representing the pitch content [6]. A few other works have used GANs with rather strong conditioning on prior information for tasks like Mel-spectrum inversion [26] or audio domain adaptation [27, 28]. Recently, other promising related research incorporates prior domain knowledge into the neural network, by embedding differentiable signal processing blocks directly into the architecture [29].
3. EXPERIMENT SETUP
In this section, details are given about the conducted experiment, including the data used, the model architecture and training details, as well as the metrics employed for evaluation.
In the following, we briefly describe the drum dataset used throughout our experiments, as well as the Audio Commons feature models, with which we extract perceptual features from the dataset.
For this work, we make use of an internal, non-publicly available dataset of approximately 300k one-shot audio samples aligned and distributed across a balanced set of kick, snare, and cymbal sounds. The samples originally have a sample rate of 44.1kHz and variable lengths. In order to make the task simpler, each sample is shortened to a duration of one second and down-sampled to a sample rate of 16kHz. For each audio sample, we extract perceptual features with the Audio Commons timbre models (see Section 3.1.2). We perform a 90% / 10% split of the dataset for validation purposes. The model is trained on the real and imaginary components of the Short-Time Fourier Transform (STFT), which has been shown to work well in [30]. We compute the STFT using a window size of 2048 samples and 75% overlapping. The generated spectrograms are then simply inverted back to the signal domain using the inverse STFT.
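A hedged sketch of this preprocessing is given below: one-second clips at 16kHz, an STFT with a 2048-sample window and 75% overlap, and the real and imaginary parts stacked as two channels. The exact code in the paper's repository may differ; this is just the pipeline as described above.

```python
# Sketch of the preprocessing described in the text (assumed, not the official code):
# 1-second clips at 16 kHz, STFT with 2048-sample window and 75% overlap,
# real and imaginary components stacked as a 2-channel input for the GAN.
import librosa
import numpy as np

SR, N_FFT, HOP = 16000, 2048, 512   # 75% overlap -> hop = 2048 // 4

def preprocess(path):
    audio, _ = librosa.load(path, sr=SR, mono=True)    # resample to 16 kHz
    audio = librosa.util.fix_length(audio, size=SR)     # pad/trim to one second
    spec = librosa.stft(audio, n_fft=N_FFT, hop_length=HOP, win_length=N_FFT)
    return np.stack([spec.real, spec.imag], axis=0)     # shape: (2, freq, time)

def invert(two_channel_spec):
    # Simple inverse STFT back to the signal domain, as mentioned above.
    spec = two_channel_spec[0] + 1j * two_channel_spec[1]
    return librosa.istft(spec, hop_length=HOP, win_length=N_FFT)
```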
The Audio Commons project implements a collection of perceptual models of features that describe high-level timbral properties of the sound. These features are designed from the study of popular timbre ratings given to a collection of sounds obtained from Freesound (https://freesound.org/). The models are built by combining existing low-level features found in the literature (e.g., spectral centroid, dynamic range, spectral energy ratios, etc.), which correlate with the target properties enumerated below. All features are defined in the range [0-100]. We employ these features as conditioning to the generative model. For more information, we direct the reader to the project deliverable.

• brightness: refers to the clarity and amount of high-pitched content in the analyzed sound. It is computed from the spectral centroid and the spectral energy ratio.

• hardness: refers to the stiffness or solid nature of the acoustic source that could have produced a sound. It is estimated using a linear regression model on spectral and temporal features extracted from the attack segment of a sound event.

• depth: refers to the sensation of perceiving a sound coming from an acoustic source beneath the surface. A linear regression model estimates depth from the spectral centroid of the lower frequencies, the proportion of low-frequency energy and the low-frequency limit of the audio excerpt.

• roughness: refers to the irregular and uneven sonic texture of a sound. It is estimated from the interaction of peaks and nearby bins within frequency spectral frames. When neighboring frequency components have peaks with similar amplitude, the sound is said to produce a 'rough' sensation.

• boominess: refers to a sound with deep and loud resonant components.

• warmth: refers to sounds that induce a sensation analogous to that caused by the physical temperature.

• sharpness: refers to a sound that might cut if it were to take on physical form.

A description of the calculation method for this feature is not available to the authors at the current time.
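To make the flavour of these descriptors concrete, the sketch below computes a rough brightness-like proxy from the spectral centroid and a high-frequency energy ratio. It is explicitly not the Audio Commons implementation, which combines such low-level descriptors through regressions fitted to human ratings; the weighting, the cutoff frequency, and the mapping to [0, 100] here are arbitrary assumptions.

```python
# Rough, illustrative proxy for a "brightness"-like descriptor only; NOT the
# Audio Commons model. Weighting and cutoff are arbitrary choices.
import librosa
import numpy as np

def brightness_proxy(audio, sr=16000, split_hz=1500.0):
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr).mean()
    spec = np.abs(librosa.stft(audio)) ** 2
    freqs = librosa.fft_frequencies(sr=sr)
    high_ratio = spec[freqs >= split_hz].sum() / (spec.sum() + 1e-9)
    # Combine the two descriptors and map to the [0, 100] range used above.
    raw = 0.5 * (centroid / (sr / 2)) + 0.5 * high_ratio   # arbitrary weighting
    return float(np.clip(100.0 * raw, 0.0, 100.0))
```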
In the following, we introduce the architecture and training of DrumGAN, and briefly describe the baseline model against which DrumGAN is evaluated.

Generative Adversarial Networks (GANs) are a family of training procedures inspired by game theory, in which a generative model competes against a discriminative adversary that learns to distinguish whether a sample is real or fake [14]. The generative network, or Generator (G), estimates a distribution p_g over the data x by learning a mapping of an input noise p_z to data space as G_θ(z), where G_θ is a neural network implementing a differentiable function with parameters θ. Inversely, the discriminator D_β(x), with parameters β, is trained to output a single scalar indicating whether the input comes from the real data p_r or from the generated distribution p_g. Simultaneously, G is trained to produce samples that are identified as real by the discriminator. Competition drives both networks until an equilibrium point is reached and the generated examples are indistinguishable from the original data. For a Wasserstein GAN, as used in our experiments, the training criterion is formally defined as

\min_G \max_D \Gamma(D, G) = \frac{1}{N} \sum_i \left[ D(x_i) - D(G(z_i)) \right]. \quad (1)
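A minimal sketch of Eq. (1) for one mini-batch is shown below, assuming hypothetical PyTorch modules G and D; the gradient penalty and the auxiliary feature loss used by DrumGAN are described further down and are not included here.

```python
# Minimal sketch of the Wasserstein objective in Eq. (1) for one mini-batch.
# G and D are assumed PyTorch modules; this is not the paper's actual code.
import torch

def wasserstein_losses(G, D, real, z):
    fake = G(z)
    d_loss = -(D(real).mean() - D(fake.detach()).mean())  # critic maximizes D(x) - D(G(z))
    g_loss = -D(fake).mean()                               # generator minimizes -D(G(z))
    return d_loss, g_loss
```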
Figure 1. Proposed architecture for DrumGAN (see Section 3.2 for details).
In the proposed architecture, the input to G is a concatenation of the 7 conditioning features c, described in Section 3.1.2, and a random vector z with 128 components sampled from an independent Gaussian distribution. The resulting vector is fed through a stack of convolutional and box up-sampling blocks to generate the output signal x = G_θ(z; c). In order to turn the 1D input vector into a 4D convolutional input, it is zero-padded in the time- and frequency-dimension (i.e., placed in the middle of the convolutional input with 128 + 7 convolutional maps). As depicted in Figure 1, the generator's input block performs this zero-padding followed by two convolutional layers with ReLU non-linearity. Each scale block is composed of one box up-sampling step at the input and two convolutional layers with filters of size (3, 3). The number of feature maps decreases from low to high resolution as {256, 128, 128, 128, 64, 32}. Up-sampling of the temporal dimension is only performed after the 3rd scale block. We use a Leaky ReLU as activation function and apply pixel normalization after every convolutional step (i.e., normalizing the norm over the output maps at each position). The discriminator D is composed of convolutional and down-sampling blocks, mirroring G's configuration. Given a batch of either real or generated STFT audio, D estimates the Wasserstein distance between the real and generated distributions [24], and predicts the perceptual features accompanying the input audio in the case of a real batch, or those used for conditioning in the case of generated audio. In order to promote the usage of the conditioning information by G, we add an auxiliary Mean Squared Error (MSE) loss term to the objective function, following a similar approach as in [31]. We use a gradient penalty of 10.0 to satisfy the Lipschitz continuity condition of Wasserstein GANs. The weights are initialized to zero and we apply layer-wise normalization at run-time using He's constant [32] to promote equalized learning. A mini-batch standard deviation layer before the output block of D encourages G to generate more variety and, therefore, reduces mode collapse [25].

Training follows the procedure of Progressive Growing of GANs (PGANs), first used for image generation [10], which has been successfully applied to audio synthesis of pitched sounds [6]. In a PGAN, the architecture is built dynamically during training. The process is divided into training iterations that progressively introduce new blocks to both the Generator and the Discriminator, as depicted in Figure 1. While training, a blending parameter α progressively fades in the gradient derived from the new blocks, minimizing possible perturbation effects. The models are trained for 1.1M iterations on batches of 30, 30, 20, 20, 12, and 12 samples, respectively, for each scale. Each scale is trained during 200k iterations except the last one, which is trained up to 300k iterations. We employ Adam as the optimization method and a learning rate of 0.001 for both networks.

As mentioned in the introduction, we compare DrumGAN against a previous work tackling the exact same task (i.e., neural synthesis of drum sounds, conditioned on the same perceptual features described in Section 3.1.2), but using a U-Net architecture operating in the time domain [9]. The U-Net model is trained to deterministically map the conditioning features (and an envelope of the same size as the output) to the output. The dataset used thereby consists of drum samples obtained from Freesound, which includes kicks, snares, cymbals, and other percussion sounds (referred to as the Freesound drum subset in the following).
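The sketch below summarizes the discriminator objective described above: a Wasserstein critic term with a gradient penalty (weight 10) plus an auxiliary MSE term on the seven perceptual features. It assumes a hypothetical D that returns both a scalar score and a feature prediction; the official DrumGAN code may organize this differently.

```python
# Hedged sketch of the discriminator objective described in the text (assumed
# interfaces, not the paper's code): WGAN critic loss + gradient penalty +
# auxiliary MSE on the 7 perceptual features.
import torch
import torch.nn.functional as F

def d_loss(D, real, fake, real_feats, cond_feats, gp_weight=10.0):
    score_real, pred_real = D(real)
    score_fake, pred_fake = D(fake.detach())
    wasserstein = score_fake.mean() - score_real.mean()

    # Auxiliary regression: predict annotated features for real audio and the
    # conditioning features for generated audio.
    aux = F.mse_loss(pred_real, real_feats) + F.mse_loss(pred_fake, cond_feats)

    # Gradient penalty on random interpolates between real and fake samples.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mix = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    score_mix, _ = D(mix)
    grads = torch.autograd.grad(score_mix.sum(), mix, create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    return wasserstein + aux + gp_weight * gp
```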
Assessing the quality of synthesized audio is hard to formalize, making the evaluation of generative models for audio a challenging task. In the particular case of GANs, where no explicit likelihood maximization exists, a common evaluation approach is to measure the model's performance in a variety of surrogate tasks [33]. As described in the following, we evaluate our models against a diverse set of metrics that capture distinct aspects of the model's performance.
The Inception Score (IS) [25] penalizes models that generate examples that are not classified into a single class with high confidence, as well as models whose examples belong to only a few of all the possible classes. It is defined as the mean KL divergence between the conditional class probabilities p(y|x) and the marginal distribution p(y), using the class predictions of an Inception classifier (see Eq. 2). We train our Inception Net variant (https://github.com/pytorch/vision/blob/master/torchvision/models/inception.py) to classify kicks, snares and cymbals from Mel-scaled magnitude STFT spectrograms, using the same 90% / 10% train/validation split used throughout our experiments. As additional targets, we also train the model to predict the extracted perceptual features described in Section 3.1.2 (using a mean-squared error cost).

IS = \exp\left( \mathbb{E}_x \left[ \mathrm{KL}\left( p(y|x) \,\|\, p(y) \right) \right] \right) \quad (2)
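As a small illustration of Eq. (2), the sketch below computes the IS from a matrix of class probabilities p(y|x) produced by such a classifier; it is a generic implementation of the formula, not the paper's evaluation code.

```python
# Sketch of Eq. (2): Inception Score from a matrix of class probabilities.
import numpy as np

def inception_score(p_yx, eps=1e-12):
    # p_yx: array of shape (num_samples, num_classes), rows sum to 1.
    p_y = p_yx.mean(axis=0, keepdims=True)                              # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)  # KL per sample
    return float(np.exp(kl.mean()))
```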
The Fréchet Audio Distance (FAD) compares the statistics of real and generated data computed from an embedding layer of a pre-trained VGG-like model [34] (https://github.com/google-research/google-research/tree/master/frechet_audio_distance). FAD fits a continuous multivariate Gaussian to the output of the embedding layer for real and generated data, and the distance between these is calculated as

\mathrm{FAD} = \| \mu_r - \mu_g \|^2 + \mathrm{tr}\left( \Sigma_r + \Sigma_g - 2 \sqrt{\Sigma_r \Sigma_g} \right) \quad (3)

where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariances of the embedding of real and generated data, respectively. The lower the FAD, the smaller the distance between the distributions of real and generated data. FAD is robust against noise, consistent with human judgments, and more sensitive to intra-class mode dropping than IS.

The Kernel Inception Distance (KID) measures the dissimilarity between samples drawn independently from the real and generated distributions [35]. It is defined as the squared Maximum Mean Discrepancy (MMD) between representations of the last layer of the Inception model (described in Section 3.3.1). A lower MMD means that the generated distribution p_g and the real distribution p_r are close to each other. We employ the unbiased estimator of the squared MMD [36] between m samples x ∼ p_r and n samples y ∼ p_g, for some fixed characteristic kernel function k, defined as

\mathrm{MMD}^2(X, Y) = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j). \quad (4)

Here, we use an inverse multi-quadratic kernel (IMQ) k(x, y) = 1 / (1 + \|x - y\|^2 / \gamma^2) with γ = 8 [37], which has a heavy tail and, hence, is sensitive to outliers. We borrow this metric from the Computer Vision literature and apply it to the audio domain. We train a separate Inception model on the Freesound drum subset used for the U-Net baseline experiments (see Section 3.2.4). This is done to allow comparison of the Inception-based metrics with DrumGAN. Since the Freesound drum subset doesn't contain annotations of the instrument type, we train this variant on just the feature regression task, and restrict our comparison to KID and FAD, as these metrics do not compare class probabilities but embedding distributions.

We follow the methodology proposed by [9] for evaluating the feature control coherence. The goal is to assess whether increasing or decreasing a specific feature value of the conditioning input yields the corresponding change of that feature in the synthesized audio. To this end, a specific feature i is set to a low, a medium, and a high value, keeping the other features and the input noise fixed. The resulting outputs x_i^low, x_i^mid, x_i^high are then evaluated with the Audio Commons Timbre Models (yielding features f_{x_i}). Then, it is assessed whether the feature of interest changed as expected (i.e., f_{x_i^low} < f_{x_i^mid} < f_{x_i^high}). More precisely, three conditions are evaluated: E1: f_{x_i^low} < f_{x_i^high}, E2: f_{x_i^mid} < f_{x_i^high}, and E3: f_{x_i^low} < f_{x_i^mid}. We perform these three tests multiple times for each feature, always with different random input noise and different configurations of the other features (sampled from the evaluation set). The resulting accuracies are reported.
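A hedged sketch of this feature-coherence protocol is given below. The `generate` and `extract_features` callables stand in for DrumGAN's generator and the Audio Commons extractor and are assumptions, not the paper's actual API; the low/mid/high settings (20, 50, 80 on the [0, 100] scale) are illustrative, since the exact values are not stated here.

```python
# Sketch of the feature-coherence tests E1-E3 described above.
# `generate(z, cond)` and `extract_features(audio)` are assumed interfaces.
import numpy as np

LOW, MID, HIGH = 0, 1, 2  # indices for the three settings of the probed feature

def coherence_tests(generate, extract_features, base_feats, z, feat_idx,
                    values=(20.0, 50.0, 80.0)):
    """Run tests E1-E3 for one feature index; `values` are illustrative only."""
    measured = []
    for v in values:                            # low, mid, high setting
        cond = np.array(base_feats, dtype=float)
        cond[feat_idx] = v                      # vary only the probed feature
        audio = generate(z, cond)               # synthesize with fixed noise z
        measured.append(extract_features(audio)[feat_idx])
    e1 = measured[LOW] < measured[HIGH]
    e2 = measured[MID] < measured[HIGH]
    e3 = measured[LOW] < measured[MID]
    return e1, e2, e3
```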
4. RESULTS AND DISCUSSION
In this section, we briefly describe our subjective impression when listening to the model output, and we give an extended discussion of the quantitative analysis, including the comparison with the baseline U-Net architecture.
The results of the qualitative experiments discussed in this section can be found on the accompanying website (https://sites.google.com/view/drumgan). In general, conditional DrumGAN seems to have better quality than its unconditional counterpart and substantially better than the U-Net baseline (see Section 3.2.4). In the absence of more reliable baselines, we argue that the perceived quality of DrumGAN is comparable to that of previous state-of-the-art work on adversarial audio synthesis of drums [3].

We also perform radial and spherical interpolation experiments (with respect to the Gaussian prior) between random points selected in the latent space of DrumGAN. Both interpolations yield smooth and perceptually linear transitions in the audio domain. We notice that radial interpolation tends to change the percussion type (i.e., kick, snare, cymbal) of the output, while spherical interpolation affects other properties (like within-class timbral characteristics and envelope) of the synthesized audio. This gives a hint on how the latent manifold is structured.
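The paper does not spell out its interpolation formulas; the sketch below shows one plausible reading, decomposing a latent vector into its norm (radius) and its direction on the unit sphere, so that "radial" interpolation blends the radii and "spherical" interpolation (slerp) blends the directions. Both the decomposition and the function names are assumptions for illustration only.

```python
# One plausible (assumed) formulation of radial vs. spherical latent interpolation.
import numpy as np

def slerp(u0, u1, t):
    # Spherical interpolation between two unit vectors (assumes they are not
    # parallel or anti-parallel, otherwise sin(omega) is zero).
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    return (np.sin((1 - t) * omega) * u0 + np.sin(t * omega) * u1) / np.sin(omega)

def interpolate(z0, z1, t, mode="spherical"):
    r0, r1 = np.linalg.norm(z0), np.linalg.norm(z1)
    u0, u1 = z0 / r0, z1 / r1
    if mode == "radial":
        # Keep the direction of z0 and move only along the radius.
        return ((1 - t) * r0 + t * r1) * u0
    # Keep the radius of z0 and move only along the sphere.
    return r0 * slerp(u0, u1, t)
```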
Table 1. Results of Inception Score (IS, higher is better), Kernel Inception Distance (KID, lower is better) and Fréchet Audio Distance (FAD, lower is better), scored by DrumGAN under different conditioning settings (train feats, val feats, rand feats), against real data and the unconditional baseline. The metrics are computed over a large set of generated samples, except for val feats, where the validation set size is used.

Table 2. Results of Kernel Inception Distance (KID) and Fréchet Audio Distance (FAD), scored by the U-Net baseline [9] when conditioning the model on feature configurations from the real data (real feats) and on randomly sampled features (rand feats), against real data. The metrics are computed over the full Freesound drum subset.

Table 1 shows the DrumGAN results for the Inception Score (IS), the Kernel Inception Distance (KID), and the Fréchet Audio Distance (FAD), as described in Section 3.3. These metrics are calculated on the synthesized drum sounds of the model, based on different conditioning settings. Besides the unconditional setting of DrumGAN (unconditional), we use feature configurations from the train set (train feats), the validation set (val feats), and features randomly sampled from a uniform distribution (rand feats). The IS of DrumGAN samples is close to that of the real data in most settings. This means that the model outputs are clearly assignable to either of the respective percussion-type classes (i.e., low entropy for kick, snare, and cymbal posteriors), and that it doesn't omit any of them (i.e., high entropy for the marginal over all classes). The IS is slightly reduced for random conditioning features, indicating that using uncommon conditioning configurations makes the outputs more ambiguous with respect to specific percussion types. While FAD is a measure for the perceived quality of the individual sounds (measuring co-variances within data instances), the KID reflects whether the generated data overall follows the distribution of the real data. Therefore, it is interesting to see that rand feats cause outputs which overall do not follow the distribution of the real data (i.e., high KID), but the individual outputs are still plausible percussion samples (i.e., low FAD). This quantitative result is in line with the perceived quality of the generated samples (see Section 4.1).
In the unconditional setting, both KID and FAD are worse, indicating that feature conditioning helps the model to generate data following the true distribution, both overall and in individual samples.

Table 2 shows the evaluation results for the U-Net architecture (see Section 3.2.4). As the train/validation split for the Freesound drum subset (on which the U-Net was trained) is not available to the authors, the U-Net model is tested using the features of the full Freesound drum subset (real feats), as well as random features. Also, we do not report the IS for the U-Net architecture, as it was trained on data without percussion-type labels, making it impossible to train the Inception model on such targets. As a baseline, all metrics are also evaluated on the real data on which the respective models were trained. While evaluation on the real data is straightforward for the IS (i.e., just using the original data instead of the generated data to obtain the statistics), both KID and FAD are measures usually comparing the statistics between features of real and generated data. Therefore, for the real data baseline, we split the real data into two equal parts and compare those with each other in order to obtain KID and FAD. The performance of the U-Net approach on both KID and FAD is considerably worse than that of DrumGAN. While the KID for real feats is still comparable to that of DrumGAN (indicating a distribution similar to that of the real data), the high FAD indicates that the generated samples are not perceptually similar to the real samples. When using random feature combinations, this trend is accentuated moderately in the case of FAD, and particularly in the case of the KID, which reaches its maximum value in this setting. This is, however, intelligible, as the output of the U-Net depends only on the input features in a deterministic way. Therefore, it is to be expected that the distribution over output samples changes fully when fully changing the distribution of the inputs.

Table 3. Mean accuracies for the feature coherence tests E1, E2, and E3 (columns, reported separately for the baseline U-Net [9] and for DrumGAN) on the generated samples; rows list the individual features (brightness, depth, boominess, etc.) and their average.

Table 3 shows the accuracy of the three feature coherence tests explained in Section 3.3.4. Note that, as both models were trained on different data, the figures of the two models are not directly comparable. However, also reporting the figures of the U-Net approach should provide some context on the performance of our proposed model. In addition, as both works use the same feature extractors and claim that the conditional features are used to shape the same characteristics of the output, we consider the figures from the U-Net approach a useful reference. We can see that, for about half the features, the U-Net approach reaches near-perfect accuracy. Referring to the descriptions of how the features are computed, it seems that the U-Net approach reaches particularly high accuracies for features which are computed by looking at the global frequency distribution of the audio sample, taking into account spectral centroid and relations between high and low frequencies (e.g., brightness and depth). U-Net performs considerably worse for features which take into account the temporal evolution of the sound (e.g., hardness) or more complex relationships between frequencies (e.g., roughness). While DrumGAN performs worse on average on these tests, the results seem to be more consistent, with fewer very high, but also fewer rather low, accuracy values (note that the random-guessing baseline is 0.5 for all the tests).
The reason for not performing better on average may lie in the fact that DrumGAN is trained in an adversarial fashion, where the dataset distribution is enforced in addition to obeying the conditioned characteristics. In contrast, in the U-Net approach the model is trained deterministically to map the conditioning features to the output, which makes it easier to satisfy the simpler characteristics, like generating a lot of low- or high-frequency content. However, this deterministic mapping results in a lower audio quality and a worse approximation to the true data distribution, as can be seen in the KID and FAD figures described above.
5. CONCLUSIONS AND FUTURE WORK
In this work, we performed percussive sound synthesis using a GAN architecture that enables steering the synthesis process using musically meaningful controls. To this end, we collected a dataset of approximately 300k audio samples containing kicks, snares, and cymbals. We extracted a set of timbral features, describing high-level semantics of the sound, and used these as conditional input to our model. We encouraged the generator to use the conditioning information by performing an auxiliary feature regression task in the discriminator and adding the corresponding MSE loss term to the objective function. In order to assess whether the feature conditioning improves the generative process, we trained a model in a completely unsupervised manner for comparison. We evaluated the models by comparing various metrics, each reflecting different characteristics of the generation process. Additionally, we compared the coherence of the feature control against previous work. Results showed that DrumGAN generates high-quality drum samples and provides meaningful control over the audio generation. The conditioning information was proven useful and enabled the network to better approximate the real distribution of the data. As future work, we will focus on scaling the model to work with audio production standards (e.g., 44.1kHz sample rate, stereo audio), and implement a plugin that can be used in a conventional Digital Audio Workstation (DAW).

6. REFERENCES
[1] R. Gordon, "Synthesizing drums: The snare drum," Sound On Sound.
[2] R. Gordon, "Synthesizing drums: The bass drum," Sound On Sound.
[3] C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis," in Proc. of the 7th International Conference on Learning Representations, ICLR, May 2019.
[4] C. Aouameur, P. Esling, and G. Hadjeres, "Neural drum machine: An interactive system for real-time synthesis of drum sounds," in Proc. of the 10th International Conference on Computational Creativity, ICCC, June 2019.
[5] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, "Neural audio synthesis of musical notes with wavenet autoencoders," in Proc. of the 34th International Conference on Machine Learning, ICML, Sydney, NSW, Australia, Aug. 2017.
[6] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial neural audio synthesis," in Proc. of the 7th International Conference on Learning Representations, ICLR, May 2019.
[7] P. Esling, N. Masuda, A. Bardet, R. Despres, and A. Chemla-Romeu-Santos, "Universal audio synthesizer control with normalizing flows," Journal of Applied Sciences, 2019.
[8] S. Lattner and M. Grachten, "High-level control of drum track generation using learned patterns of rhythmic interaction," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, New Paltz, NY, USA, Oct. 2019.
[9] A. Ramires, P. Chandna, X. Favory, E. Gómez, and X. Serra, "Neural percussive synthesis parameterised by high-level timbral features," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, May 2020.
[10] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," in International Conference on Learning Representations, ICLR, May 2018.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, June 2009.
[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in Proc. of the 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, Sept. 2016.
[13] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. of the 2nd International Conference on Learning Representations, ICLR, Banff, AB, Canada, Apr. 2014.
[14] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. of the 28th International Conference on Neural Information Processing Systems, NIPS, Montreal, Quebec, Canada, Dec. 2014.
[15] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," in Proc. of the 33rd International Conference on Machine Learning, ICML, New York City, NY, USA, June 2016.
[16] S. Vasquez and M. Lewis, "MelNet: A generative model for audio in the frequency domain," CoRR, vol. abs/1906.01083, 2019.
[17] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Calgary, AB, Canada, Apr. 2018.
[18] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton, "Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces," in Proc. of the 19th International Society for Music Information Retrieval Conference, ISMIR, Paris, France, Sept. 2018.
[19] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton, "Generative timbre spaces with variational audio synthesis," in Proc. of the 21st International Conference on Digital Audio Effects, DAFx-18, Aveiro, Portugal, Sept. 2018.
[20] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, "Jukebox: A generative model for music," CoRR, vol. abs/2005.00341, 2020.
[21] Y. Saito, S. Takamichi, and H. Saruwatari, "Statistical parametric speech synthesis incorporating generative adversarial networks," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 26, pp. 84–96, 2018.
[22] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," CoRR, vol. abs/1711.11293, 2017.
[23] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse, "TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer," in Proc. of the 7th International Conference on Learning Representations, ICLR, New Orleans, LA, USA, May 2019.
[24] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Proc. of the International Conference on Neural Information Processing Systems, NIPS, Long Beach, CA, USA, Dec. 2017.
[25] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proc. of the International Conference on Neural Information Processing Systems, NIPS, Barcelona, Spain, Dec. 2016.
[26] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. of the Annual Conference on Neural Information Processing Systems, NIPS, Vancouver, BC, Canada, Dec. 2019.
[27] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, "A multi-discriminator CycleGAN for unsupervised non-parallel speech domain adaptation," in Proc. of the 19th Annual Conference of the International Speech Communication Association, Interspeech, Hyderabad, India, Sept. 2018.
[28] D. Michelsanti and Z. Tan, "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification," in Proc. of the 18th Annual Conference of the International Speech Communication Association, Interspeech, Stockholm, Sweden, Aug. 2017.
[29] J. H. Engel, L. Hantrakul, C. Gu, and A. Roberts, "DDSP: Differentiable digital signal processing," in Proc. of the 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, Apr. 2020.
[30] J. Nistal, S. Lattner, and G. Richard, "Comparing representations for audio synthesis using generative adversarial networks," in Proc. of the 28th European Signal Processing Conference, EUSIPCO 2020, Amsterdam, NL, Jan. 2021.
[31] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in Proc. of the 34th International Conference on Machine Learning, ICML, Sydney, NSW, Australia, Aug. 2017.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in IEEE International Conference on Computer Vision, ICCV, Santiago, Chile, Dec. 2015.
[33] L. Theis, A. van den Oord, and M. Bethge, "A note on the evaluation of generative models," in Proc. of the 4th International Conference on Learning Representations, ICLR, San Juan, Puerto Rico, May 2016.
[34] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet audio distance: A metric for evaluating music enhancement algorithms," CoRR, vol. abs/1812.08466, 2018.
[35] M. Binkowski, D. J. Sutherland, M. Arbel, and A. Gretton, "Demystifying MMD GANs," in Proc. of the 6th International Conference on Learning Representations, ICLR, Vancouver, BC, Canada, Apr. 2018.
[36] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, "A kernel two-sample test," Journal of Machine Learning Research, vol. 13, pp. 723–773, 2012.
[37] R. M. Rustamov, "Closed-form expressions for maximum mean discrepancy with applications to Wasserstein auto-encoders."