Latent Vector Recovery of Audio GANs

Andrew Keyes*  Nicky Bayat*  Vahid Reza Khazaie  Yalda Mohsenzadeh

Western University, London, ON, Canada
Abstract
Advanced Generative Adversarial Networks (GANs) are remarkable in generating intelligible audio from a random latent vector. In this paper, we examine the task of recovering the latent vector of both synthesized and real audio. Previous works recovered latent vectors of given audio through an auto-encoder inspired technique that trains an encoder network either in parallel with the GAN or after the generator is trained. With our approach, we train a deep residual neural network architecture to project audio synthesized by WaveGAN into the corresponding latent space with near identical reconstruction performance. To compensate for the lack of an original latent vector for real audio, we optimize the residual network on the perceptual loss between the real audio samples and the reconstructed audio of the predicted latent vectors. In the case of synthesized audio, the Mean Squared Error (MSE) between the ground truth and recovered latent vector is minimized as well. We further investigated the audio reconstruction performance when several gradient optimization steps are applied to the predicted latent vector. Through our deep neural network based method of training on real and synthesized audio, we are able to predict a latent vector that corresponds to a reasonable reconstruction of real audio. Even though we evaluated our method on WaveGAN, our proposed method is universal and can be applied to any other GAN.
Introduction

Researchers have recently shown an increased interest in mapping samples generated by generative adversarial networks (GANs) back to their original latent vectors. GANs consist of two major components: the generator and the discriminator. The generator receives a random latent vector and generates realistic samples through a min-max optimization game with the discriminator. The discriminator aims to determine whether a sample is real or fake. The competition between the two networks helps to enhance their respective performances. After training, for a given random vector z, GANs generate realistic samples. The problem of GAN inversion refers to the task of recovering a latent vector for a given sample (image, audio, video, etc.) that, when given to the generator, precisely regenerates the target.

Inverse mapping of GANs has a pivotal role in a wide range of scientific and industrial applications. This inversion is of interest because it can be used as a measure for comparing GANs' performances (Creswell and Bharath, 2018). If a GAN is capable of generating samples that can be mapped to more precise recovered latent vectors, it has learned richer information from the training distribution and thus performs better than other GANs. Furthermore, since modifying latent vectors results in meaningful changes in the image/audio domain (Radford et al., 2015), the inverse mapping of GANs can be used in classification or retrieval problems. By transforming the latent vector of an image/audio, one can add styles or features to the generated sample. For example, it is possible to generate more memorable images with GANs by linearly transforming the latent representation (Goetschalckx et al., 2019).
One possible example of style transfer in audio is changing the speaker identity of audio while maintaining the content of the speech (Van Den Oord et al., 2017).

Projecting generated samples back to the latent space is a classic problem in computer vision, where the goal is to recover the latent vector of given images. However, investigating the inverse mapping of GANs is still a continuing concern within the audio domain. While methods introduced in the vision domain can be applied to audio as well, they need to be improved and fine-tuned to this domain in order to produce ideal results.

Four major solutions have been proposed for projecting images back to their latent vector peers. There is a large and growing body of literature that recovers the latent vector through an optimization-based approach (Creswell and Bharath, 2018). This technique updates an initial random z vector using gradient descent until the loss between the sample generated by this vector and the target is less than a desired threshold. Auto-encoder inspired architectures are the second most common solution to this problem (Van Den Oord et al., 2017; Donahue et al., 2016; Dumoulin et al., 2016). A third encoder network is trained to map generated samples to latent vectors. Next, the generator (decoder) uses these vectors to regenerate the target sample.

Deep neural networks are a recent tool for learning the inverse mapping of GANs with high fidelity and speed. They can be trained on a dataset of generated samples and matching latent vectors (Bayat et al., 2020). Hybrid methods are another approach that benefits from the advantages of both mentioned techniques (Zhu et al., 2016).
They first predict an initial latent vector via a deep network and later update it via a few further steps of gradient descent.

It has previously been observed that it is possible to map synthesized images to latent vectors that regenerate results indistinguishable from the target. The main challenge faced by many researchers is the lack of ground truth latent vectors for real samples. In this paper, we propose a residual deep neural network based method that predicts latent vectors of audio samples using a ResNet-18 architecture. Our inverse mapping model is trained on a combination of pixel and perceptual loss. Later, we demonstrate our results on latent vector recovery of both generated and real audio samples and compare them with sole gradient descent optimization and with hybrid methods that apply a few steps of gradient descent after the initial prediction by the deep neural network.

To the best of our knowledge, the experimental work presented here provides one of the first investigations into how to project audio samples to latent representations that can regenerate them. In this paper our contributions are the following:

• We propose a fast deep network technique to recover latent vectors of both fake and real audio.

• We propose a novel perceptual loss within the audio domain.

• We also perform latent vector recovery using gradient descent methods in the audio domain.

• We implement a hybrid method that applies a few steps of gradient descent after predicting the representation with the deep network, and compare the results with the proposed ResNet-18 framework and the sole gradient descent approach.
Related Work

Recent advances in the inverse mapping of adversarial generators, in particular inverse mapping of image generators, have led to the question of whether it is possible to map audio to a latent vector that can regenerate it. Despite its many applications, there has been no detailed investigation of recovering the latent vectors of audio. This paper attempts to show that common approaches used for projecting images to latent space can be used on audio, but they need to be fine-tuned and adjusted for this task. There are four major approaches proposed to invert the generators of GANs.

There is a large volume of published studies on gradient-based inversion methods (Creswell and Bharath, 2018). In order to find the optimal latent vector, they start from a random vector z* sampled from the training distribution. At every iteration, they update this representation to minimize the loss between the generated sample G(z*) and the target until the loss is below a certain threshold. While this method can recover accurate latent vectors, it often takes too long to find the solution. (Lipton and Tripathi, 2017) proposed the stochastic clipping technique to bound the latent vectors to the original distribution while they are being updated by gradient descent: any value larger than the maximum allowed, or smaller than the minimum, is replaced by a random number within the acceptable range.

A considerable amount of literature has been published on auto-encoder inspired architectures (Van Den Oord et al., 2017; Donahue et al., 2016; Dumoulin et al., 2016). They learn the inverse mapping by training an encoder either at the same time as the GAN or after the GAN is trained. The encoder maps generated samples back to the latent domain. The approach of training an encoder alongside the GAN cannot be used for pre-trained GANs.
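The stochastic clipping step is compact enough to sketch directly; the following is a minimal NumPy illustration (the [-1, 1] bounds match WaveGAN's uniform prior, and the function name is ours, not code from the cited work):

```python
import numpy as np

def stochastic_clip(z, low=-1.0, high=1.0, rng=None):
    """Replace out-of-range latent entries with fresh uniform samples.

    Unlike hard clipping, values that drift outside [low, high] during
    gradient descent are re-drawn uniformly inside the range, as proposed
    by Lipton and Tripathi (2017).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = z.copy()
    mask = (z < low) | (z > high)          # entries that left the prior's support
    z[mask] = rng.uniform(low, high, size=mask.sum())
    return z
```

Applied after every gradient update of z*, this re-draws the offending entries instead of pinning them to the boundary, avoiding the degenerate boundary solutions that hard clipping tends to produce.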
In addition, adding a third network adds extra parameters to the training, which might cause over-fitting.

More recent attention has focused on using deep neural networks to predict latent vectors. (Bayat et al., 2020) proposed a deep residual network (ResNet-18) to learn the inverse mapping based on both reconstruction and perceptual loss. Pixel loss between predicted and ground truth latent vectors is used as the reconstruction loss. The perceptual loss between the reconstructed and target images is computed based on the Mean Squared Error (MSE) between features extracted from concatenation layers of a pre-trained FaceNet (Schroff et al., 2015).

Figure 1: The Two Branch Architecture of Our Deep Network Based Inversion Technique for Real Audio.

A number of studies have begun to examine a hybrid method of inverting GANs (Zhu et al., 2016). Using this approach, researchers have been able to benefit from the advantages of both optimization and deep neural network based methods. This approach first predicts the latent vector using a pre-trained deep neural network, then fine-tunes the prediction using a few steps of gradient descent. Even though this method works well on images, we demonstrate that sole deep network predictions perform better than hybrid optimizations.

In this paper, we build on (Bayat et al., 2020) and employ the framework to recover latent vectors for synthetic and real audio. We compare our method with both optimization-based and hybrid inversion methods and show its superiority.
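The two-stage hybrid scheme can be illustrated in a toy setting. In the sketch below, `generator` and `predictor` are stand-ins (the paper's generator is WaveGAN and the predictor is a ResNet-18 encoder), and the gradient is taken numerically purely to keep the example self-contained:

```python
import numpy as np

def hybrid_invert(target, generator, predictor, steps=200, lr=0.05):
    """Hybrid inversion: one forward pass of a predictor network, then a
    short gradient-descent refinement of the latent on reconstruction loss.
    """
    z = predictor(target)                  # stage 1: encoder prediction
    eps = 1e-4
    for _ in range(steps):                 # stage 2: gradient refinement
        base = np.sum((generator(z) - target) ** 2)
        grad = np.zeros_like(z)
        for i in range(z.size):            # forward-difference gradient of the loss
            zp = z.copy()
            zp[i] += eps
            grad[i] = (np.sum((generator(zp) - target) ** 2) - base) / eps
        z = z - lr * grad
    return z
```

On a toy linear generator G(z) = Az with a deliberately crude predictor, the refinement stage drives the latent back to the ground truth, which is the behavior the hybrid method relies on.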
Method

A number of techniques have been developed for recovering latent vectors of images, including optimization-based methods, auto-encoder inspired architectures, and deep neural network models. A major advantage of using deep residual neural networks is that they are incredibly fast compared to gradient-based alternatives. The second advantage of using the ResNet architecture is that we can train it on a dataset of generated audio and their corresponding ground truth latent vectors to learn the inverse mapping. In this work, the ResNet-18 architecture previously proposed for image latent vector recovery (Bayat et al., 2020) is assayed for the audio domain using generated audio and matching latent representations, as well as a novel perceptual loss for audio.
WaveGAN

WaveGAN was the first attempt at applying GANs to unsupervised synthesis of raw-waveform audio (Donahue et al., 2018). The model is capable of producing one-second clips at a frequency of 16 kHz learned from a variety of audio datasets. WaveGAN makes use of transposed convolutions to upsample from a latent space to the final raw-waveform audio. To avoid checkerboard artifacts in audio, a phase shuffle operation is applied to perturb the phase of the discriminator's activations by -n to n. We chose to use the official WaveGAN model pre-trained on the SC09 dataset (a subset of Speech Commands (Warden, 2018)) for two reasons. First, from a qualitative perspective, it was much easier to distinguish whether the style and content of a reproduced spoken digit remained intact than for drums or bird calls. Second, the perceptual loss could be taken from feature maps of a classifier pre-trained on the SC09 dataset. Also, predicting the latent vector of real speech can be useful for applying style or content transformations through a transformer in the latent space.
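Phase shuffle itself is a small operation; a hedged NumPy sketch, assuming (batch, time, channels) activations and the reflection padding used in the WaveGAN reference implementation, could look like:

```python
import numpy as np

def phase_shuffle(x, n, rng=None):
    """Shift a (batch, time, channels) activation by a random offset drawn
    from [-n, n] along the time axis, reflect-padding the vacated samples.

    WaveGAN applies this to discriminator activations so the discriminator
    cannot latch onto the exact phase of checkerboard artifacts.
    """
    rng = np.random.default_rng() if rng is None else rng
    shift = int(rng.integers(-n, n + 1))
    if shift == 0:
        return x
    # pad on the side the signal moves away from, then crop back to length
    pad = ((0, 0), (max(shift, 0), max(-shift, 0)), (0, 0))
    padded = np.pad(x, pad, mode="reflect")
    start = max(-shift, 0)
    return padded[:, start:start + x.shape[1], :]
```

The output always has the same shape as the input, so the operation can be dropped between discriminator layers without changing the architecture.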
Inverse Mapping Model

Our goal for the inverse mapping model was to take as input the spectrogram of generated and real audio and output the predicted latent vector for a pre-trained WaveGAN. The inverse mapping model architecture is the same as ResNet-18 with the last activation function removed and the number of classes set to the size of the WaveGAN latent space. The ResNet-18 architecture was selected as it has been shown to be useful in the recovery of latent vectors in face GANs (Bayat et al., 2020). At each epoch, the inverse mapping model was trained equally on audio synthesized by WaveGAN at run time from a random latent vector and on real audio from the SC09 dataset. The architecture of our proposed method is presented in Figure 1.

In the case of synthesized audio, two losses were used to train the network. The first was the MSE between the original latent vector used as input to WaveGAN and the latent vector predicted by the inverse mapping model. Second, the perceptual loss between the audio synthesized from the original latent vector and the audio reconstructed from the predicted latent vector was calculated through a classifier model trained to predict the spoken digit. For real audio, only the perceptual loss was used, as we do not have a true latent vector. The MSE loss in combination with the perceptual loss helped to learn the inverse mapping from real and synthesized audio to the latent space while maintaining content and style.
To train the inverse mapping model, we employed two types of objective functions:

• MSE between latent vectors: At run time, audio was synthesized by WaveGAN from a random latent vector with a uniform distribution between -1 and 1. With the synthesized audio as input to the inverse mapping model, the latent vector was predicted in the last layer of the network. The ground truth latent vector was then compared to the predicted latent vector through mean squared error. This loss is necessary to learn the inverse mapping between the audio and latent vectors.

• Perceptual loss: For real and synthesized audio, the perceptual loss was back-propagated through the inverse mapping model. The perceptual loss was quantified as the difference in activations at each residual block of a ResNet-18 spoken digit classifier. The classifier was pre-trained on the SC09 dataset. This loss improved the reconstruction of real audio and helped keep style and content intact.

Figure 2: Comparing the Performance of the Deep Network Inverse Mapping Model with the Gradient Descent Approach in Latent Vector Recovery of Fake Audio. Three Example Digits are Shown Here.
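Treating the classifier's residual stages as a list of feature extractors, the perceptual loss described above can be sketched as follows; the callables are stand-ins for the residual stages of the pre-trained SC09 classifier, not the actual network:

```python
import numpy as np

def perceptual_loss(audio_a, audio_b, feature_blocks):
    """Sum of MSEs between two signals' activations at each block of a
    pre-trained classifier.

    `feature_blocks` is a list of callables, each mapping the running
    features to the next block's activations (stand-ins here for the
    residual stages of a ResNet-18 spoken-digit classifier).
    """
    fa, fb, loss = audio_a, audio_b, 0.0
    for block in feature_blocks:
        fa, fb = block(fa), block(fb)           # advance both signals one stage
        loss += np.mean((fa - fb) ** 2)          # compare activations at this depth
    return loss
```

Because the loss aggregates discrepancies at several depths, it penalizes both low-level waveform differences (early blocks) and digit-identity differences (late blocks), which is what keeps style and content intact.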
Experiments

In this section, we compare the inverse mapping ResNet-18 model with sole gradient descent and hybrid methods based on MSE distance, SSIM measure, and digit classification accuracy.
Our generated audio dataset was created at run time by generating a random latent vector with a uniform distribution between -1 and 1. The latent vector was used as input to WaveGAN to generate raw-waveform audio. For real audio, we used the SC09 dataset, which contains spoken digits from zero to nine. The training set includes 18,620 samples with roughly equal occurrences of each digit by many different speakers.
Table 1: Synthesized Audio Reconstructions (Inception Score (mean ± std), MSE on raw audio, SSIM on spectrogram).

The inverse mapping model was trained for 250 epochs with an Adam optimizer using a learning rate of 0.001 and a batch size of 64. The only modifications to the ResNet-18 architecture were the removal of the final activation function and the number of classes being set to the size of WaveGAN's latent space. During each epoch, the model was first trained on one batch of real audio with perceptual loss and then on one batch of WaveGAN-synthesized audio with MSE between latent vectors and perceptual loss between the original audio and reconstructions.
In order to recover the latent representation of a given audio clip using this method, we first sample a random vector, z*, from the uniform distribution. Then we feed this latent vector to WaveGAN to get the corresponding audio. Next, we convert the audio to a spectrogram and compute the Mean Absolute Error (MAE) loss between the generated spectrogram and the target spectrogram. Afterward, z* is updated by the SciPy optimize library using the L-BFGS algorithm (Liu and Nocedal, 1989). After each iteration, the updated z* is closer to the ground truth latent vector. Equation 1 depicts the optimization process, where W stands for the pre-trained WaveGAN and S converts audio to a spectrogram:

min_{z*} (1/n) Σ ||S(W(z)) − S(W(z*))||    (1)

The main disadvantage of optimization methods is that they are very slow. Numerous iterations are required to find the optimal latent vector for a target sample; further, finding the right hyperparameters (learning rate and number of iterations) is extremely challenging. To recover latent vectors of real audio, we limited z* to 50,000 gradient steps.

An initial objective of this project was to map generated audio back to latent space. The purpose of Experiment 1 was to evaluate the performance of the inverse mapping model, trained using pixel and perceptual loss, in latent vector recovery of synthesized audio. The results in Table 1 show we were able to recover the latent vectors of synthesized audio with high accuracy. Since the gradient-based method is optimized on the MAE of the spectrograms, the metrics may appear to show a better reconstruction through gradient-based optimization. In actuality, the reconstruction through our inverse mapping model produces a more coherent spoken digit with better content and style reconstruction. This is reflected in the inception score of our inverse mapping model approach and is shown in Experiment 3, where real audio maintains its digit.
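The optimization in Equation 1 can be prototyped with SciPy's L-BFGS-B solver on a toy differentiable stand-in for the composition S(W(·)); this is only a sketch of the setup, not the actual WaveGAN pipeline, and `generate_spec` is an assumed placeholder:

```python
import numpy as np
from scipy.optimize import minimize

def invert_by_lbfgs(target_spec, generate_spec, z_dim, rng=None):
    """Recover z* by minimizing the MAE between spectrograms with L-BFGS,
    starting from a uniform random vector as in Equation 1.

    `generate_spec` stands in for S(W(z)), the spectrogram of the
    WaveGAN output for latent vector z.
    """
    rng = np.random.default_rng() if rng is None else rng
    z0 = rng.uniform(-1, 1, size=z_dim)                     # random start in the prior
    loss = lambda z: np.mean(np.abs(generate_spec(z) - target_spec))
    res = minimize(loss, z0, method="L-BFGS-B",
                   bounds=[(-1, 1)] * z_dim)                # keep z in the prior's support
    return res.x
```

The box bounds play a role similar to clipping: they keep the recovered latent inside the support of the training distribution, which is what WaveGAN was conditioned on.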
Table 2: Real Audio Reconstructions (Inception Score (mean ± std), MSE on raw audio, SSIM on spectrogram, digit classification accuracy).

As mentioned earlier, optimization methods are capable of updating a random latent vector to recover the desired latent vector that reconstructs the target audio. We implemented an optimization-based method that updates a random vector for a maximum of 50,000 steps of gradient descent. Finding the right learning rate for the optimization-based approach is extremely challenging. In addition, we developed a hybrid method that first predicts a latent vector using our inverse mapping model and then updates it further with a maximum of 200 steps of gradient descent. The comparison between these three methods is summarized in Table 1 for fake audio. Although the hybrid method produces a lower loss for the reconstruction of raw waveforms and spectrograms, the reproduced audio is of worse auditory quality. This indicates the inverse mapping model captures the important style and content of the audio while discarding noise.

Overall, these results indicate that our inverse mapping model performs better than optimization-based methods.
Very little was found in the literature on how to reconstruct real audio using GAN inversion techniques. A major problem with mapping real audio is that there are no ground truth latent vectors that accurately generate it. Another issue is the limitation of the audio GANs themselves: they are not capable of generating high quality naturalistic audio that sounds identical to its real peers. Nevertheless, our proposed approach is universal and can later be applied to invert more advanced GANs. The results of comparing our inverse mapping model with the alternatives on projecting real audio to latent representations are summarized in Table 2. What stands out in the table is that while gradient-based methods reconstruct the waveforms and spectrograms well, in terms of classification accuracy our inverse mapping model is significantly better and is closest to the real audio accuracy.

Together, these results provide important insights into latent vector recovery in the audio domain.
Conclusion

The present research aimed to examine different approaches to inverse mapping of audio GANs. We proposed a deep residual neural network based approach to project both synthesized and real audio into latent space. We trained a ResNet-18 architecture based on a combination of reconstruction and perceptual losses. This study has identified that deep network solutions are not only faster than optimization-based alternatives, but also more accurate in terms of both auditory quality of the reconstructed sounds and digit classification accuracy.

Taken together, these findings suggest a role for deep neural networks in promoting inverse mapping of audio GANs. Being limited by current audio GANs, this study lacks high quality reconstructed audio for real sounds.
Acknowledgements
The authors would like to thank Western BrainsCAN for the generous support of this research. The study was conducted on Compute Canada resources. A.K. is supported by the Vector Scholarship in Artificial Intelligence, provided through the Vector Institute.
References
Nicky Bayat, Vahid Reza Khazaie, and Yalda Mohsenzadeh. Inverse mapping of face GANs. arXiv preprint arXiv:2009.05671, 2020.

Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967–1974, 2018.

Chris Donahue, Julian J. McAuley, and Miller S. Puckette. Synthesizing audio with generative adversarial networks. CoRR, abs/1802.04208, 2018. URL http://arxiv.org/abs/1802.04208.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

Lore Goetschalckx, Alex Andonian, Aude Oliva, and Phillip Isola. GANalyze: Toward visual definitions of cognitive image properties. In Proceedings of the IEEE International Conference on Computer Vision, pages 5744–5753, 2019.

Zachary C. Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.

Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315, 2017.

Pete Warden. Speech Commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.

Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision (ECCV), 2016.