Axial Residual Networks for CycleGAN-based Voice Conversion
Jaeseong You, Gyuhyeon Nam, Dalhyun Kim, Gyeongsu Chae
MoneyBrain Inc.
{jaeseongyou, ngh3053, torch, gc}@moneybrain.ai

Abstract
We propose a novel architecture and improved training objectives for non-parallel voice conversion. Our proposed CycleGAN-based model performs a shape-preserving transformation directly on a high frequency-resolution magnitude spectrogram, converting its style (i.e. speaker identity) while preserving the speech content. Throughout the entire conversion process, the model does not resort to compressed intermediate representations of any sort (e.g. mel spectrogram, low-resolution spectrogram, decomposed network feature). We propose an efficient axial residual block architecture to support this expensive procedure and various modifications to the CycleGAN losses to stabilize the training process. We demonstrate via experiments that our proposed model outperforms Scyclone and shows a comparable or better performance to that of CycleGAN-VC2 even without employing a neural vocoder.
Index Terms: voice conversion, generative adversarial networks, CycleGAN, axial convolution
1. Introduction
Voice conversion (VC) is a technique to convert the speaker identity of an input speech to a different one while preserving the linguistic content. Even fairly recent VC models often necessitate a parallel dataset, a set of texts spoken twice, respectively by the source speaker and the target speaker [1]. Such parallel corpora are hard to obtain, and the issue of aligning the tempo and the prosody between a pair of utterances further complicates the matter. The latest studies on VC therefore concern the more challenging task of learning from non-parallel utterances. Many strategies have been proposed to this end: sequence-to-sequence architectures that explicitly decompose linguistic content via automatic speech recognition (ASR) and text-to-speech (TTS) [2, 3], variational autoencoder (VAE)-based methods [4, 5], the style-changing StarGAN [6] applied to the voice domain [7, 8], and networks based on CycleGAN [9] that cycle between two voice identities [10, 11, 12, 13]. Our proposed model is of the last category.

The most closely related work to our proposed model is Scyclone [14], from which it inherits a CycleGAN-based, 1D-convolution-only architecture that manages a linear magnitude spectrogram directly. The authors of Scyclone claim to outperform the previous state-of-the-art CycleGAN-based voice conversion model, CycleGAN-VC2 [12], in both similarity and naturalness. Its architecture, however, does not scale effectively with the increasing frequency resolution of the spectrogram, leading to exploding gradients (confirmed in repeated preliminary experiments). Moreover, the model responds sensitively to a sample rate higher than 16kHz or to non-parallel utterances that differ significantly in their phonetic composition; it quickly deteriorates in quality and fails to preserve speech content. Finally, the receptive field of the 1D convolutional neural network (CNN) residual block that is commonly used across VC networks [10, 11, 12, 13] is too limited to handle low-frequency signals with high fidelity.

Our proposed model overcomes these difficulties by introducing modifications to the architectural design of the residual block and to the composition of the loss functions. By carrying the spectral information without channel-wise inflation or deflation, the model evinces that even high-resolution data can be effectively processed at once, with neither the multi-level representation of audio features at different time scales at the expense of computation and memory [15, 16] nor the low-resolution representation at the cost of performance [14]. This shape-preserving approach resembles that of distribution-modifying statistical models [17, 18], opening up possible extensions to flow or diffusion, but such discussion is outside the scope of this paper. Our major contributions are four-fold: 1) proposing an efficient axial residual block architecture with an extended receptive field in the temporal axis; 2) utilizing a high frequency-resolution spectrogram as input without resorting to a downscaled spectrogram; 3) enabling a direct spectrogram-to-spectrogram conversion that preserves the data shape end-to-end both time- and channel-wise; and 4) introducing various modifications to the CycleGAN loss that stabilize the training process and improve the fidelity of the resulting voice.
2. Model Description
The conventional residual block architecture (employed in both CycleGAN-VC and Scyclone) is not effective in the challenging task of transforming one form of high frequency-resolution spectrogram to another without an intermediate compressed representation, due to its limited receptive field. To overcome this difficulty, our proposed axial residual block is composed of a temporal-axis convolution layer and a frequency-axis convolution layer. Figure 1 illustrates the details. The architectural design bears high similarity to axial attention [19] in that the receptive field is efficiently widened in a two-step axial operation by capturing information along one of the two axes at a time. The major difference from such self-attention variants, however, is that the kernel is not dynamically computed at every time frame, which allows the model to process an input of variable length, unlike in a Transformer block where the input length should be fixed in advance. It also resembles spatially separable convolution, where the expressivity of one convolution layer is approximated in two steps for computational efficiency [20].

The authors of [21] observe that there is a qualitative difference between a receptive field widened over layers and one that is wide from the beginning, emphasizing the advantage of the latter. In addition, Kong et al. underline that phonemic relations in speech often extend over 100ms [16]. Hence, an ideal VC model should be able to process signal information that lies over the 100ms window from the very first layer, instead of reaching that scope only in successive layers. For the temporal-axis (i.e. x-axis) convolution layer, we therefore employ depth-wise convolution so that each kernel learns to process the changes in one frequency band independently of others. This computational efficiency allows an extremely large kernel size. Our choice is 17, which translates to 230ms (or 4.35Hz given the 22.05kHz sample rate). Every block can thus capture long-term correspondences (low-frequency signals) that occur across several thousand samples, and we view this capability as critical for the improved performance.

Figure 1: The conventional 1D-convolution residual block (left) and the proposed axial residual block (right). The dotted lines indicate optional components, and k, s, g denote kernel, stride, and group respectively. The axial block in the figure is equipped with a depth-wise convolution to process x-axis information, which can be further streamlined at a minimal compromise in performance if replaced with a lightweight convolution. The y-axis convolution, while illustrated to have a kernel size of 1, can be extended to a larger kernel for optimal performance.
A viable alternative to this depth-wise convolution for transformation in the time domain is the lightweight convolution [22]. While it is highly similar to the depth-wise 1D convolution in that each kernel is learned channel-wise, the lightweight convolution further improves efficiency by sharing the same kernel over a specified number of channels. First proposed in the domain of natural language processing, it has been successfully adopted in processing audio information [23, 24]; when applied to the audio domain, the inductive bias of the 1D lightweight convolution is that information in nearby frequency bands can be effectively processed with a shared operation, which well approximates the workings of human auditory cognition.

The frequency-axis (i.e. y-axis) convolution layer is a standard 1x1 1D convolution; the only particularity is that the number of kernels is the same as the number of input channels. This is in fact equivalent to a position-wise feedforward layer, resulting in kernels that focus on processing the information at one specific time point only. The inductive bias here is that changing one voice to another can be successfully realized by applying one nonlinear transformation communally across all the spectral frames. To increase the capacity and further widen the overall temporal receptive field, we enlarge the kernel size to 3 for our experiments. The frequency-axis convolution thus obtains more expressivity regarding temporal locality.

Residual connections can be made either once, between the input and the output of the block, which we choose for our experiments, or twice, between the input and the outputs of the two convolution layers. With the latter, the axial residual block attains a closer connection to the original Transformer block [25]; the only difference lies in that it employs either the depth-wise or the lightweight convolution layer instead of the self-attention layer. In fact, replacing the self-attention layer of the Transformer block with dynamic lightweight convolution has been proposed in [26], showing improvements when applied to early layers. One can interpret our proposed model as its static variant, where the kernel is static across the temporal dimension and the replacement is applied to the entire stack instead of only to early layers.
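To make the block design concrete, the following is a minimal PyTorch sketch of one axial residual block under the configuration described above (513 spectrogram channels, temporal kernel size 17, frequency kernel size 3, a single residual connection); the class and argument names are illustrative and not taken from any released implementation.

```python
# Minimal sketch of the proposed axial residual block, assuming 513-channel
# linear magnitude spectrograms shaped (batch, frequency bins, time frames).
import torch
import torch.nn as nn


class AxialResidualBlock(nn.Module):
    def __init__(self, channels: int = 513, t_kernel: int = 17, f_kernel: int = 3):
        super().__init__()
        # Temporal-axis (x-axis) convolution: depth-wise, so each kernel tracks
        # a single frequency band over a wide (17-frame, roughly 230 ms) window.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=t_kernel,
                                  padding=t_kernel // 2, groups=channels)
        self.act = nn.LeakyReLU(0.01)
        # Frequency-axis (y-axis) convolution: position-wise mixing of all
        # frequency channels; kernel size 3 adds a little temporal context.
        self.frequency = nn.Conv1d(channels, channels, kernel_size=f_kernel,
                                   padding=f_kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single residual connection between the block input and output.
        return x + self.frequency(self.act(self.temporal(x)))


if __name__ == "__main__":
    spec = torch.randn(2, 513, 128)          # two 128-frame spectrograms
    print(AxialResidualBlock()(spec).shape)  # shape preserved: (2, 513, 128)
```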
While CycleGAN-based VC models often employ a global discriminator that outputs one prediction averaged over the entire utterance (presumably for stable training), it tends to lead to various artifacts in the synthesized output: notably, outbursts of phonemic fragments and momentary distortion of content. We hypothesize that this is because the generator brings in short non-sensible fragments from the source audio to take advantage of the discriminator, which can yield only global predictions. To reduce these artifacts, we employ a PatchGAN-style loss instead. The effectiveness of the PatchGAN adversarial loss is consistent with the findings in speech-related GAN models [15, 16, 12], and we thus take the PatchGAN form to the extreme with regard to the temporal dimension, outputting a discriminative prediction per spectrogram frame.

The adversarial loss, while being the key component in transforming the style, tends to induce serious instability during training. CycleGAN-based VC models thus tend to use a modified GAN loss such as the least squares loss [27] or the hinge loss [28] to curb the high volatility. In conjunction with the PatchGAN design [29] extended to the maximal resolution in the time domain, however, our preliminary experiments indicate that a plain binary cross-entropy objective is more effective than the aforementioned variants. The adversarial loss in our proposed model is therefore defined as follows:

$$\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P_Y(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P_X(x)}[\log(1 - D_Y(G_{X \to Y}(x)))] \quad (1)$$

The cycle-consistency loss penalizes the case where the result converted twice severely deviates from the original utterance (i.e. after the style has changed from the source to the target, and back to the source once more). The original cycle-consistency loss is defined as follows:

$$\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P_X(x)}[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert] + \mathbb{E}_{y \sim P_Y(y)}[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert] \quad (2)$$

However, the content reconstruction often breaks when the two training corpora differ greatly in terms of phonetic composition (inter-language in the extreme case) and style (inter-sex in the extreme case). We therefore augment the loss with auxiliary feature matching. In other words, we add a stronger constraint that all the pairs of residual block outputs of the corresponding discriminator should be similar for the cycled output and the original utterance:

$$\begin{aligned} \mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) ={}& \mathbb{E}_{x \sim P_X(x)}[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert] + \mathbb{E}_{y \sim P_Y(y)}[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert] \\ &+ \mathbb{E}_{x \sim P_X(x)}[\lVert D^{f}_{X}(G_{Y \to X}(G_{X \to Y}(x))) - D^{f}_{X}(x) \rVert] \\ &+ \mathbb{E}_{y \sim P_Y(y)}[\lVert D^{f}_{Y}(G_{X \to Y}(G_{Y \to X}(y))) - D^{f}_{Y}(y) \rVert] \end{aligned} \quad (3)$$

The identity loss, first proposed in [30], regularizes the generator to approximate an identity mapping when real samples of the target domain are fed. The expected effects are that the model responds conservatively to an unknown distribution and better preserves phonetic content in general. The original identity loss is defined as follows:

$$\mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P_Y(y)}[\lVert G_{X \to Y}(y) - y \rVert] + \mathbb{E}_{x \sim P_X(x)}[\lVert G_{Y \to X}(x) - x \rVert] \quad (4)$$

We extend the identity loss by applying the identity mapping not only to real samples, but also to the transformed utterances.
In this manner, the generator is exposed to additional types of unknown distributions:

$$\begin{aligned} \mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}) ={}& \mathbb{E}_{y \sim P_Y(y)}[\lVert G_{X \to Y}(y) - y \rVert] + \mathbb{E}_{x \sim P_X(x)}[\lVert G_{Y \to X}(x) - x \rVert] \\ &+ \mathbb{E}_{x \sim P_X(x)}[\lVert G_{X \to Y}(G_{X \to Y}(x)) - G_{X \to Y}(x) \rVert] \\ &+ \mathbb{E}_{y \sim P_Y(y)}[\lVert G_{Y \to X}(G_{Y \to X}(y)) - G_{Y \to X}(y) \rVert] \end{aligned} \quad (5)$$

The total loss therefore consists of the three aforementioned components: the adversarial loss, the cycle-consistency loss, and the identity loss. The three components are linearly scaled with their corresponding lambda values to control their relative contribution to the total loss:

$$\mathcal{L}_{total} = \lambda_{adv} \mathcal{L}_{adv} + \lambda_{cyc} \mathcal{L}_{cyc} + \lambda_{id} \mathcal{L}_{id} \quad (6)$$

The generator architecture is fully 1D-CNN. The prenet convolution layer, a 1x1 convolution, mixes the channel information in the same manner at every time point. The intermediate stack consists of 7 axial residual blocks. Within each block, between the temporal convolution and the frequency convolution, non-linearity is introduced with a leaky ReLU activation with a slope of 0.01. The postnet convolution once again mixes the channel information, followed by a final ReLU activation, which facilitates the generator synthesizing a linear magnitude spectrogram that should take only positive values.

The architectural design of the discriminator is mostly equivalent to that of Scyclone [14]. Gaussian noise with a standard deviation of 0.01 is added to the input to stabilize the training process [31]. Additionally, spectral normalization is applied to the weights of all the convolution layers [32] to prevent the discriminator from learning too quickly by capping the magnitude of possible change:

$$W_{SN} = \frac{W}{\sigma(W)}, \qquad \sigma(W) = \max_{h: h \neq 0} \frac{\lVert Wh \rVert}{\lVert h \rVert} \quad (7)$$

The subsequent structure following the prenet layer consists of 5 original residual blocks, each introducing non-linearity with a leaky ReLU with a larger slope of 0.2. The logit values are then output from the postnet convolution. Figure 2 shows the overall architecture.
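To illustrate how the modified objectives of Eqs. (1)–(6) fit together, the following is a hedged PyTorch sketch of the generator-side loss for one training step. It assumes generators G_xy and G_yx, discriminators D_x and D_y whose forward pass returns per-frame logits together with a list of residual-block features, and an L1 norm for all reconstruction terms; the lambda values are placeholders rather than the settings used in our experiments, and the non-saturating BCE form of Eq. (1) is used for the generator.

```python
# Sketch of the generator objective (Eqs. 1-6); interfaces and values below
# are assumptions for illustration only.
import torch
import torch.nn.functional as F


def generator_loss(G_xy, G_yx, D_x, D_y, x, y,
                   lambda_adv=1.0, lambda_cyc=10.0, lambda_id=1.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    cycled_x, cycled_y = G_yx(fake_y), G_xy(fake_x)

    # Adversarial loss (Eq. 1): plain binary cross entropy on per-frame logits.
    logits_y, _ = D_y(fake_y)
    logits_x, _ = D_x(fake_x)
    adv = (F.binary_cross_entropy_with_logits(logits_y, torch.ones_like(logits_y))
           + F.binary_cross_entropy_with_logits(logits_x, torch.ones_like(logits_x)))

    # Cycle-consistency loss with discriminator feature matching (Eq. 3).
    _, feats_cx = D_x(cycled_x)
    _, feats_x = D_x(x)
    _, feats_cy = D_y(cycled_y)
    _, feats_y = D_y(y)
    cyc = F.l1_loss(cycled_x, x) + F.l1_loss(cycled_y, y)
    cyc = cyc + sum(F.l1_loss(a, b) for a, b in zip(feats_cx, feats_x))
    cyc = cyc + sum(F.l1_loss(a, b) for a, b in zip(feats_cy, feats_y))

    # Identity loss extended to converted utterances (Eq. 5).
    idt = (F.l1_loss(G_xy(y), y) + F.l1_loss(G_yx(x), x)
           + F.l1_loss(G_xy(fake_y), fake_y) + F.l1_loss(G_yx(fake_x), fake_x))

    # Total generator objective (Eq. 6).
    return lambda_adv * adv + lambda_cyc * cyc + lambda_id * idt
```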
3. Experiments
We compare our proposed model to CycleGAN-VC2 [12] and Scyclone [14] for English and Korean voice conversion.
Figure 2: Schematic architecture of the generator (left) and the discriminator (right). The number of blocks and the detailed configuration of each block can be further fine-tuned for optimal performance.

For the former, we use the multi-speaker VCTK dataset [33]. Mic1 recordings of two females (p299 and p301) and two males (p311 and p360) are selected and downsampled to 22.05kHz (for our proposed model) or to 16kHz (for CycleGAN-VC2 and Scyclone). The 24 parallel utterances shared by the four speakers are reserved for evaluation, and the rest are utilized for training. We normalize the audio, trim its beginning silence, and pad zeros at the end if needed to meet the following data shape. The 128-frame, 513-channel linear magnitude spectrogram is computed using a window length of 1024 samples and a hop size of 256 samples with a Hann window.

For the latter Korean voice conversion, we use the Korean Single-Speaker dataset (KSS) [34] and our proprietary Korean female dataset (JEY). After removing excessively short samples, we use 11,000 utterances per identity for training, reserving 584 and 372 samples for evaluation; these are sampled randomly since the two datasets share no parallel utterances. The remaining audio preprocessing procedures are identical to those of the former experiment.

The two experiments share the same training process. Two pairs of discriminator and generator are initialized. One generator converts voice from identity X to identity Y, and the other does the opposite. The discriminator paired with the former tells whether a voice sample of identity Y is real or converted from identity X, and the one paired with the latter does the same for identity X. For both the generators and the discriminators, we use the Adam optimizer [35] with α = 2.×10−, β1 = 0., and β2 = 0. The learning rates are annealed by a factor of . every 50 epochs. Each model is trained for 200 epochs with a batch size of 16. The scaling factors of the total loss, λadv, λcyc, and λid, are set to ., ., and . respectively. For Scyclone and CycleGAN-VC2, we follow the same training settings as proposed in the original papers.

Finally, when converting a resulting spectrogram to a waveform, one can expect a substantial performance gain from using a neural vocoder (e.g. WaveGlow [17], WaveGrad [18]). We, however, compute the final audio with the Griffin-Lim algorithm [36] with 32 iterations, in order to compare the models on an equal footing. CycleGAN-VC2, however, is inherently dependent on the WORLD vocoder [37], so we make an exception for that model.
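As a concrete reference for the analysis and synthesis settings above, a minimal librosa sketch follows; the file path is a placeholder, and the normalization, silence trimming, and padding steps are omitted.

```python
# Sketch of the spectrogram analysis/synthesis pipeline, assuming librosa and
# 22.05 kHz audio; "sample.wav" is a placeholder path.
import librosa
import numpy as np

wav, sr = librosa.load("sample.wav", sr=22050)

# 513-channel linear magnitude spectrogram: 1024-sample Hann window, hop 256.
spec = np.abs(librosa.stft(wav, n_fft=1024, hop_length=256, window="hann"))
print(spec.shape)  # (513, n_frames); training crops or pads to 128 frames

# ... the conversion model transforms `spec` while preserving its shape ...

# Waveform reconstruction with 32 Griffin-Lim iterations (no neural vocoder).
recon = librosa.griffinlim(spec, n_iter=32, hop_length=256, win_length=1024)
```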
Figure 3: MOS test on three pairs of the four selected VCTK speakers.

To compare the structural similarity of the converted outcome and the ground truth, we use the mel spectral distortion (MSD) metric. Since a converted utterance does not perfectly align with its true correspondence, we apply dynamic time warping (DTW) to pair positions where the two sequences are most similar. For the computation of the metric, we use a 40-channel log-mel spectrogram based on a 46.44ms Hann window and a hop size of 11.61ms. The mel filters span from 0 to 8,000Hz. In the case of the Korean voice conversion experiment, where DTW sequence pairing is impossible due to the lack of parallel utterances, we compare the MSD of each of the converted utterances to the ground-truth MSD averaged per identity.

Word error rate (WER) measures how well each model preserves the linguistic content during the conversion process; it is computed by comparing the ASR result of the converted utterance to the ground-truth text. We select commercial ASR models based on overall stability for each experiment, [38] for VCTK and [39] for Korean, respectively.

In addition to this objective evaluation, we conduct human listening tests for subjective evaluation of the converted speech. Each of the 10 subjects is asked to provide a mean opinion score (MOS) (5: very similar, to 1: very dissimilar), evaluating how similar the converted speech is to a real utterance of the target identity. A rater evaluates a pair of a true utterance and a converted speech, repeating for 24 utterances per voice identity in the VCTK experiment. In the Korean experiment, we use 10 randomly sampled instances per identity.
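The MSD metric described above can be sketched as follows; the per-frame distance (Euclidean over natural-log mel bands) and the 22.05kHz input are assumptions for illustration, since the exact MSD formula is not spelled out here.

```python
# Hedged sketch of MSD with DTW alignment: 40-band log-mel features with a
# 46.44 ms window and 11.61 ms hop (1024/256 samples at 22.05 kHz), 0-8 kHz mels.
import librosa
import numpy as np


def log_mel(wav, sr=22050):
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256,
                                         n_mels=40, fmin=0, fmax=8000)
    return np.log(mel + 1e-10)  # natural log is an assumption


def mel_spectral_distortion(converted, reference, sr=22050):
    a, b = log_mel(converted, sr), log_mel(reference, sr)
    # DTW pairs each converted frame with the most similar reference frame.
    _, path = librosa.sequence.dtw(X=a, Y=b, metric="euclidean")
    return float(np.mean([np.linalg.norm(a[:, i] - b[:, j]) for i, j in path]))
```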
Table 1: Average MSD and WER results on VCTK voice conversion. The results are organized by the conversion from the former person ID to the latter person ID.
              Scyclone      VC2           Ours
299-301  MSD  2.55 ± .18    ± .17         2.38 ± .22
         WER  0.94 ± .04    ± .03         0.07 ± .03
301-299  MSD  2.26 ± .18    2.22 ± .22    ± .22
         WER  0.95 ± .04    ± .04         0.10 ± .08
311-360  MSD  1.99 ± .13    1.83 ± .16    ± .16
         WER  0.97 ± .02    ± .05         0.15 ± .07
360-311  MSD  2.29 ± .18    ± .15         2.06 ± .15
         WER  0.99 ± .01    0.10 ± .04    ± .04
299-311  MSD  2.32 ± .16    2.08 ± .18    ± .15
         WER  0.98 ± .01    ± .05*        0.33 ± .08
311-299  MSD  2.71 ± .19    ± .27         2.58 ± .18
         WER  0.98 ± .02    ± .06*        0.30 ± .10

In terms of changing the voice identity, Table 1 and Figure 3 show that our proposed model outperforms Scyclone and achieves a performance comparable to that of CycleGAN-VC2 in the VCTK experiment. An asterisk denotes a statistically significant top result. Table 2 and Figure 4 summarize the voice conversion experiment on Korean, in which our proposed model outperforms both Scyclone and CycleGAN-VC2 in changing styles even without a neural vocoder. We hypothesize that, since our proposed model handles high-resolution data without compression, it is more data-hungry and requires more utterances to train to a sufficient level. VCTK, where fewer than 400 utterances are available per identity, thus does not fully realize the capability of our proposed architecture. While CycleGAN-VC2 consistently does well on WER, we suspect this is because the model tends to introduce changes to style conservatively.

Figure 4: MOS test on KSS-JEY voice conversion and vice versa.
Table 2: Average MSD and WER results on KSS-JEY voice conversion and vice versa.
              Scyclone      VC2           Ours
KSS-JEY  MSD  1.63 ± .50    1.61 ± .24    ± .45
         WER  0.85 ± .25    ± .06*        0.30 ± .12
JEY-KSS  MSD  1.73 ± .42    ± .29         1.39 ± .41
         WER  0.75 ± .33    ± .20         0.36 ± .19
4. Conclusions
We propose a fully 1D CNN network to process a high frequency-resolution spectrogram directly, with its shape unchanged. In comparison to the conventional residual block employed across CycleGAN-based VC models, our proposed axial residual block extends the temporal receptive field while lowering memory and computation requirements. We address meaningful similarities and connections to various recent architectural solutions such as axial attention and the Transformer block. The effectiveness of the proposed design is demonstrated through various experiments; our model outperforms Scyclone on all the evaluation metrics, and it shows comparable or better performance to that of CycleGAN-VC2 even without a neural vocoder, especially in terms of speaker similarity with the larger Korean datasets.

5. References
[1] S. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65–82, 2017.
[2] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "ATTS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6805–6809, 2019.
[3] W. Huang, T. Hayashi, Y. Wu, H. Kameoka, and T. Toda, "Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining," arXiv preprint arXiv:1912.06813, 2019.
[4] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1432–1443, 2018.
[5] P. Tobing, Y. Wu, T. Hayashi, K. Kobayashi, and T. Toda, "Non-parallel voice conversion with cyclic variational autoencoder," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 674–678, 2019.
[6] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8789–8797, 2018.
[7] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," IEEE Spoken Language Technology Workshop (SLT), pp. 266–273, 2018.
[8] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," INTERSPEECH, pp. 679–683, 2019.
[9] J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," IEEE International Conference on Computer Vision (ICCV), pp. 2223–2232, 2017.
[10] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial network," arXiv preprint arXiv:1711.11293, 2017.
[11] ——, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," European Signal Processing Conference (EUSIPCO), pp. 2100–2104, 2018.
[12] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6820–6824, 2019.
[13] ——, "CycleGAN-VC3: Examining and improving CycleGAN-VCs for mel-spectrogram conversion," INTERSPEECH, pp. 2017–2021, 2020.
[14] M. Tanaka, T. Nose, A. Kanagaki, R. Shimizu, and A. Ito, "Scyclone: High-quality and parallel-data-free voice conversion using spectrogram and cycle-consistent adversarial networks," arXiv preprint arXiv:2005.03334, 2020.
[15] K. Kumar, R. Kumar, T. Boissiere, L. Gestin, W. Teoh, J. Sotelo, A. Brébisson, Y. Bengio, and A. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," Advances in Neural Information Processing Systems (NeurIPS), 2019.
[16] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems (NeurIPS), 2020.
[17] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[18] N. Chen, Y. Zhang, H. Zen, R. Weiss, M. Norouzi, and W. Chan, "WaveGrad: Estimating gradients for waveform generation," arXiv preprint arXiv:2009.00713, 2020.
[19] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans, "Axial attention in multidimensional transformers," arXiv preprint arXiv:1912.12180, 2019.
[20] F. Mamalet and C. Garcia, "Simplifying convnets for fast learning," International Conference on Artificial Neural Networks (ICANN), pp. 58–65, 2012.
[21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[22] F. Wu, A. Fan, A. Baevski, Y. Dauphin, and M. Auli, "Pay less attention with lightweight and dynamic convolutions," International Conference on Learning Representations (ICLR), 2019.
[23] I. Elias, H. Zen, J. Shen, Y. Zhang, Y. Jia, R. Weiss, and Y. Wu, "Parallel Tacotron: Non-autoregressive and controllable TTS," arXiv preprint arXiv:2010.11439, 2020.
[24] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems (NeurIPS), 2020.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems (NeurIPS), pp. 6000–6010, 2017.
[26] C. Yun, S. Bhojanapalli, A. Rawat, S. Reddi, and S. Kumar, "Are transformers universal approximators of sequence-to-sequence functions?" arXiv preprint arXiv:1912.10077, 2019.
[27] X. Mao, Q. Li, H. Xie, R. Lau, Z. Wang, and S. Smolley, "Least squares generative adversarial networks," IEEE International Conference on Computer Vision (ICCV), pp. 2794–2802, 2017.
[28] J. Lim and J. Ye, "Geometric GAN," arXiv preprint arXiv:1705.02894, 2017.
[29] P. Isola, J. Zhu, T. Zhou, and A. Efros, "Image-to-image translation with conditional adversarial networks," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1125–1134, 2017.
[30] Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," International Conference on Learning Representations (ICLR), 2017.
[31] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," International Conference on Learning Representations (ICLR), 2017.
[32] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," International Conference on Learning Representations (ICLR), 2018.
[33] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," University of Edinburgh, The Centre for Speech Technology Research (CSTR).
[34] K. Park, "KSS: Korean single speaker speech dataset," 2018.
[35] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations (ICLR), 2015.
[36] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 236–243, 1984.
[37] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, 2016.