CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo
NTT Communication Science Laboratories, NTT Corporation, Japan [email protected]
Abstract
Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-spectrogram conversion, they are typically used for mel-cepstrum conversion even when comparative methods employ mel-spectrogram as a conversion target. To address this, we examined the applicability of CycleGAN-VC/VC2 to mel-spectrogram conversion. Through initial experiments, we discovered that their direct applications compromised the time-frequency structure that should be preserved during conversion. To remedy this, we propose CycleGAN-VC3, an improvement of CycleGAN-VC2 that incorporates time-frequency adaptive normalization (TFAN). Using TFAN, we can adjust the scale and bias of the converted features while reflecting the time-frequency structure of the source mel-spectrogram. We evaluated CycleGAN-VC3 on inter-gender and intra-gender non-parallel VC. A subjective evaluation of naturalness and similarity showed that for every VC pair, CycleGAN-VC3 outperforms or is competitive with the two types of CycleGAN-VC2, one of which was applied to mel-cepstrum and the other to mel-spectrogram.
Index Terms: voice conversion (VC), non-parallel VC, generative adversarial networks (GANs), CycleGAN-VC, mel-spectrogram conversion
1. Introduction
Voice conversion (VC) is a technique for converting non/para-linguistic information in speech while retaining the linguistic information. VC has been actively studied owing to its potential diverse applications such as in speaking aids [1, 2], speech enhancement [3, 4], and accent conversion [5]. Machine-learning approaches are widely used and include statistical methods based on Gaussian mixture models [6, 7] and neural networks (NNs), including feedforward NNs [8], recurrent NNs [9], convolutional NNs (CNNs) [5], attention networks [10, 11], and generative adversarial networks (GANs) [5].

For ease of learning, many VC methods (including the above-mentioned methods) learn mappings from source to target speeches using a parallel corpus. However, collection of such data is not necessarily easy or practical. Additionally, even if such data are collected, the time alignment procedure required in most VC methods remains a challenge.

As an alternative, non-parallel VC, which does not require a parallel corpus for training, has gained attention recently.
Figure 1: Comparison of source, target, and converted mel-spectrograms: (a) source (SM), (b) target (TF), (c) CycleGAN-VC2, and (d) CycleGAN-VC3. In the red box, CycleGAN-VC2 (c) compromises the harmonic structure that appears in the source (a) and target (b), whereas CycleGAN-VC3 (d) succeeds in retaining it. Note that achieving this is not trivial because parallel data, such as (a) and (b), were not available for training.

Non-parallel VC is attractive in terms of data collection cost; however, its learning is challenging owing to the absence of explicit supervision. To address this, several studies have utilized linguistic information [12, 13, 14, 15]. Although this extra supervision improves the performance, auxiliary data or modules are required to extract linguistic information.

To avoid such a requirement and achieve non-parallel VC using only acoustic data, variational autoencoder-based methods [16, 17, 18] and GAN-based methods [17, 19] have been proposed. Among them, CycleGAN-VC [19, 20] has garnered attention alongside its variants (CycleGAN-VC2 [21] and StarGAN-VCs [22, 23, 24]), and they have been widely used as benchmark methods in several studies (e.g., [15, 25, 26]). However, owing to the ambiguity of their effectiveness for mel-spectrogram conversion, they are typically used for mel-cepstrum conversion even when comparative methods employ mel-spectrogram as a conversion target (e.g., [15, 26]).

These facts motivated us to examine the applicability of CycleGAN-VC [19] and CycleGAN-VC2 [21] to mel-spectrogram conversion. Through initial experiments, we found that when CycleGAN-VC/VC2 is directly applied to a mel-spectrogram, it compromises the time-frequency structure that should be preserved during conversion, as shown in Figure 1. To address this, we propose CycleGAN-VC3, an improvement of CycleGAN-VC2 that incorporates time-frequency adaptive normalization (TFAN). TFAN is inspired by spatially adaptive (de)normalization (SPADE) [27], which was originally proposed for semantic image synthesis. We revise SPADE for application to 1D and 2D time-frequency features. Using TFAN, we can adjust the scale and bias of the converted features while reflecting the time-frequency structure of the source mel-spectrogram.

We examined the effectiveness of CycleGAN-VC3 on inter-gender and intra-gender non-parallel VC using the Voice Conversion Challenge 2018 (VCC 2018) dataset [28]. A subjective evaluation of naturalness and similarity showed that for every VC pair, CycleGAN-VC3 achieved a better or competitive performance compared with the two types of CycleGAN-VC2, one of which was applied to mel-cepstrum and the other to mel-spectrogram.

The remainder of this paper is organized as follows. In Section 2, we briefly review conventional CycleGAN-VC/VC2. In Section 3, we introduce the proposed CycleGAN-VC3. In Section 4, we describe the experimental results. Section 5 presents a concise summary and description of future work.
2. Conventional CycleGAN-VC/VC2
CycleGAN-VC/VC2 aims at learning a mapping G_{X→Y} that converts source acoustic features x ∈ X into target acoustic features y ∈ Y without using a parallel corpus. Inspired by CycleGAN [29], originally proposed for unpaired image-to-image translation, CycleGAN-VC/VC2 learns this mapping using an adversarial loss [30], a cycle-consistency loss [31], and an identity-mapping loss [32]. Furthermore, CycleGAN-VC2 uses a second adversarial loss to improve the details of the reconstructed features.

Adversarial loss.
To ensure that the converted feature G_{X→Y}(x) lies in the target Y, an adversarial loss L_{adv}^{X→Y} is used as follows:

\mathcal{L}_{\mathrm{adv}}^{X \to Y} = \mathbb{E}_{y \sim P_Y}[\log D_Y(y)] + \mathbb{E}_{x \sim P_X}[\log(1 - D_Y(G_{X \to Y}(x)))],   (1)

where the discriminator D_Y attempts to distinguish the synthesized G_{X→Y}(x) from the real y by maximizing this loss, whereas G_{X→Y} attempts to synthesize G_{X→Y}(x) that can deceive D_Y by minimizing it. Similarly, the inverse mapping G_{Y→X} and the discriminator D_X are trained adversarially using L_{adv}^{Y→X}.

Cycle-consistency loss.
To preserve the composition in conversion, a cycle-consistency loss L_{cyc} is used as follows:

\mathcal{L}_{\mathrm{cyc}} = \mathbb{E}_{x \sim P_X}[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert_1] + \mathbb{E}_{y \sim P_Y}[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert_1].   (2)

This loss is used with a hyperparameter λ_{cyc}, which controls its relative importance. The loss encourages G_{X→Y} and G_{Y→X} to identify pseudo pairs within the cycle-consistency constraint.

Identity-mapping loss.
To facilitate input preservation, an identity-mapping loss L_{id} is used as follows:

\mathcal{L}_{\mathrm{id}} = \mathbb{E}_{y \sim P_Y}[\lVert G_{X \to Y}(y) - y \rVert_1] + \mathbb{E}_{x \sim P_X}[\lVert G_{Y \to X}(x) - x \rVert_1].   (3)

This loss is used with a hyperparameter λ_{id}, which controls its relative importance.

Second adversarial loss.
In CycleGAN-VC2, to mitigate the statistical averaging caused by the L1 loss (Equation 2), an additional discriminator D'_X is introduced, and a second adversarial loss L_{adv}^{X→Y→X} is imposed on the circularly converted features as follows:

\mathcal{L}_{\mathrm{adv}}^{X \to Y \to X} = \mathbb{E}_{x \sim P_X}[\log D'_X(x)] + \mathbb{E}_{x \sim P_X}[\log(1 - D'_X(G_{Y \to X}(G_{X \to Y}(x))))].   (4)

Similarly, the discriminator D'_Y is introduced, and L_{adv}^{Y→X→Y} is imposed on the inverse-forward mapping.
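To make the overall objective concrete, the following is a minimal PyTorch-style sketch of how the four losses could be combined on one mini-batch. The module names (G_XY, G_YX, D_Y, D_X, D2_X, D2_Y), the least-squares form of the adversarial terms (the paper trains with LSGAN [40], although Equations (1) and (4) are written in the log form), and the loss weights are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the CycleGAN-VC2 objective (Eqs. 1-4); the generator
# and discriminator modules are assumed to be defined elsewhere.
import torch
import torch.nn.functional as F

def adv_g(d, fake):
    # Generator side of the adversarial loss (least-squares form).
    pred = d(fake)
    return F.mse_loss(pred, torch.ones_like(pred))

def generator_objective(G_XY, G_YX, D_Y, D_X, D2_X, D2_Y, x, y,
                        lambda_cyc=10.0, lambda_id=5.0):
    fake_y = G_XY(x)                     # forward mapping  X -> Y
    fake_x = G_YX(y)                     # inverse mapping  Y -> X
    cyc_x = G_YX(fake_y)                 # X -> Y -> X
    cyc_y = G_XY(fake_x)                 # Y -> X -> Y

    loss_adv = adv_g(D_Y, fake_y) + adv_g(D_X, fake_x)        # Eq. (1) + inverse
    loss_adv2 = adv_g(D2_X, cyc_x) + adv_g(D2_Y, cyc_y)       # Eq. (4) + inverse
    loss_cyc = F.l1_loss(cyc_x, x) + F.l1_loss(cyc_y, y)      # Eq. (2)
    loss_id = F.l1_loss(G_XY(y), y) + F.l1_loss(G_YX(x), x)   # Eq. (3)

    return loss_adv + loss_adv2 + lambda_cyc * loss_cyc + lambda_id * loss_id
```

The discriminators would be updated with the corresponding real/fake least-squares terms; that symmetric half is omitted for brevity.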
CycleGAN-VC uses 1D CNN generators [5] to capture the overall relationship along the feature dimension while preserving the temporal structure. In particular, the network is composed of downsampling, residual [33], and upsampling blocks to capture wide-range temporal relationships effectively, and gated linear units (GLUs) [34] are used as the activation to learn sequential and hierarchical structures adaptively. However, a study on CycleGAN-VC2 [21] showed that 1D CNNs in the downsampling and upsampling blocks affect the structure that should be retained in conversion. To alleviate this, CycleGAN-VC2 introduces a 2-1-2D CNN that uses 2D CNNs in the upsampling and downsampling blocks and 1D CNNs in the residual blocks. The former are used for extracting the time-frequency structure while preserving the original structure, and the latter are used for performing dynamic changes.
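This block layout can be summarized in code. The following PyTorch sketch reproduces only the downsample (2D), residual (1D), upsample (2D) flow; the channel counts, kernel sizes, number of residual blocks, and the use of nearest-neighbor upsampling instead of pixel shuffle are simplifications of the CycleGAN-VC2 design, not its exact hyperparameters.

```python
# Simplified 2-1-2D generator skeleton; hyperparameters are illustrative.
import torch.nn as nn
import torch.nn.functional as F

class GLUConv2d(nn.Module):
    """2D convolution whose output is gated by a GLU [34]."""
    def __init__(self, in_ch, out_ch, kernel, stride, padding):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel, stride, padding)

    def forward(self, x):
        return F.glu(self.conv(x), dim=1)

class ResBlock1d(nn.Module):
    """1D residual block that models dynamic changes along the time axis."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, 3, padding=1), nn.InstanceNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.InstanceNorm1d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator212D(nn.Module):
    def __init__(self, n_mels=80, ch=64, res_ch=256, n_res=6):
        super().__init__()
        self.down = nn.Sequential(                       # 2D downsampling
            GLUConv2d(1, ch, (5, 15), 1, (2, 7)),
            GLUConv2d(ch, 2 * ch, 5, 2, 2),
            GLUConv2d(2 * ch, 4 * ch, 5, 2, 2),
        )
        flat = 4 * ch * (n_mels // 4)
        self.to1d = nn.Conv1d(flat, res_ch, 1)           # 2D -> 1D
        self.res = nn.Sequential(*[ResBlock1d(res_ch) for _ in range(n_res)])
        self.to2d = nn.Conv1d(res_ch, flat, 1)           # 1D -> 2D
        self.up = nn.Sequential(                         # 2D upsampling
            nn.Upsample(scale_factor=2), GLUConv2d(4 * ch, 2 * ch, 5, 1, 2),
            nn.Upsample(scale_factor=2), GLUConv2d(2 * ch, ch, 5, 1, 2),
            nn.Conv2d(ch, 1, (5, 15), 1, (2, 7)),
        )

    def forward(self, x):                                # x: (B, 1, n_mels, T)
        h = self.down(x)                                 # (B, 4*ch, n_mels/4, T/4)
        b, c, q, t = h.shape
        h = self.to1d(h.reshape(b, c * q, t))            # collapse frequency into channels
        h = self.res(h)                                  # 1D residual blocks
        h = self.to2d(h).reshape(b, c, q, t)             # restore the 2D layout
        return self.up(h)
```

In the actual CycleGAN-VC2/VC3 design, upsampling uses pixel shuffle (PS in Figure 3), and IN appears in the 1D-to-2D and upsampling blocks; Section 3 replaces exactly those IN layers with TFAN.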
CycleGAN-VC uses 2D CNN discriminators [35] to discriminate data based on 2D spectral textures. In particular, it uses FullGAN, which has a fully connected layer as the last layer, to discriminate data based on the overall input structure. However, in FullGAN, the need to learn many parameters causes learning difficulty. To mitigate this, CycleGAN-VC2 introduces PatchGAN [36], which uses convolution at the last layer. This reduces the number of parameters and stabilizes GAN training.
3. CycleGAN-VC3
CycleGAN-VC and CycleGAN-VC2 were originally designed for mel-cepstrum conversion, and their effectiveness in mel-spectrogram conversion has not been sufficiently examined. We empirically examined their effectiveness and discovered that they compromise the time-frequency structure that should be preserved in conversion, as shown in Figure 1.

Motivated by this finding, we devised TFAN, which extends instance normalization (IN) [37] to adjust the scale and bias of the converted features while reflecting the source information (i.e., x) in a time- and frequency-wise manner. In particular, we designed TFAN for 1D and 2D time-frequency features so that it can be used in the 2-1-2D CNN (Section 2.2). Figure 2 illustrates the architectures of TFAN. Given a feature f, TFAN normalizes it in a channel-wise manner similar to IN and then modulates the normalized feature in an element-wise manner using a scale γ(x) and bias β(x), which are calculated from x using CNNs:

f' = \gamma(x) \cdot \frac{f - \mu(f)}{\sigma(f)} + \beta(x),   (5)

where f' is the output feature, and μ(f) and σ(f) are the channel-wise average and standard deviation of f, respectively. In IN, an x-independent scale γ and bias β are applied in a channel-wise manner, whereas in TFAN, those calculated from x (i.e., γ(x) and β(x)) are applied in an element-wise manner. These differences allow TFAN to adjust the scale and bias of f while reflecting x in a time- and frequency-wise manner.

Figure 2: Architectures of (a) 1D TFAN and (b) 2D TFAN. x, f, f', and h indicate the source mel-spectrogram, input feature, output feature, and hidden-layer feature, respectively. T, Q, and C with subscripts denote the time, frequency, and channel dimensions of the corresponding feature, respectively; they are omitted when their values are the same as those in the previous step. N represents the number of layers in the dashed-line box. After f is normalized in a channel-wise manner, it is modulated in an element-wise manner using the scale γ(x) and bias β(x), which are calculated using either (a) a 1D CNN or (b) a 2D CNN.

Note that TFAN is inspired by SPADE [27], which was originally proposed for semantic image synthesis. The main differences are that (1) SPADE was devised for 2D image features, whereas TFAN is designed for both 1D and 2D time-frequency features, (2) SPADE uses a one-layer CNN in the component shown in the dashed-line box in Figure 2 because drastic changes are not required in semantic image synthesis, whereas TFAN uses a multi-layer CNN to enable dynamic changes, and (3) SPADE is based on batch normalization [38], whereas TFAN is based on IN. We examine the effects of (1) and (2) and present our findings in Section 4.2.

In CycleGAN-VC3, we incorporate TFAN into the CycleGAN-VC2 generator (i.e., the 2-1-2D CNN (Section 2.2)). In particular, IN in the 1D-to-2D block and that in the upsampling block are replaced with 1D TFAN and 2D TFAN, respectively, as shown in Figure 3. In TFAN, we set the number of channels (i.e., C_h) and the kernel size in h (Figure 2) to 128 and 5, respectively. We examine the performance while varying N and the position where TFAN is inserted, and present our findings in Section 4.2. The discriminator is the same as that used in CycleGAN-VC2 (i.e., PatchGAN (Section 2.3)).

Figure 3: Comparison of the 1D-to-2D and upsampling blocks between (a) CycleGAN-VC2 and (b) CycleGAN-VC3. See Figure 4 in [21] for details of the overall network architectures.
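The following is a minimal PyTorch sketch of the 1D TFAN in Figure 2(a) and Equation (5). The hidden width (128) and kernel size (5) follow the values given above, and N Conv+ReLU layers process the source x before separate convolutions produce γ(x) and β(x); the resizing step and the exact layer arrangement are assumptions, since the implementation details beyond Figure 2 are not spelled out in the paper.

```python
# Illustrative 1D TFAN module (Eq. 5); layer layout is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFAN1d(nn.Module):
    def __init__(self, feat_ch, src_ch=80, hidden_ch=128, n_layers=3, kernel=5):
        super().__init__()
        # IN without learnable affine parameters: the affine part comes from x.
        self.norm = nn.InstanceNorm1d(feat_ch, affine=False)
        layers, in_ch = [], src_ch
        for _ in range(n_layers):                      # multi-layer CNN on the source x
            layers += [nn.Conv1d(in_ch, hidden_ch, kernel, padding=kernel // 2), nn.ReLU()]
            in_ch = hidden_ch
        self.shared = nn.Sequential(*layers)
        self.to_gamma = nn.Conv1d(hidden_ch, feat_ch, kernel, padding=kernel // 2)
        self.to_beta = nn.Conv1d(hidden_ch, feat_ch, kernel, padding=kernel // 2)

    def forward(self, f, x):
        # f: (B, C_f, T_f) intermediate feature; x: (B, Q_x, T_x) source mel-spectrogram.
        # Resize x along time so that gamma/beta match the temporal resolution of f.
        x = F.interpolate(x, size=f.shape[-1], mode="nearest")
        h = self.shared(x)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        return gamma * self.norm(f) + beta             # element-wise scale and bias (Eq. 5)
```

A 2D TFAN would be built analogously from Conv2d layers and nn.InstanceNorm2d, resizing x in both time and frequency; in the CycleGAN-VC3 generator, such modules replace the plain IN in the 1D-to-2D and upsampling blocks (Figure 3).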
4. Experiments
We evaluated CycleGAN-VCs on the Spoke (i.e., non-parallel VC) task of VCC 2018 [28], which contains recordings of professional US English speakers. We selected a subset of speakers covering all inter-gender and intra-gender VC: VCC2SF3 (SF), VCC2SM3 (SM), VCC2TF1 (TF), and VCC2TM1 (TM), where S, T, F, and M represent source, target, female, and male, respectively. All combinations of the two sources and two targets (SF-TF, SM-TM, SF-TM, and SM-TF) were examined.
Conversion process.
One aim in this study was to examine the feasibility of using CycleGAN-VCs for mel-spectrogram conversion. Hence, we used CycleGAN-VCs for mel-spectrogram conversion and synthesized waveforms using the pretrained MelGAN vocoder [39] (https://github.com/descriptinc/melgan-neurips). We did not alter the parameters of the vocoder so that we could focus on the evaluation of mel-spectrogram conversion; however, fine-tuning them for each speaker is a possible means for improvement.
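As a rough illustration of the input representation, the following sketch extracts an 80-band log-mel-spectrogram with librosa. The STFT settings shown (22.05 kHz sampling, 1024-point FFT, 256-sample hop) follow the public MelGAN recipe and are assumptions here, not parameters stated in the paper.

```python
# Hedged sketch of log-mel-spectrogram extraction for the conversion model.
import librosa
import numpy as np

def logmel(path, sr=22050, n_fft=1024, hop=256, n_mels=80):
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))   # (n_mels, frames)
```

A randomly cropped 64-frame segment of such a mel-spectrogram (about 0.74 s at a 256-sample hop) would then form one training sample, matching the training settings below.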
Network architectures.
As the acoustic feature is changed from mel-cepstrum to mel-spectrogram, the feature dimension increases from 35 to 80. However, the generators of CycleGAN-VCs are fully convolutional; therefore, they can be used without modifying the network architecture. Regarding the discriminators, we used the same network architectures as those for mel-cepstrum conversion, except that in CycleGAN-VC2/VC3, the kernel size in the second-to-last convolutional layer was doubled in the frequency direction (see Figure 4 in [21] for details of the original network architectures).
Training settings.
The training settings were similar to those used in CycleGAN-VC/VC2 for mel-cepstrum conversion [19, 21]. For preprocessing, we normalized the mel-spectrograms using the mean and variance of the training data. We used the least squares GAN [40] as the GAN objective. We trained the networks using the Adam optimizer [41] with a batch size of 1. A training sample consisted of randomly cropped 64 frames (approximately 0.75 s). The learning rates were set to 0.0002 for the generators and 0.0001 for the discriminators, with momentum terms β1 and β2 of 0.5 and 0.999, respectively. λ_cyc and λ_id were set to 10 and 5, respectively, and L_id was used only during the early iterations of training. Note that, similar to the original CycleGAN-VC/VC2, we did not use extra data, modules, or time alignment procedures for training.

Objective evaluation.
We conducted an objective evaluation to investigate the effect of TFAN parameter selection and the performance difference among CycleGAN-VCs. Direct measurement of the difference between the target and converted mel-spectrograms is difficult because their alignment is not trivial. As an alternative, we used two evaluation metrics that are commonly used in previous VC studies [21, 23, 25]: the mel-cepstral distortion (MCD), which measures the global structural difference based on the target and converted mel-cepstra, and the modulation-spectra distance (MSD), which assesses the local structural difference based on the target and converted modulation spectra of mel-cepstra. For both metrics, the smaller the value, the better the performance. 35-dimensional mel-cepstrum parameters were extracted from the target or converted waveforms using WORLD [42].

Table 1: Comparison of MCD and MSD using (a) different-depth TFAN, (b) TFAN in different positions, and (c) different models, for the pairs SF-TF, SM-TM, SF-TM, and SM-TF. The results are listed as MCD [dB]/MSD [dB], where bold and italic numbers indicate the best and second-best scores, respectively.
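As a reference for the MCD values in Table 1, the standard MCD computation can be sketched as follows. It assumes that the target and converted mel-cepstra have already been time-aligned (e.g., by DTW) and that the 0th (energy) coefficient is excluded, which is the common convention; the paper does not spell out these details.

```python
# Hedged sketch of the mel-cepstral distortion (MCD) metric.
import numpy as np

def mcd_db(mc_target, mc_converted):
    # mc_*: (frames, 35) WORLD mel-cepstra, frame-aligned.
    diff = mc_target[:, 1:] - mc_converted[:, 1:]      # drop the 0th coefficient
    dist_per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(dist_per_frame)
```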
Effect of TFAN parameter selection.
We initially examined the effect of the TFAN parameter selection. In particular, we investigated the performance while varying the depth of TFAN (N in Figure 2) and the position where TFAN is inserted (the 1D-to-2D and/or upsampling blocks; see Figure 3). Tables 1(a) and (b) list the respective results. Our major findings are as follows. (1) Comparison of different-depth TFAN (Table 1(a)): we found that (i) 1-depth TFAN shows the worst scores except for SF-TF, where its performance is comparable with that of the others, and (ii) the scores reach their peak at approximately N = 3. This indicates the importance of performing dynamic changes using a multi-layer CNN, and differs from the semantic image synthesis setting to which SPADE [27] is applied. (2) Comparison of TFAN positions (Table 1(b)): we found that the joint usage of 1D TFAN and 2D TFAN in the 1D-to-2D and upsampling blocks, respectively, is the most effective. Therefore, we set N to 3 and used TFAN in both positions in the following experiments.

Comparison among CycleGAN-VCs.
We analyzed the performance differences among the CycleGAN-VCs. In addition to CycleGAN-VC (V1), -VC2 (V2), and -VC3 (V3), we examined CycleGAN-VC2 with U-net [43] (V2U), which might be useful as an alternative to TFAN for propagating the source information to the converted features. Table 1(c) summarizes the results. V2 and V3 generally have the best or second-best scores. Between the two, V3 showed a better performance in most cases in terms of both metrics. Furthermore, we show the qualitative and subjective comparisons in Figure 1 and Section 4.3, respectively.

Subjective evaluation.
We conducted listening tests to assess the applicability of CycleGAN-VCs to mel-spectrogram conversion. We compared four models: CycleGAN-VC2 with mel-cepstrum conversion (the current best and benchmark model; denoted by B) and CycleGAN-VC, -VC2, and -VC3 with mel-spectrogram conversion (denoted by V1, V2, and V3, respectively). To measure the naturalness and speaker similarity, we conducted mean opinion score (MOS) tests, where we included the target ground-truth speech (GT) as anchor samples. In the naturalness test, 1 = bad, 2 = poor, 3 = fair, 4 = good, and 5 = excellent; in the speaker similarity test, 1 = very different, 2 = moderately different, 3 = fair, 4 = moderately similar, and 5 = very similar. In both tests, all evaluation data (35 utterances) were used for evaluation, and each utterance was evaluated once. In the speaker similarity test, we paired the converted speech and target speech (of which the utterance contents were different) in random order. Nine and 11 listeners participated in the naturalness and speaker similarity tests, respectively. Audio samples are available online.

Figure 4: MOS for (a) naturalness and (b) speaker similarity with 95% confidence intervals, for each VC pair (SF-TF, SM-TM, SF-TM, and SM-TF) and the ground truth (GT).

Figure 4 shows the results. Our main findings are as follows. (1) Comparison between CycleGAN-VC2 for mel-cepstrum (B) and that for mel-spectrogram (V2): V2 is particularly effective for inter-gender VC (SF-TM and SM-TF) in terms of naturalness, whereas for speaker similarity, the results are case-dependent. This indicates that the direct application to the mel-spectrogram (i.e., V2) is not necessarily reasonable. (2) Comparison among CycleGAN-VCs (V1, V2, and V3): in most cases, the performance improves in terms of both metrics as the version increases. This confirms the utility of V2 and V3 compared with the previous versions. (3) Effectiveness of CycleGAN-VC3 (V3): V3 has a better or competitive performance compared with the other models. This indicates the potential of using V3 as a new benchmark method in lieu of B.
5. Conclusions
Although CycleGAN-VCs are widely used as benchmark methods, their feasibility for mel-spectrogram conversion has not been sufficiently examined. Therefore, following our examination, we proposed CycleGAN-VC3, an improvement of CycleGAN-VC2 that incorporates TFAN. Using this module, we can convert the mel-spectrogram while adaptively reflecting the source mel-spectrogram. The experimental results indicate the potential of using CycleGAN-VC3 as a new benchmark method to replace CycleGAN-VC2. Application of TFAN to advanced tasks, such as multi-domain VC [22, 23, 24] and application-side VC [1, 2, 3, 4, 5], remains an interesting future direction.
6. Acknowledgements
This work was supported by JSPS KAKENHI 17H01763 and JST CREST Grant Number JPMJCR19A3, Japan.

7. References

[1] A. B. Kain, J.-P. Hosom, X. Niu, J. P. H. van Santen, M. Fried-Oken, and J. Staehely, "Improving the intelligibility of dysarthric speech," Speech Commun., vol. 49, no. 9, pp. 743–759, 2007.
[2] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Commun., vol. 54, no. 1, pp. 134–146, 2012.
[3] Z. Inanoglu and S. Young, "Data-driven emotion conversion in spoken English," Speech Commun., vol. 51, no. 3, pp. 268–283, 2009.
[4] T. Toda, M. Nakagiri, and K. Shikano, "Statistical voice conversion techniques for body-conducted unvoiced speech enhancement," IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505–2517, 2012.
[5] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in Proc. Interspeech, 2017, pp. 1283–1287.
[6] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, 1998.
[7] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.
[8] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 954–964, 2010.
[9] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in Proc. ICASSP, 2015, pp. 4869–4873.
[10] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," in Proc. ICASSP, 2019, pp. 6805–6809.
[11] H. Kameoka, K. Tanaka, T. Kaneko, and N. Hojo, "ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 1849–1863, 2020.
[12] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in Proc. ICME, 2016, pp. 1–6.
[13] F.-L. Xie, F. K. Soong, and H. Li, "A KL divergence and DNN-based approach to voice conversion without parallel training sentences," in Proc. Interspeech, 2016, pp. 287–291.
[14] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, "Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors," in Proc. ICASSP, 2018, pp. 5274–5278.
[15] J. Zhang, Z. Ling, and L.-R. Dai, "Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 540–552, 2020.
[16] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in Proc. APSIPA ASC, 2016, pp. 1–6.
[17] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in Proc. Interspeech, 2017, pp. 3364–3368.
[18] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 9, pp. 1432–1443, 2019.
[19] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint arXiv:1711.11293, Nov. 2017.
[20] T. Kaneko and H. Kameoka, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," in Proc. EUSIPCO, 2018, pp. 2114–2118.
[21] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," in Proc. ICASSP, 2019, pp. 6820–6824.
[22] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. SLT, 2018, pp. 266–273.
[23] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," in Proc. Interspeech, 2019, pp. 679–683.
[24] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "Non-parallel voice conversion with augmented classifier star generative adversarial networks," arXiv preprint arXiv:2008.12604, Aug. 2020.
[25] S. Lee, B. Ko, K. Lee, I.-C. Yoo, and D. Yook, "Many-to-many voice conversion using conditional cycle-consistent adversarial networks," in Proc. ICASSP, 2020, pp. 6279–6283.
[26] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "Auto-VC: Zero-shot voice style transfer with only autoencoder loss," in Proc. ICML, 2019, pp. 5210–5219.
[27] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, "Semantic image synthesis with spatially-adaptive normalization," in Proc. CVPR, 2019, pp. 2337–2346.
[28] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. Odyssey, 2018, pp. 195–202.
[29] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017, pp. 2223–2232.
[30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. NIPS, 2014, pp. 2672–2680.
[31] T. Zhou, P. Krähenbühl, M. Aubry, Q. Huang, and A. A. Efros, "Learning dense correspondence via 3D-guided cycle consistency," in Proc. CVPR, 2016, pp. 117–126.
[32] Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," in Proc. ICLR, 2017, pp. 1–15.
[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770–778.
[34] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in Proc. ICML, 2017, pp. 933–941.
[35] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino, "Generative adversarial network-based postfilter for statistical parametric speech synthesis," in Proc. ICASSP, 2017, pp. 4910–4914.
[36] C. Li and M. Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in Proc. ECCV, 2016, pp. 702–716.
[37] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, July 2016.
[38] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. ICML, 2015, pp. 448–456.
[39] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. NeurIPS, 2019, pp. 14881–14892.
[40] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in Proc. ICCV, 2017, pp. 2794–2802.
[41] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015, pp. 1–15.
[42] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016.
[43] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Proc. MICCAI, 2015, pp. 234–241.