MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo
NTT Communication Science Laboratories, NTT Corporation, Japan
ABSTRACT
Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion despite recent advances in mel-spectrogram vocoders. To overcome this, CycleGAN-VC3, an improved variant of CycleGAN-VC2 that incorporates an additional module called time-frequency adaptive normalization (TFAN), has been proposed. However, it increases the number of learned parameters. As an alternative, we propose MaskCycleGAN-VC, another extension of CycleGAN-VC2 that is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in missing frames based on surrounding frames. This task allows the converter to learn time-frequency structures in a self-supervised manner and eliminates the need for an additional module such as TFAN. A subjective evaluation of naturalness and speaker similarity showed that MaskCycleGAN-VC outperformed both CycleGAN-VC2 and CycleGAN-VC3 with a model size similar to that of CycleGAN-VC2.

Index Terms — Voice conversion (VC), non-parallel VC, generative adversarial networks (GANs), CycleGAN-VC, mel-spectrogram conversion
1. INTRODUCTION
Voice conversion (VC) is a technique for translating one voice into another without changing the linguistic content, and has been extensively studied owing to its various applications, including speaking assistance [1, 2], speech enhancement [3, 4], and accent conversion [5, 6]. Machine-learning-based approaches have been widely used, ranging from statistical modeling (e.g., Gaussian mixture models [7, 8]) to neural networks (NNs) (e.g., feedforward NNs [9], recurrent NNs [10], convolutional NNs [6], and attention networks [11, 12, 13]).

Many VC methods (including those above) are categorized as parallel VC approaches: they train a converter between the source and target speakers using parallel utterances. (Audio samples are available at .) Parallel VC has the advantage that it can train a converter in a supervised manner; however, it requires a parallel corpus, which is not always easy to collect.

As an alternative, non-parallel VC, a technique for training a converter without a parallel corpus, has attracted attention, and many such methods have been proposed. Among them, a promising approach is to utilize linguistic information to compensate for the missing parallel supervision [14, 15, 16, 17]; however, extra data or pretrained models are needed to derive such linguistic information.

To remove this requirement and solve non-parallel VC without any additional data or pretrained models, deep generative models, such as generative adversarial networks (GANs) [18] and variational autoencoders (VAEs) [19], have been introduced [20, 21, 22, 23, 24]. Among them, the families of CycleGAN-VCs (CycleGAN-VC [22, 25] and CycleGAN-VC2 [26]) and StarGAN-VCs [27, 28, 29] are significant achievements and have been widely accepted as benchmark approaches (e.g., [17, 30, 31]). However, owing to their insufficient capacity to capture time-frequency structures (e.g., the harmonic structure is compromised, as shown in Figure 1 of [32]), their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion despite recent advances in mel-spectrogram vocoders [33, 34, 35, 36, 37].

To overcome this, CycleGAN-VC3 [32], an improved variant of CycleGAN-VC2, was recently proposed; it addresses the problem by incorporating an additional module called time-frequency adaptive normalization (TFAN). Although its performance is superior, it requires an increase in the number of converter parameters (from 16M to 27M). As an alternative, we propose
MaskCycleGAN-VC, which is another extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in the missing frames based on the surrounding frames. FIF is inspired by the success of complementation-based self-supervised learning in other fields, e.g., image inpainting in computer vision [38] and text infilling in natural language processing [39, 40]. Similarly, FIF allows the converter to learn the time-frequency feature structure in a self-supervised manner through a complementation process. This property eliminates the need for an additional module such as TFAN and makes CycleGAN-VC2 applicable to mel-spectrogram conversion with negligibly small network modifications.

We investigated the effectiveness of MaskCycleGAN-VC on the Spoke (i.e., non-parallel VC) task of the Voice Conversion Challenge 2018 (VCC 2018) [41]. A subjective evaluation of naturalness and speaker similarity showed that MaskCycleGAN-VC outperformed both CycleGAN-VC2 and CycleGAN-VC3 while keeping the model size similar to that of CycleGAN-VC2.

The rest of this paper is organized as follows. In Section 2, we review CycleGAN-VC2, the baseline of our model. We then introduce the proposed MaskCycleGAN-VC in Section 3. In Section 4, we describe the experimental results. Finally, we provide concluding remarks and areas of future study in Section 5.
2. CONVENTIONAL CYCLEGAN-VC2
The purpose of CycleGAN-VC2 is to train a converter $G_{X \to Y}$ that translates source acoustic features $x \in X$ into target acoustic features $y \in Y$ without parallel supervision. Following CycleGAN [42, 43, 44], which was proposed for unpaired image-to-image translation, CycleGAN-VC2 solves this problem using an adversarial loss [18], a cycle-consistency loss [45], and an identity-mapping loss [46]. In addition, CycleGAN-VC2 uses a second adversarial loss [26] to improve the quality of the cyclically reconstructed features.

Adversarial loss.
An adversarial loss $\mathcal{L}_{adv}^{X \to Y}$ is used to make the converted feature $G_{X \to Y}(x)$ appear to be the target:

$$\mathcal{L}_{adv}^{X \to Y} = \mathbb{E}_{y \sim P_Y}[\log D_Y(y)] + \mathbb{E}_{x \sim P_X}[\log(1 - D_Y(G_{X \to Y}(x)))], \qquad (1)$$

where the discriminator $D_Y$ distinguishes a real $y$ from the generated $G_{X \to Y}(x)$ by maximizing this loss, whereas $G_{X \to Y}$ attempts to deceive $D_Y$ by minimizing it. Similarly, the inverse converter $G_{Y \to X}$ is trained with the discriminator $D_X$ using $\mathcal{L}_{adv}^{Y \to X}$.
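For concreteness, here is a minimal sketch of Eq. (1) in PyTorch, assuming hypothetical converter and discriminator modules `G_XY` and `D_Y` whose outputs are probabilities in (0, 1); note that the actual training objective in Section 4.1 is a least-squares GAN, so the log form below mirrors Eq. (1) itself rather than the implemented objective.

```python
# A minimal sketch of Eq. (1), assuming hypothetical modules G_XY and D_Y.
import torch

def adversarial_loss(G_XY, D_Y, x, y, eps=1e-8):
    # L_adv^{X->Y}: D_Y maximizes this value, while G_XY minimizes it.
    fake_y = G_XY(x)                                        # converted feature G_{X->Y}(x)
    real_term = torch.log(D_Y(y) + eps).mean()              # E_y[log D_Y(y)]
    fake_term = torch.log(1.0 - D_Y(fake_y) + eps).mean()   # E_x[log(1 - D_Y(G_{X->Y}(x)))]
    return real_term + fake_term
```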
Cycle-consistency loss.
A cycle-consistency loss $\mathcal{L}_{cyc}^{X \to Y \to X}$ is used to find a pseudo pair under the cycle-consistency constraint without parallel supervision:

$$\mathcal{L}_{cyc}^{X \to Y \to X} = \mathbb{E}_{x \sim P_X}[\| G_{Y \to X}(G_{X \to Y}(x)) - x \|_1]. \qquad (2)$$

Similarly, $\mathcal{L}_{cyc}^{Y \to X \to Y}$ is used for the inverse-forward mapping (i.e., $G_{X \to Y}(G_{Y \to X}(y))$).

Identity-mapping loss.
An identity-mapping loss $\mathcal{L}_{id}^{X \to Y}$ is used to enhance input preservation:

$$\mathcal{L}_{id}^{X \to Y} = \mathbb{E}_{y \sim P_Y}[\| G_{X \to Y}(y) - y \|_1]. \qquad (3)$$

Similarly, $\mathcal{L}_{id}^{Y \to X}$ is used for the inverse converter $G_{Y \to X}$.
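Both L1-based terms translate directly into code. A minimal sketch of Eqs. (2) and (3), continuing the hypothetical module names above:

```python
def cycle_consistency_loss(G_XY, G_YX, x):
    # L_cyc^{X->Y->X} (Eq. 2): L1 distance between x and its cyclic reconstruction.
    return (G_YX(G_XY(x)) - x).abs().mean()

def identity_mapping_loss(G_XY, y):
    # L_id^{X->Y} (Eq. 3): feeding a target-domain sample to G_{X->Y} should change nothing.
    return (G_XY(y) - y).abs().mean()
```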
Second adversarial loss.
A second adversarial loss $\mathcal{L}_{adv2}^{X \to Y \to X}$ is used to mitigate the statistical averaging caused by the L1 loss in Eq. (2):

$$\mathcal{L}_{adv2}^{X \to Y \to X} = \mathbb{E}_{x \sim P_X}[\log D'_X(x)] + \mathbb{E}_{x \sim P_X}[\log(1 - D'_X(G_{Y \to X}(G_{X \to Y}(x))))], \qquad (4)$$

where the discriminator $D'_X$ distinguishes the reconstructed $G_{Y \to X}(G_{X \to Y}(x))$ from a real $x$. Similarly, $\mathcal{L}_{adv2}^{Y \to X \to Y}$ is used for the inverse-forward mapping with an additional discriminator $D'_Y$.

[Fig. 1. Pipeline of FIF for the forward-inverse mapping ($\hat{x} = x \cdot m$; forward conversion by $G_{X \to Y}^{mask}$ under the adversarial loss, inverse conversion by $G_{Y \to X}^{mask}$ under the second adversarial and cycle-consistency losses). We encourage the converter to fill in the missing frames (surrounded by the red box) based on the surrounding frames through a cyclic conversion process. In practice, a similar procedure is used for the inverse-forward mapping.]
Full objective.
The full objective $\mathcal{L}_{full}$ is written as follows:

$$\mathcal{L}_{full} = \mathcal{L}_{adv}^{X \to Y} + \mathcal{L}_{adv}^{Y \to X} + \lambda_{cyc}(\mathcal{L}_{cyc}^{X \to Y \to X} + \mathcal{L}_{cyc}^{Y \to X \to Y}) + \lambda_{id}(\mathcal{L}_{id}^{X \to Y} + \mathcal{L}_{id}^{Y \to X}) + \mathcal{L}_{adv2}^{X \to Y \to X} + \mathcal{L}_{adv2}^{Y \to X \to Y}, \qquad (5)$$

where $\lambda_{cyc}$ and $\lambda_{id}$ are weighting parameters. $G_{X \to Y}$ and $G_{Y \to X}$ are optimized by minimizing this loss, whereas $D_X$, $D_Y$, $D'_X$, and $D'_Y$ are optimized by maximizing it.
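A sketch of how the terms combine in Eq. (5), assuming the individual losses of Eqs. (1)–(4) have already been computed and stored in a dict with hypothetical keys:

```python
def full_objective(losses, lambda_cyc=10.0, lambda_id=5.0):
    # Eq. (5): the converters minimize this quantity, while the four
    # discriminators maximize their respective adversarial terms.
    return (losses["adv_XY"] + losses["adv_YX"]
            + lambda_cyc * (losses["cyc_XYX"] + losses["cyc_YXY"])
            + lambda_id * (losses["id_XY"] + losses["id_YX"])
            + losses["adv2_XYX"] + losses["adv2_YXY"])
```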
3. MASKCYCLEGAN-VC

3.1. Training with Filling in Frames (FIF)
As shown in [32], CycleGAN-VC2, which was developed for mel-cepstrum conversion, does not have sufficient ability to capture the time-frequency structure in mel-spectrogram conversion; consequently, the harmonic structure is often compromised. To alleviate this, we devised MaskCycleGAN-VC, which is trained using the auxiliary FIF task. We present the overall pipeline of FIF in Fig. 1.

Given the source mel-spectrogram $x$, we first create a temporal mask $m \in M$ of the same size as $x$, parts of which have a value of 0 (the black region in Fig. 1) while the remaining parts have a value of 1 (the white region in Fig. 1). The masked (i.e., zero) region is randomly determined based on a predetermined rule (the effect of which is examined in Section 4.2).

Subsequently, we apply the mask $m$ to $x$ as follows:

$$\hat{x} = x \cdot m, \qquad (6)$$

where $\cdot$ denotes the element-wise product. This procedure artificially creates missing frames, as shown in the region surrounded by the red box in Fig. 1.

Next, the MaskCycleGAN-VC converter $G_{X \to Y}^{mask}$ synthesizes $y'$ from $\hat{x}$ and $m$ as follows:

$$y' = G_{X \to Y}^{mask}(\mathrm{concat}(\hat{x}, m)), \qquad (7)$$

where $\mathrm{concat}$ denotes channel-wise concatenation. By using $m$ as conditional information, $G_{X \to Y}^{mask}$ can fill in the frames while knowing which frames need to be filled in.

Similar to CycleGAN-VC2, we can ensure that $y'$ is in the target $Y$ by using an adversarial loss (Eq. (1)), but we cannot compare $y'$ with the ground truth directly owing to the lack of parallel supervision. As an alternative, we aim to fill in the frames through a cyclic conversion process. To do so, we reconstruct $x''$ using the inverse converter $G_{Y \to X}^{mask}$:

$$x'' = G_{Y \to X}^{mask}(\mathrm{concat}(y', m')), \qquad (8)$$

where $m'$ is an all-ones matrix under the assumption that the missing frames have been filled in ahead of this process. We then apply the cycle-consistency loss to the original and reconstructed mel-spectrograms:

$$\mathcal{L}_{mcyc}^{X \to Y \to X} = \mathbb{E}_{x \sim P_X, m \sim P_M}[\| x'' - x \|_1], \qquad (9)$$

where we simultaneously use a second adversarial loss (Eq. (4)) for $x''$.

To optimize $\mathcal{L}_{mcyc}^{X \to Y \to X}$, $G_{X \to Y}^{mask}$ needs to derive information useful for filling in the missing frames from the surrounding frames. This induction is useful for learning the time-frequency structure of a mel-spectrogram in a self-supervised manner. Note that similar effects have been observed for similar tasks in other fields (e.g., image inpainting [38] and text infilling [39, 40]), as mentioned in Section 1. Finally, it should be noted that (1) unlike CycleGAN-VC3, which uses TFAN, MaskCycleGAN-VC does not require a large increase in converter parameters (only the input channels are doubled to receive $m$ along with $\hat{x}$), and (2) FIF is a type of self-supervised learning; therefore, neither extra data nor a pretrained model (e.g., linguistic information) is required.

A remaining question is what mask should be used during the conversion process (i.e., the test phase). Here, we simply use an all-ones mask; thus, we convert speech under the assumption that no missing frames exist. This assumption is the same as that used in typical VC.
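Putting Eqs. (6)–(9) together, here is a minimal sketch of one FIF forward-inverse pass, assuming PyTorch, hypothetical converters `G_mask_XY` and `G_mask_YX` whose input channels are doubled to accept the mask, and a mel-spectrogram `x` of shape (B, 1, F, T):

```python
# A minimal sketch of one FIF forward-inverse pass (Eqs. (6)-(9)).
import torch

def fif_cycle(G_mask_XY, G_mask_YX, x, max_mask_ratio=0.5):
    T = x.shape[-1]                                   # number of time frames
    # Randomly choose a run of consecutive frames to drop (the FIF 0-50 rule).
    mask_len = torch.randint(0, int(max_mask_ratio * T) + 1, (1,)).item()
    start = torch.randint(0, T - mask_len + 1, (1,)).item()
    m = torch.ones_like(x)
    m[..., start:start + mask_len] = 0                # temporal mask
    x_hat = x * m                                     # Eq. (6): element-wise product
    y_prime = G_mask_XY(torch.cat([x_hat, m], dim=1))       # Eq. (7)
    m_prime = torch.ones_like(m)                      # frames assumed filled in
    x_recon = G_mask_YX(torch.cat([y_prime, m_prime], dim=1))  # Eq. (8)
    loss_mcyc = (x_recon - x).abs().mean()            # Eq. (9)
    return y_prime, x_recon, loss_mcyc
```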
4. EXPERIMENTS

4.1. Experimental conditions

Dataset.
We examined the effectiveness of MaskCycleGAN-VC on the Spoke (i.e., non-parallel VC) task of VCC 2018 [41], which contains recordings of native speakers of American English. We used a subset of speakers covering all inter- and intra-gender VC, i.e., VCC2SF3 (SF), VCC2SM3 (SM), VCC2TF1 (TF), and VCC2TM1 (TM), where S, T, F, and M indicate the sources, targets, females, and males, respectively. We used combinations of 2 sources × 2 targets (i.e., four speaker pairs in total).

Conversion and synthesis process.
For a fair comparison with CycleGAN-VC3 [32], we used the same conversion and synthesis process as CycleGAN-VC3. Namely, we applied MaskCycleGAN-VC to mel-spectrogram conversion and synthesized the waveform using the pretrained MelGAN vocoder [35] (https://github.com/descriptinc/melgan-neurips). Although, for a fair comparison, we did not change the parameters of the vocoder, fine-tuning it for each speaker is acceptable.
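At test time, conversion reduces to a single forward pass with an all-ones mask, followed by vocoding. A minimal sketch, assuming PyTorch, the converter above, and a hypothetical `vocoder` callable that maps a mel-spectrogram to a waveform (the paper uses the pretrained MelGAN vocoder; the exact MelGAN interface is not reproduced here):

```python
import torch

def convert(G_mask_XY, vocoder, x):
    m = torch.ones_like(x)                      # all-ones mask: no missing frames
    y = G_mask_XY(torch.cat([x, m], dim=1))     # mel-spectrogram conversion
    return vocoder(y)                           # waveform synthesis (e.g., MelGAN)
```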
Network architectures.
We used network architectures similar to those of CycleGAN-VC2 for mel-spectrogram conversion, which served as the baseline in the study on CycleGAN-VC3 [32] (see Figure 4 in [26] and Section 4.1 in [32] for details). The converter consists of a 2-1-2D CNN [26], and the discriminator is a PatchGAN [47]. As mentioned in Section 3.1, the only difference between CycleGAN-VC2 and MaskCycleGAN-VC is that the input channels of the converter are doubled to receive $m$ along with $\hat{x}$, as illustrated below.
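To illustrate how small this change is, the following hypothetical sketch contrasts the first 2D convolution layer of the two converters; the channel count, kernel size, and padding are assumed values for illustration, not the exact configuration from [26]:

```python
import torch.nn as nn

# CycleGAN-VC2: one input channel (the mel-spectrogram). Hyperparameters assumed.
first_conv_v2 = nn.Conv2d(in_channels=1, out_channels=128,
                          kernel_size=(5, 15), padding=(2, 7))

# MaskCycleGAN-VC: two input channels (masked mel-spectrogram + mask);
# everything else in the converter stays the same.
first_conv_mask = nn.Conv2d(in_channels=2, out_channels=128,
                            kernel_size=(5, 15), padding=(2, 7))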
Training settings.
We used the same training settings as in CycleGAN-VC3 [32]. During preprocessing, we normalized the mel-spectrograms using the training set statistics. We used a least-squares GAN [48] as the GAN objective. We trained the networks for k iterations using the Adam optimizer [49], with the learning rates of the converter and discriminator set to 0.0002 and 0.0001, respectively, and momentum terms $\beta_1$ and $\beta_2$ of 0.5 and 0.999, respectively. The batch size was set to 1, where each training sample consisted of 64 randomly cropped frames (approximately 0.75 s in length). $\lambda_{cyc}$ and $\lambda_{id}$ were set to 10 and 5, respectively, and $\mathcal{L}_{id}$ was used for only the first k iterations to prevent it from disturbing the learning of the conversion. As in the previous CycleGAN-VCs, we did not use extra data, pretrained models, or a time alignment procedure for training.
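A sketch of the optimizer setup and random frame cropping described above, assuming PyTorch and hypothetical parameter lists `converter_params` and `discriminator_params`:

```python
import torch

def build_optimizers(converter_params, discriminator_params):
    # Learning rates 0.0002 / 0.0001 and momentum terms (0.5, 0.999), as above.
    opt_G = torch.optim.Adam(converter_params, lr=2e-4, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(discriminator_params, lr=1e-4, betas=(0.5, 0.999))
    return opt_G, opt_D

def random_crop(mel, frames=64):
    # Each training sample: 64 randomly cropped frames (roughly 0.75 s).
    t0 = torch.randint(0, mel.shape[-1] - frames + 1, (1,)).item()
    return mel[..., t0:t0 + frames]
```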
4.2. Objective evaluation

We conducted an objective evaluation to examine the differences in performance when using different components. Because a direct comparison between the converted and target mel-spectrograms is difficult owing to the lack of a correct alignment, we used two metrics: (1) mel-cepstral distortion (MCD), the most commonly applied measure, which calculates the distance in the mel-cepstral domain (specifically, a 35-dimensional mel-cepstrum was extracted from the converted or target waveform using the WORLD analyzer [50]); and (2) Kernel DeepSpeech Distance (KDSD) [51], which computes the maximum mean discrepancy in the DeepSpeech2 feature space [52] and has been shown to correlate well with human judgment [51]. For both metrics, the smaller the value, the better the performance.
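MCD compares time-aligned mel-cepstra. A minimal sketch under the standard MCD definition, assuming the mel-cepstra have already been extracted (e.g., with the WORLD analyzer [50]) and time-aligned, with `c_conv` and `c_tgt` of shape (T, D) excluding the 0th (energy) coefficient; the alignment procedure itself is not shown:

```python
import numpy as np

def mcd_db(c_conv, c_tgt):
    # Per-frame Euclidean distance in the mel-cepstral domain, scaled to dB.
    frame_dist = np.sqrt(2.0 * np.sum((c_conv - c_tgt) ** 2, axis=1))
    return np.mean((10.0 / np.log(10.0)) * frame_dist)
```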
Comparison among different-sized masks.
We first examined the effect of the mask size selection. Here, the mask size indicates the size of the zero region (i.e., the black region in Fig. 1). We tested two variations (both rules are sketched in code below). (1) FIF X: The mask size is constantly X% of the input frames. Here, FIF 0 means that an all-ones mask is used. (2)
FIF 0-X: The mask size is randomly determined within the range of [0, X%]. We list the results in Table 1(a). We found that (i) FIF with a non-zero-sized mask (Nos. 2–5) outperformed that with a zero-sized mask (No. 1) regardless of the mask size, (ii) the performance is affected by the mask size (Nos. 3–5) and peaks at approximately X = 50, and (iii) FIF with a random-sized mask (No. 4) outperformed FIF with a constant-sized mask (No. 2) despite the same average size. A possible reason is that, during training, the former includes an all-ones mask, which is used in the test phase, whereas the latter does not.

[Table 1. Comparison of MCD and KDSD using (a) different-sized masks, (b) different types of masks, and (c) different CycleGAN-VCs. The results are listed as MCD [dB]/KDSD for the speaker pairs SF-TF, SM-TM, SF-TM, and SM-TF; bold numbers indicate the best scores.]
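For concreteness, a hypothetical helper contrasting the two mask-size rules for an input with T frames:

```python
import random

def sample_mask_length(T, x_percent, randomize):
    max_len = int(T * x_percent / 100)
    if randomize:                 # FIF 0-X: size drawn uniformly from [0, X%]
        return random.randint(0, max_len)
    return max_len                # FIF X: size fixed at X%
```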
Comparison among different types of masks.
We inspected the effect of the mask type selection, comparing four variations (sketched in code below). (1) FIF: Subsequent frames are masked, as shown in Fig. 1. (2) FIF NS: Non-subsequent frames (i.e., each frame is independently and randomly selected) are masked. (3) FIS: Subsequent spectral bands (e.g., the 45th–60th mel-frequency bins) are masked. (4) FIP: The mel-spectrogram is masked in a point-wise manner, similar to dropout [53]. Under all settings, we used a mask size of 0–50%, which was the best setting in the previous experiment. We summarize the results in Table 1(b). We found that FIF (No. 6) outperformed the others (Nos. 7–9) for all speaker pairs. We consider that, although learning the temporal structure is the most difficult, it is important for CycleGAN-VC2, and FIF is the most effective in mitigating this difficulty.
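For concreteness, a hypothetical helper sketching the four mask types, assuming NumPy and a mel-spectrogram of shape (F, T); mask entries are 0 where features are dropped:

```python
import numpy as np

def make_mask(F, T, kind, ratio=0.5, rng=None):
    rng = rng or np.random.default_rng()
    mask = np.ones((F, T))
    n = int(ratio * T)
    if kind == "FIF":            # consecutive frames
        s = rng.integers(0, T - n + 1)
        mask[:, s:s + n] = 0
    elif kind == "FIF_NS":       # non-consecutive frames, chosen independently
        mask[:, rng.choice(T, size=n, replace=False)] = 0
    elif kind == "FIS":          # consecutive spectral bands
        nf = int(ratio * F)
        s = rng.integers(0, F - nf + 1)
        mask[s:s + nf, :] = 0
    elif kind == "FIP":          # point-wise, dropout-style
        mask = (rng.random((F, T)) >= ratio).astype(float)
    return mask
```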
Comparison among CycleGAN-VCs.
We examined the differences in performance among (1) MaskCycleGAN-VC (Mask; in particular, FIF 0-50 was used); (2) CycleGAN-VC2 [26] (V2), which was the same as Mask except that FIF was not used; and (3) CycleGAN-VC3 [32] (V3), which applied TFAN instead of FIF. The results are listed in Table 1(c). We found that Mask (No. 10) outperformed both V2 (No. 11) and V3 (No. 12) in most cases while reducing the model size compared to V3. Further evidence is provided in the next section.

4.3. Subjective evaluation

We conducted listening tests to investigate the differences in perceptual quality. As the benchmark performance of mel-spectrogram conversion based on CycleGAN-VCs was previously examined in [32], we investigated the comparative performance between Mask and V2 and that between Mask and V3 using two forced-choice preference tests. In the AB test on naturalness, each listener was presented with two speech samples (A and B) and asked to choose the preferred one (A or B) considering both naturalness and intelligibility. In the XAB test on speaker similarity, each listener was presented with three speech samples, namely the comparison targets (A and B) and a reference with a different utterance (X), and asked to choose the one (A or B) whose speaker characteristics were closer to those of X. These tests were conducted online, and 15 and 16 listeners participated in the AB and XAB tests, respectively. Sentences, comparison targets, and the comparison order (AB or BA) were randomly chosen from the collection of speech samples. We gathered at least 300 answers for each model pair. Audio samples are available from the link presented on the first page.

[Fig. 2. Average preference scores on naturalness with confidence intervals, comparing V2 vs. Mask and V3 vs. Mask for all, intra-gender, and inter-gender conversions. The numbers in parentheses (3.10e-28, 6.42e-16, 5.49e-14 for V2 vs. Mask; 3.87e-7, 1.01e-4, 7.17e-4 for V3 vs. Mask) indicate the p-values computed using a one-tailed binomial test.]

[Fig. 3. Average preference scores on speaker similarity with confidence intervals, comparing V2 vs. Mask and V3 vs. Mask for all, intra-gender, and inter-gender conversions. The numbers in parentheses (2.07e-14, 6.08e-9, 4.46e-7 for V2 vs. Mask; 1.58e-4, 7.93e-4, 3.03e-2 for V3 vs. Mask) denote the p-values calculated using a one-tailed binomial test.]

We show the results of the AB test on naturalness and the XAB test on speaker similarity in Figs. 2 and 3, respectively. We found that in both tests, Mask achieved statistically significantly better scores than V2 and V3, with p-values below 0.05 in all cases (Figs. 2 and 3).
5. CONCLUSIONS
Motivated by recent advances in mel-spectrogram vocoders, we proposed MaskCycleGAN-VC, an improvement of CycleGAN-VC2 for mel-spectrogram conversion. To learn the time-frequency structure of a mel-spectrogram without an additional module such as TFAN, we introduced FIF, which allows the converter to learn such structure in a self-supervised manner. The experimental results showed that MaskCycleGAN-VC outperformed both CycleGAN-VC2 and CycleGAN-VC3 while maintaining a model size similar to that of CycleGAN-VC2. Examining the generality of FIF is an interesting research topic, and future work includes applications to multi-domain VC [27, 28, 29] and application-side VC [1, 2, 3, 4, 5, 6].
Acknowledgements:
This work was supported by JSPS KAKENHI 17H01763 and JST CREST Grant Number JPMJCR19A3, Japan.

REFERENCES

[1] Alexander B. Kain, John-Paul Hosom, Xiaochuan Niu, Jan P. H. van Santen, Melanie Fried-Oken, and Janice Staehely, "Improving the intelligibility of dysarthric speech," Speech Commun., vol. 49, no. 9, pp. 743–759, 2007.
[2] Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Commun., vol. 54, no. 1, pp. 134–146, 2012.
[3] Zeynep Inanoglu and Steve Young, "Data-driven emotion conversion in spoken English," Speech Commun., vol. 51, no. 3, pp. 268–283, 2009.
[4] Tomoki Toda, Mikihiro Nakagiri, and Kiyohiro Shikano, "Statistical voice conversion techniques for body-conducted unvoiced speech enhancement," IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505–2517, 2012.
[5] Daniel Felps, Heather Bortfeld, and Ricardo Gutierrez-Osuna, "Foreign accent conversion in computer assisted pronunciation training," Speech Commun., vol. 51, no. 10, pp. 920–932, 2009.
[6] Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in Proc. Interspeech, 2017, pp. 1283–1287.
[7] Yannis Stylianou, Olivier Cappé, and Eric Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, 1998.
[8] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.
[9] Srinivas Desai, Alan W. Black, B. Yegnanarayana, and Kishore Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 954–964, 2010.
[10] Lifa Sun, Shiyin Kang, Kun Li, and Helen Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in Proc. ICASSP, 2015, pp. 4869–4873.
[11] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, and Li-Rong Dai, "Sequence-to-sequence acoustic modeling for voice conversion," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 3, pp. 631–644, 2019.
[12] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo, "AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," in Proc. ICASSP, 2019, pp. 6805–6809.
[13] Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko, and Nobukatsu Hojo, "ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 1849–1863, 2020.
[14] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in Proc. ICME, 2016, pp. 1–6.
[15] Feng-Long Xie, Frank K. Soong, and Haifeng Li, "A KL divergence and DNN-based approach to voice conversion without parallel training sentences," in Proc. Interspeech, 2016, pp. 287–291.
[16] Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi, "Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors," in Proc. ICASSP, 2018, pp. 5274–5278.
[17] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai, "Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 540–552, 2020.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Proc. NIPS, 2014, pp. 2672–2680.
[19] Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes," in Proc. ICLR, 2014.
[20] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in Proc. APSIPA ASC, 2016, pp. 1–6.
[21] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in Proc. Interspeech, 2017, pp. 3364–3368.
[22] Takuhiro Kaneko and Hirokazu Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint arXiv:1711.11293, Nov. 2017.
[23] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 9, pp. 1432–1443, 2019.
[24] Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, and Tomoki Toda, "Non-parallel voice conversion with cyclic variational autoencoder," in Proc. Interspeech, 2019, pp. 674–678.
[25] Takuhiro Kaneko and Hirokazu Kameoka, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," in Proc. EUSIPCO, 2018, pp. 2114–2118.
[26] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," in Proc. ICASSP, 2019, pp. 6820–6824.
[27] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. SLT, 2018, pp. 266–273.
[28] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," in Proc. Interspeech, 2019, pp. 679–683.
[29] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "Non-parallel voice conversion with augmented classifier star generative adversarial networks," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 2982–2995, 2020.
[30] Shindong Lee, BongGu Ko, Keonnyeong Lee, In-Chul Yoo, and Dongsuk Yook, "Many-to-many voice conversion using conditional cycle-consistent adversarial networks," in Proc. ICASSP, 2020, pp. 6279–6283.
[31] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in Proc. ICML, 2019, pp. 5210–5219.
[32] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "CycleGAN-VC3: Examining and improving CycleGAN-VCs for mel-spectrogram conversion," in Proc. Interspeech, 2020, pp. 2017–2021.
[33] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
[34] Ryan Prenger, Rafael Valle, and Bryan Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. ICASSP, 2019, pp. 3617–3621.
[35] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. NeurIPS, 2019, pp. 14881–14892.
[36] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199–6203.
[37] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan, "WaveGrad: Estimating gradients for waveform generation," arXiv preprint arXiv:2009.00713, Sept. 2020.
[38] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros, "Context Encoders: Feature learning by inpainting," in Proc. CVPR, 2016, pp. 2536–2544.
[39] William Fedus, Ian Goodfellow, and Andrew M. Dai, "MaskGAN: Better text generation via filling in the ______," in Proc. ICLR, 2018.
[40] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL, 2019, pp. 4171–4186.
[41] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, "The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. Odyssey, 2018, pp. 195–202.
[42] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017, pp. 2223–2232.
[43] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in Proc. ICCV, 2017, pp. 2849–2857.
[44] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim, "Learning to discover cross-domain relations with generative adversarial networks," in Proc. ICML, 2017, pp. 1857–1865.
[45] Tinghui Zhou, Philipp Krähenbühl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros, "Learning dense correspondence via 3D-guided cycle consistency," in Proc. CVPR, 2016, pp. 117–126.
[46] Yaniv Taigman, Adam Polyak, and Lior Wolf, "Unsupervised cross-domain image generation," in Proc. ICLR, 2017.
[47] Chuan Li and Michael Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in Proc. ECCV, 2016, pp. 702–716.
[48] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley, "Least squares generative adversarial networks," in Proc. ICCV, 2017, pp. 2794–2802.
[49] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015.
[50] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016.
[51] Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, and Karen Simonyan, "High fidelity speech synthesis with adversarial networks," in Proc. ICLR, 2020.
[52] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Proc. ICML, 2016, pp. 173–182.
[53] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.