Nonparallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo
H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo are with NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, Atsugi, Kanagawa, 243-0198 Japan (e-mail: [email protected]).
Abstract—We previously proposed a method that allows for non-parallel voice conversion (VC) by using a variant of generative adversarial networks (GANs) called StarGAN. The main features of our method, called StarGAN-VC, are as follows: First, it requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training. Second, it can simultaneously learn mappings across multiple domains using a single generator network, so that it can fully exploit available training data collected from multiple domains to capture latent features that are common to all the domains. Third, it is able to generate converted speech signals quickly enough to allow real-time implementations and requires only several minutes of training examples to generate reasonably realistic-sounding speech. In this paper, we describe three formulations of StarGAN, including a newly introduced StarGAN variant called the “augmented classifier StarGAN (A-StarGAN)”, and compare them in a non-parallel VC task. We also compare them with several baseline methods.
Index Terms—Voice conversion (VC), non-parallel VC, multi-domain VC, generative adversarial networks (GANs), CycleGAN, StarGAN, A-StarGAN.
I. INTRODUCTION
Voice conversion (VC) is a task of converting the voice of a source speaker without changing the uttered sentence. Examples of the applications of VC techniques include speaker-identity modification [1], speaking assistance [2], [3], speech enhancement [4]–[6], bandwidth extension [7], and accent conversion [8].

One successful VC framework involves a Gaussian mixture model (GMM)-based approach [9]–[11], which utilizes acoustic models represented by GMMs for feature mapping. Recently, a neural network (NN)-based framework [12]–[30] and an exemplar-based framework based on non-negative matrix factorization (NMF) [31]–[33] have also proved successful.

Many conventional VC methods, including those mentioned above, require accurately aligned parallel source and target speech data. However, in many scenarios, it is not always possible to collect parallel utterances. Even if we could collect such data, we typically need to perform time alignment procedures, which become relatively difficult when there is a large acoustic gap between the source and target speech. Since many frameworks are sensitive to the misalignment found in parallel data, careful pre-screening and manual correction may be required to make these frameworks work reliably. To bypass these restrictions, this paper is concerned with developing a non-parallel VC method, which requires no parallel utterances, transcriptions, or time alignment procedures.
Recently, some attempts have been made to develop non-parallel methods [17]–[30]. For example, a method using automatic speech recognition (ASR) was proposed in [24]. The idea is to convert input speech under the restriction that the posterior state probability of the acoustic model of an ASR system is preserved so that the transcription of the converted speech becomes consistent with that of the input speech. Since the performance of this method depends heavily on the quality of the acoustic model of ASR, it can fail to work if ASR does not function reliably. A method using i-vectors [34], known as a feature for speaker verification, was recently proposed in [25]. Conceptually, the idea is to shift the acoustic features of input speech towards target speech in the i-vector space so that the converted speech is likely to be recognized as the target speaker by a speaker recognizer. While this method is also free from parallel data, one limitation is that it is applicable only to speaker identity conversion tasks.

Recently, a framework based on conditional variational autoencoders (CVAEs) [35], [36] was proposed in [22], [29], [30]. As the name implies, variational autoencoders (VAEs) are a probabilistic counterpart of autoencoders (AEs), consisting of encoder and decoder networks. CVAEs [36] are an extended version of VAEs where the encoder and decoder networks can take a class indicator variable as an additional input. By using acoustic features as the training examples and the associated domain class labels, the networks learn how to convert source speech to a target domain according to the domain class label fed into the decoder. This CVAE-based VC approach is notable in that it is completely free from parallel data and works even with unaligned corpora. However, one well-known problem with VAEs is that outputs from the decoder tend to be oversmoothed. For VC applications, this can be problematic since it usually results in poor-quality, buzzy-sounding speech.

One powerful framework that can potentially overcome the weakness of VAEs involves generative adversarial networks (GANs) [37]. GANs offer a general framework for training a generator network so that it can generate fake data samples that can deceive a real/fake discriminator network in the form of a minimax game. While they have been found to be effective for use with image generation, in recent years they have also been employed with notable success for various speech processing tasks [16], [38]–[42]. We previously reported a non-parallel VC method using a GAN variant called cycle-consistent GAN (CycleGAN) [26], which was originally proposed as a method for translating images using unpaired training examples [43]–[45]. Although this method, which we call CycleGAN-VC, was shown to work reasonably well, one major limitation is that it only learns mappings between a single pair of domains. In many VC application scenarios, it is desirable to be able to convert speech into multiple domains, not just one. One naive way of applying CycleGAN to multi-domain VC tasks would be to prepare and train a different mapping pair for each domain pair.
However, this can be ineffective since each mapping pair fails to use the training data of the other domains for learning, even though there must be a common set of latent features that can be shared across different domains.

To overcome the shortcomings and limitations of CVAE-VC [22] and CycleGAN-VC [26], we previously proposed a non-parallel VC method [46] using another GAN variant called StarGAN [47], which offers the advantages of CVAE-VC and CycleGAN-VC concurrently. Unlike CycleGAN-VC and as with CVAE-VC, this method, called StarGAN-VC, is capable of simultaneously learning multiple mappings using a single generator network so that it can fully use available training data collected from multiple domains. Unlike CVAE-VC and as with CycleGAN-VC, StarGAN-VC uses an adversarial loss for generator training to encourage the generator outputs to become indistinguishable from real speech. It is also noteworthy that unlike CVAE-VC and CycleGAN-VC, StarGAN-VC does not require any information about the domain of the input speech at test time.

The remainder of this paper is organized as follows. After reviewing other related work in Section II, we briefly describe the formulation of CycleGAN-VC in Section III, present three formulations of StarGAN-VC in Section IV, and show experimental results in Section V.
II. RELATED WORK
Other natural ways of overcoming the weakness of VAEs include the VAE-GAN framework [48]. A non-parallel VC method based on this framework has already been proposed in [23]. With this approach, an adversarial loss derived using a GAN discriminator is incorporated into the training loss to encourage the decoder outputs of a CVAE to be indistinguishable from real speech features. Although the concept is similar to our StarGAN-VC approach, we will show in Section V that our approach outperforms this method in terms of both the speech quality and conversion effect.

Another related technique worth noting is the vector quantized VAE (VQ-VAE) approach [27], which has performed impressively in non-parallel VC tasks. This approach is particularly notable in that it offers a novel way of overcoming the weakness of VAEs by using the WaveNet model [49], a sample-by-sample neural signal generator, to devise both the encoder and decoder of a discrete counterpart of CVAEs. The original WaveNet model is a recursive model that makes it possible to predict the distribution of a sample conditioned on the samples the generator has produced. While a faster version [50] has recently been proposed, it typically requires a huge computational cost to generate a stream of samples, which can cause difficulties when implementing real-time systems. The model is also known to require a huge number of training examples to be able to generate natural-sounding speech. By contrast, our method is noteworthy in that it is able to generate signals quickly enough to allow real-time implementation and requires only several minutes of training examples to generate reasonably realistic-sounding speech.

Meanwhile, given the recent success of the sequence-to-sequence (S2S) learning framework in various tasks, several VC methods based on S2S models have been proposed, including the ones we proposed previously [51]–[54]. While S2S models usually require parallel corpora for training, an attempt has also been made to train an S2S model using non-parallel utterances [55]. However, it requires phoneme transcriptions as auxiliary information for model training.
III. CYCLEGAN VOICE CONVERSION
Since StarGAN-VC is an extension of CycleGAN-VC, which we also proposed previously [26], we start by briefly reviewing its formulation (Fig. 1).

Let $\mathbf{x} \in \mathbb{R}^{Q \times N}$ and $\mathbf{y} \in \mathbb{R}^{Q \times M}$ be acoustic feature sequences of speech belonging to domains $X$ and $Y$, respectively, where $Q$ is the feature dimension and $N$ and $M$ are the lengths of the sequences. In the following, we restrict our attention to speaker identity conversion tasks, so when we use the term domain, we mean speaker. The aim of CycleGAN-VC is to learn a mapping $G$ that converts the domain of $\mathbf{x}$ into $Y$ and a mapping $F$ that does the opposite. Now, we introduce discriminators $D_X$ and $D_Y$, whose roles are to predict whether or not their inputs are the acoustic features of real speech belonging to $X$ and $Y$, and define

$$\mathcal{L}_{\mathrm{adv}}^{D_Y}(D_Y) = -\mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\log D_Y(\mathbf{y})] - \mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\log(1 - D_Y(G(\mathbf{x})))], \quad (1)$$
$$\mathcal{L}_{\mathrm{adv}}^{G}(G) = \mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\log(1 - D_Y(G(\mathbf{x})))], \quad (2)$$
$$\mathcal{L}_{\mathrm{adv}}^{D_X}(D_X) = -\mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\log D_X(\mathbf{x})] - \mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\log(1 - D_X(F(\mathbf{y})))], \quad (3)$$
$$\mathcal{L}_{\mathrm{adv}}^{F}(F) = \mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\log(1 - D_X(F(\mathbf{y})))], \quad (4)$$

as the adversarial losses for $D_Y$, $G$, $D_X$, and $F$, respectively. $\mathcal{L}_{\mathrm{adv}}^{D_Y}(D_Y)$ and $\mathcal{L}_{\mathrm{adv}}^{D_X}(D_X)$ measure how indistinguishable $G(\mathbf{x})$ and $F(\mathbf{y})$ are from acoustic features of real speech belonging to $Y$ and $X$. Since the goal of $D_X$ and $D_Y$ is to correctly distinguish the converted feature sequences obtained via $G$ and $F$ from real speech feature sequences, $D_X$ and $D_Y$ attempt to minimize these losses to avoid being fooled by $G$ and $F$. Conversely, since one of the goals of $G$ and $F$ is to generate realistic-sounding speech that is indistinguishable from real speech, $G$ and $F$ attempt to maximize these losses, or equivalently minimize $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ and $\mathcal{L}_{\mathrm{adv}}^{F}(F)$, to fool $D_Y$ and $D_X$. It can be shown that the output distributions of $G$ and $F$ trained in this way will match the empirical distributions $p_Y(\mathbf{y})$ and $p_X(\mathbf{x})$ if $G$, $F$, $D_X$, and $D_Y$ have enough capacity [37], [43]. Note that since $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ and $\mathcal{L}_{\mathrm{adv}}^{F}(F)$ are minimized when $D_Y(G(\mathbf{x})) = 1$ and $D_X(F(\mathbf{y})) = 1$, we can also use $-\mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\log D_Y(G(\mathbf{x}))]$ and $-\mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\log D_X(F(\mathbf{y}))]$ as the adversarial losses for $G$ and $F$.

As mentioned above, training $G$ and $F$ using the adversarial losses enables the mappings $G$ and $F$ to produce outputs identically distributed as the target domains $Y$ and $X$, respectively. However, using them alone does not guarantee that $G$ or $F$ will preserve the linguistic contents of input speech since there are infinitely many mappings that will induce the same output distributions. One way to let $G$ and $F$ preserve the linguistic contents of input speech would be to encourage them to make only minimal changes from the inputs. To incentivize this behaviour, we introduce a cycle consistency loss [43]–[45]

$$\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\|F(G(\mathbf{x})) - \mathbf{x}\|_\rho^\rho] + \mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\|G(F(\mathbf{y})) - \mathbf{y}\|_\rho^\rho], \quad (5)$$

to enforce $F(G(\mathbf{x})) \simeq \mathbf{x}$ and $G(F(\mathbf{y})) \simeq \mathbf{y}$. In image-to-image translation tasks, this regularization loss contributes to enabling $G$ and $F$ to change only the textures and colors of input images while preserving the domain-independent contents. However, it was non-trivial what effect this loss would have on VC tasks. Our previous work [26] was among the first to show that it enables $G$ and $F$ to change only the voice characteristics of input speech while preserving the linguistic content. This regularization technique has recently proved effective also in VAE-based VC methods [56].
With the same motivation, we also consider an identity mapping loss

$$\mathcal{L}_{\mathrm{id}}(G, F) = \mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\|F(\mathbf{x}) - \mathbf{x}\|_\rho^\rho] + \mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\|G(\mathbf{y}) - \mathbf{y}\|_\rho^\rho], \quad (6)$$

to ensure that inputs to $G$ and $F$ are kept unchanged when the inputs already belong to $Y$ and $X$. The full objectives of CycleGAN-VC to be minimized with respect to $G$, $F$, $D_X$, and $D_Y$ are thus given as

$$\mathcal{I}_{G,F}(G, F) = \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}}^{G}(G) + \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}}^{F}(F) + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}}(G, F) + \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}}(G, F), \quad (7)$$
$$\mathcal{I}_{D}(D_X, D_Y) = \mathcal{L}_{\mathrm{adv}}^{D_X}(D_X) + \mathcal{L}_{\mathrm{adv}}^{D_Y}(D_Y), \quad (8)$$

where $\lambda_{\mathrm{adv}} \ge 0$, $\lambda_{\mathrm{cyc}} \ge 0$, and $\lambda_{\mathrm{id}} \ge 0$ are regularization parameters, which weigh the importance of the adversarial, cycle consistency, and identity mapping losses. In practice, we alternately update each of $G$, $F$, $D_X$, and $D_Y$ once at a time while keeping the others fixed.
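For concreteness, the following PyTorch-style sketch shows one way the objectives (1)–(8) could be computed for a mini-batch. It is an illustrative outline under our own simplifying assumptions (placeholder generator and discriminator modules, $\rho = 1$, and expectations replaced by batch means), not the exact training code used in our experiments.

```python
import torch

def cyclegan_vc_losses(G, F, D_X, D_Y, x, y,
                       lam_adv=1.0, lam_cyc=10.0, lam_id=5.0, eps=1e-12):
    """One-batch version of the CycleGAN-VC objectives (1)-(8), with rho = 1."""
    x2y = G(x)   # conversion X -> Y
    y2x = F(y)   # conversion Y -> X

    # Discriminator losses, Eqs. (1) and (3): real -> 1, converted -> 0.
    loss_DY = -(torch.log(D_Y(y) + eps).mean()
                + torch.log(1 - D_Y(x2y.detach()) + eps).mean())
    loss_DX = -(torch.log(D_X(x) + eps).mean()
                + torch.log(1 - D_X(y2x.detach()) + eps).mean())

    # Generator adversarial losses in the non-saturating form noted above.
    loss_G_adv = -torch.log(D_Y(x2y) + eps).mean()
    loss_F_adv = -torch.log(D_X(y2x) + eps).mean()

    # Cycle consistency (5) and identity mapping (6) losses (L1 norm).
    loss_cyc = (F(x2y) - x).abs().mean() + (G(y2x) - y).abs().mean()
    loss_id = (F(x) - x).abs().mean() + (G(y) - y).abs().mean()

    # Full objectives (7) and (8).
    loss_GF = (lam_adv * (loss_G_adv + loss_F_adv)
               + lam_cyc * loss_cyc + lam_id * loss_id)
    loss_D = loss_DX + loss_DY
    return loss_GF, loss_D
```

In an actual training loop, loss_D would be minimized with respect to $D_X$ and $D_Y$ and loss_GF with respect to $G$ and $F$ in alternation, with the other networks held fixed, as described above.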
IV. STARGAN VOICE CONVERSION
While CycleGAN-VC can only learn mappings between a single pair of speech domains, StarGAN-VC [46] can learn mappings among multiple speech domains using a single generator network, thus allowing us to fully utilize available training data collected from multiple domains. In this section, we describe three formulations of StarGAN. While the first and second formulations respectively correspond to the ones presented in [46] and [47], the third formulation is newly proposed in this paper with the aim of further improving the former two.
A. Cross-Entropy StarGAN formulation
First, we describe the formulation we introduced in [46]. Let $G$ be a generator that takes an acoustic feature sequence $\mathbf{x} \in \mathbb{R}^{Q \times N}$ belonging to an arbitrary domain and a target domain class index $k \in \{1, \ldots, K\}$ as the inputs and generates an acoustic feature sequence $\hat{\mathbf{y}} = G(\mathbf{x}, k)$. For example, if we consider speaker identities as the domain classes, each $k$ will be associated with a different speaker. One of the goals of StarGAN-VC is to make $\hat{\mathbf{y}} = G(\mathbf{x}, k)$ as realistic as real speech features and belong to domain $k$. To achieve this, we introduce a real/fake discriminator $D$, as with CycleGAN, and a domain classifier $C$, whose role is to predict to which classes an input belongs. $D$ is designed to produce a probability $D(\mathbf{y}, k)$ that an input $\mathbf{y}$ is a real speech feature sequence, whereas $C$ is designed to produce class probabilities $p_C(k|\mathbf{y})$ of $\mathbf{y}$.
Adversarial Loss: First, we define

$$\mathcal{L}_{\mathrm{adv}}^{D}(D) = -\mathbb{E}_{k \sim p(k), \mathbf{y} \sim p_d(\mathbf{y}|k)}[\log D(\mathbf{y}, k)] - \mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log(1 - D(G(\mathbf{x}, k), k))], \quad (9)$$
$$\mathcal{L}_{\mathrm{adv}}^{G}(G) = -\mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log D(G(\mathbf{x}, k), k)], \quad (10)$$

as adversarial losses for the discriminator $D$ and generator $G$, respectively, where $p(k) = 1/K$ (a uniform categorical distribution), $\mathbf{y} \sim p_d(\mathbf{y}|k)$ denotes a training example of an acoustic feature sequence of real speech in domain $k$, and $\mathbf{x} \sim p_d(\mathbf{x})$ denotes one in an arbitrary domain. $\mathcal{L}_{\mathrm{adv}}^{D}(D)$ takes a small value when $D$ correctly classifies $G(\mathbf{x}, k)$ and $\mathbf{y}$ as fake and real speech features, whereas $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ takes a small value when $G$ successfully deceives $D$ so that $G(\mathbf{x}, k)$ is misclassified as real speech features by $D$. Thus, we would like to minimize $\mathcal{L}_{\mathrm{adv}}^{D}(D)$ with respect to $D$ and minimize $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ with respect to $G$. Note that $\mathbb{E}_{k \sim p(k)}[\cdot]$ is a simplified notation for $\frac{1}{K}\sum_{k=1}^{K}(\cdot)$, and when $k$ denotes a speaker index, $\mathbb{E}_{\mathbf{y} \sim p_d(\mathbf{y}|k)}[\cdot]$ and $\mathbb{E}_{\mathbf{x} \sim p_d(\mathbf{x})}[\cdot]$ can be approximated as the sample means over the training examples of speaker $k$ and all speakers, respectively.
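As an illustration, (9) and (10) could be computed as follows, where the conditional discriminator D(·, k) is a placeholder module that takes a feature sequence together with a one-hot domain code; the batch construction is only sketched.

```python
import torch

def c_stargan_adv_losses(G, D, x, y, k, num_domains, eps=1e-12):
    """Cross-entropy adversarial losses of Eqs. (9) and (10).

    x: batch of source feature sequences drawn from arbitrary domains
    y: batch of real feature sequences drawn from domain k
    k: LongTensor of domain indices for the batch
    """
    k_onehot = torch.nn.functional.one_hot(k, num_domains).float()
    y_fake = G(x, k_onehot)                 # G(x, k)

    # Eq. (9): D should assign high probability to real (y, k) pairs and
    # low probability to generated (G(x, k), k) pairs.
    loss_D = -(torch.log(D(y, k_onehot) + eps).mean()
               + torch.log(1 - D(y_fake.detach(), k_onehot) + eps).mean())

    # Eq. (10): G is trained so that D(G(x, k), k) is pushed towards 1.
    loss_G = -torch.log(D(y_fake, k_onehot) + eps).mean()
    return loss_G, loss_D
```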
Domain Classification Loss: Next, we define

$$\mathcal{L}_{\mathrm{cls}}^{C}(C) = -\mathbb{E}_{k \sim p(k), \mathbf{y} \sim p_d(\mathbf{y}|k)}[\log p_C(k|\mathbf{y})], \quad (11)$$
$$\mathcal{L}_{\mathrm{cls}}^{G}(G) = -\mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log p_C(k|G(\mathbf{x}, k))], \quad (12)$$

as domain classification losses for the classifier $C$ and generator $G$. $\mathcal{L}_{\mathrm{cls}}^{C}(C)$ and $\mathcal{L}_{\mathrm{cls}}^{G}(G)$ take small values when $C$ correctly classifies $\mathbf{y} \sim p_d(\mathbf{y}|k)$ and $G(\mathbf{x}, k)$ as belonging to domain $k$. Thus, we would like to minimize $\mathcal{L}_{\mathrm{cls}}^{C}(C)$ with respect to $C$ and $\mathcal{L}_{\mathrm{cls}}^{G}(G)$ with respect to $G$.
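A corresponding sketch for (11) and (12), assuming a placeholder classifier C that outputs unnormalized class scores (logits) over the K domains:

```python
import torch
import torch.nn.functional as F

def c_stargan_cls_losses(G, C, x, y, k, num_domains):
    """Domain classification losses of Eqs. (11) and (12).

    C(features) is assumed to return logits of shape (batch, num_domains);
    cross_entropy applies the log-softmax internally.
    """
    k_onehot = F.one_hot(k, num_domains).float()

    # Eq. (11): train C to recognize the domain of real speech features.
    loss_C = F.cross_entropy(C(y), k)

    # Eq. (12): train G so that its outputs are classified as domain k.
    loss_G_cls = F.cross_entropy(C(G(x, k_onehot)), k)
    return loss_G_cls, loss_C
```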
Cycle Consistency Loss: Training $G$, $D$, and $C$ using only the losses presented above does not guarantee that $G$ will preserve the linguistic content of input speech. As with CycleGAN-VC, we introduce a cycle consistency loss to be minimized

$$\mathcal{L}_{\mathrm{cyc}}(G) = \mathbb{E}_{k \sim p(k), k' \sim p(k), \mathbf{x} \sim p_d(\mathbf{x}|k')}[\|G(G(\mathbf{x}, k), k') - \mathbf{x}\|_\rho^\rho], \quad (13)$$

to encourage $G(\mathbf{x}, k)$ to preserve the linguistic content of $\mathbf{x}$, where $\mathbf{x} \sim p_d(\mathbf{x}|k')$ denotes a training example of real speech feature sequences in domain $k'$ and $\rho$ is a positive constant. We also consider an identity mapping loss

$$\mathcal{L}_{\mathrm{id}}(G) = \mathbb{E}_{k' \sim p(k), \mathbf{x} \sim p_d(\mathbf{x}|k')}[\|G(\mathbf{x}, k') - \mathbf{x}\|_\rho^\rho], \quad (14)$$

to ensure that an input into $G$ will remain unchanged when the input already belongs to domain $k'$.

Fig. 1. Illustration of CycleGAN training.

Fig. 2. Illustration of C-StarGAN training. The $D$ network is designed to take the domain index $k$ as an additional input and produce the probability of $\mathbf{x}$ being a real data sample in domain $k$.

To summarize, the full objectives to be minimized with respect to $G$, $D$, and $C$ are given as

$$\mathcal{I}_{G}(G) = \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}}^{G}(G) + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}^{G}(G) + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}}(G) + \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}}(G), \quad (15)$$
$$\mathcal{I}_{D}(D) = \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}}^{D}(D), \quad (16)$$
$$\mathcal{I}_{C}(C) = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}^{C}(C), \quad (17)$$

respectively, where $\lambda_{\mathrm{adv}} \ge 0$, $\lambda_{\mathrm{cls}} \ge 0$, $\lambda_{\mathrm{cyc}} \ge 0$, and $\lambda_{\mathrm{id}} \ge 0$ are regularization parameters, which weigh the importance of the adversarial, domain classification, cycle consistency, and identity mapping losses. Since the adversarial and domain classification losses in Eqs. (9), (10), (11), and (12) are defined using cross-entropy measures, we refer to this version of StarGAN as “C-StarGAN” (Fig. 2).
B. Wasserstein StarGAN formulation

Next, we describe the original StarGAN formulation [47]. It is frequently reported that optimization in regular GAN training can often become unstable. It has been shown that using a cross-entropy measure as the minimax objective corresponds to optimizing the Jensen-Shannon (JS) divergence between the real data distribution and the generator's distribution [37]. As discussed in [57], the reason why regular GAN training tends to easily become unstable can be explained by the fact that the JS divergence will be maxed out when the two distributions are distant from each other so that they have disjoint supports. It is probable that this can also happen in StarGAN training when using a cross-entropy measure. With the aim of stabilizing training, the original StarGAN adopts the Wasserstein distance, which provides a meaningful distance metric between two distributions even for those with disjoint supports, instead of the cross-entropy measure as the training objective. By using the Kantorovich-Rubinstein duality theorem [58], a tractable form of the Wasserstein distance between the real speech feature distribution $p_d(\mathbf{y})$ and the distribution of the fake samples generated by the generator $G(\mathbf{x}, k)$, where $\mathbf{x} \sim p_d(\mathbf{x})$ and $k \sim p(k)$, is given by

$$W(G) = \max_{D \in \mathcal{D}} \big\{ \mathbb{E}_{\mathbf{y} \sim p_d(\mathbf{y})}[D(\mathbf{y})] - \mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[D(G(\mathbf{x}, k))] \big\}, \quad (18)$$

where $D$ must lie within the space $\mathcal{D}$ of 1-Lipschitz functions. A 1-Lipschitz function is a differentiable function that has gradients with norm at most 1 everywhere. This Lipschitz constraint is derived as a result of obtaining the above form of the Wasserstein distance [58]. As (18) shows, the computation of the Wasserstein distance requires optimization with respect to a function $D$. Thus, if we describe $D$ using a neural network, the problem of minimizing $W(G)$ with respect to $G$ leads to a minimax game played by $G$ and $D$ similar to regular GAN training, where $D$ plays a similar role to the discriminator. Now, recall that the function $D$ must be 1-Lipschitz. Although there are several ways to constrain $D$, such as the weight clipping technique adopted in [57], one successful and convenient way involves imposing a penalty on the sampled gradients of $D$,

$$R(D) = \mathbb{E}_{\hat{\mathbf{x}} \sim p(\hat{\mathbf{x}})}[(\|\nabla D(\hat{\mathbf{x}})\|_2 - 1)^2], \quad (19)$$

and including it in the training objective [59], where $\nabla$ denotes the gradient operator and $\hat{\mathbf{x}}$ is a sample uniformly drawn along a straight line between a pair of a real and a generated sample. We must also consider incorporating the domain classification loss to encourage $G(\mathbf{x}, k)$ to belong to class $k$ and the cycle-consistency loss to encourage $G(\mathbf{x}, k)$ to preserve the linguistic information in the input $\mathbf{x}$. Overall, the training objectives to be minimized with respect to $G$, $D$, and $C$ become

$$\mathcal{I}_{G}(G) = -\lambda_{\mathrm{adv}} \mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[D(G(\mathbf{x}, k))] + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}^{G}(G) + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}}(G) + \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}}(G), \quad (20)$$
$$\mathcal{I}_{D}(D) = \lambda_{\mathrm{adv}} \mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[D(G(\mathbf{x}, k))] - \lambda_{\mathrm{adv}} \mathbb{E}_{\mathbf{y} \sim p_d(\mathbf{y})}[D(\mathbf{y})] + \lambda_{\mathrm{gp}} \mathbb{E}_{\hat{\mathbf{x}} \sim p(\hat{\mathbf{x}})}[(\|\nabla D(\hat{\mathbf{x}})\|_2 - 1)^2], \quad (21)$$
$$\mathcal{I}_{C}(C) = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}^{C}(C), \quad (22)$$

where $\lambda_{\mathrm{gp}} \ge 0$ weighs the importance of the gradient penalty. We refer to this version of StarGAN as “Wasserstein StarGAN (W-StarGAN)”. It should be noted that the authors of [47] chose to implement $D$ and $C$ as a single multi-task classifier network that simultaneously produces the values $D(\mathbf{x})$ and $p_C(k|\mathbf{x})$ ($k = 1, \ldots, K$) (Fig. 3).
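A minimal sketch of the Wasserstein critic loss with the gradient penalty of (19), following the standard WGAN-GP recipe; the critic D is assumed to return a scalar score per example, and all module names and weight values are placeholders.

```python
import torch

def gradient_penalty(D, y_real, y_fake):
    """Gradient penalty R(D) of Eq. (19), evaluated on points sampled
    uniformly on straight lines between real and generated sequences."""
    alpha = torch.rand(y_real.size(0), *([1] * (y_real.dim() - 1)),
                       device=y_real.device)
    x_hat = (alpha * y_real + (1 - alpha) * y_fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()

def w_stargan_critic_step(G, D, x, y, k_onehot, lam_adv=1.0, lam_gp=10.0):
    """Adversarial parts of Eqs. (20) and (21)."""
    y_fake = G(x, k_onehot)
    loss_D = (lam_adv * (D(y_fake.detach()).mean() - D(y).mean())
              + lam_gp * gradient_penalty(D, y, y_fake.detach()))
    loss_G_adv = -lam_adv * D(y_fake).mean()   # first term of Eq. (20)
    return loss_G_adv, loss_D
```

The remaining terms of (20)–(22) (domain classification, cycle consistency, and identity mapping) are the same as those sketched for C-StarGAN.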
Fig. 3. Illustration of W-StarGAN training. The $D$ and $C$ networks are designed to share lower layers and produce the score that measures how likely $\mathbf{x}$ is to be a real data sample and the probability of $\mathbf{x}$ belonging to each domain.

Fig. 4. Illustration of A-StarGAN training. The $A$ network is designed to produce $2K$ probabilities, where the first and second $K$ probabilities correspond to the real and fake classes, and it simultaneously plays the roles of the real/fake discriminator and the domain classifier.

C. Proposed New StarGAN formulation
With the two StarGAN formulations presented above, the ability of $G$ to appropriately convert its input into a target domain depends on how the decision boundary formed by $C$ develops during training. The domain classification loss can easily be made almost 0 by letting the samples of $G(\mathbf{x}, k)$ resemble, for example, only a few of the real speech samples in domain $k$ near the decision boundary. In such situations, $G$ will have no incentive to attempt to make the generated samples get closer to the rest of the real speech samples distributed in domain $k$. As a result, the conversion effect of the trained $G$ will be limited. One reasonable way to avoid such situations would be to consider additional classes for out-of-distribution samples that do not belong to any of the domains and encourage $G$ not to generate samples belonging to those classes. This idea can be formulated as follows.

First, we unify the real/fake discriminator and the domain classifier into a single multiclass classifier $A$ that outputs $2K$ probabilities $p_A(k|\mathbf{x})$ ($k = 1, \ldots, 2K$), where $k = 1, \ldots, K$ and $k = K+1, \ldots, 2K$ correspond to the real domain classes and the fake classes, respectively. Note that this differs from the multi-task classifier network mentioned above in that $p_A(k|\mathbf{x})$ must now satisfy $\sum_{k=1}^{2K} p_A(k|\mathbf{x}) = 1$. Here, the $K$ fake classes can be seen as the classes for out-of-distribution samples. Next, by using this multiclass classifier, we define

$$\mathcal{L}_{\mathrm{adv}}^{A}(A) = -\mathbb{E}_{k \sim p(k), \mathbf{y} \sim p_d(\mathbf{y}|k)}[\log p_A(k|\mathbf{y})] - \mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log p_A(K+k|G(\mathbf{x}, k))], \quad (23)$$
$$\mathcal{L}_{\mathrm{adv}}^{G}(G) = -\mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log p_A(k|G(\mathbf{x}, k))] + \mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log p_A(K+k|G(\mathbf{x}, k))], \quad (24)$$

as adversarial losses for the classifier $A$ and generator $G$. $\mathcal{L}_{\mathrm{adv}}^{A}(A)$ becomes small when $A$ correctly classifies $\mathbf{y} \sim p_d(\mathbf{y}|k)$ as real speech samples in domain $k$ and $G(\mathbf{x}, k)$ as fake samples in domain $k$, whereas $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ becomes small when $G$ fools $A$ so that $G(\mathbf{x}, k)$ is misclassified by $A$ as real speech samples in domain $k$ and is not classified as fake samples.
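For illustration, (23) and (24) can be written as follows, assuming a placeholder classifier A that outputs logits over the 2K augmented classes, with indices 0..K−1 for the real domains and K..2K−1 for the fake ones; this is a sketch of the loss computation only, not the exact implementation used in the experiments.

```python
import torch
import torch.nn.functional as F

def a_stargan1_adv_losses(G, A, x, y, k, num_domains):
    """Augmented-classifier adversarial losses of Eqs. (23) and (24).

    A(features) returns logits of shape (batch, 2 * num_domains).
    """
    k_onehot = F.one_hot(k, num_domains).float()
    y_fake = G(x, k_onehot)
    idx = torch.arange(k.size(0))

    # Eq. (23): A should classify y as "real domain k" and G(x, k) as
    # "fake domain k" (class index K + k).
    logp_real = F.log_softmax(A(y), dim=1)
    logp_fake_d = F.log_softmax(A(y_fake.detach()), dim=1)
    loss_A = -(logp_real[idx, k].mean()
               + logp_fake_d[idx, k + num_domains].mean())

    # Eq. (24): G is trained so that G(x, k) is classified as "real domain k"
    # and pushed away from "fake domain k".
    logp_fake = F.log_softmax(A(y_fake), dim=1)
    loss_G = -logp_fake[idx, k].mean() + logp_fake[idx, k + num_domains].mean()
    return loss_G, loss_A
```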
We will show below that this minimax game reaches a global optimum when $p_d(\mathbf{y}|k) = p_G(\mathbf{y}|k)$ for $k = 1, \ldots, K$ if both $G$ and $A$ have infinite capacity, where $p_G(\mathbf{y}|k)$ denotes the distribution of $\mathbf{y} = G(\mathbf{x}, k)$ with $\mathbf{x} \sim p_d(\mathbf{x})$. We first consider the optimal classifier $A$ for any given generator $G$.

Proposition 1. For fixed $G$, $\mathcal{L}_{\mathrm{adv}}^{A}(A)$ is minimized when

$$p_A^*(k|\mathbf{y}) = \frac{p(k)\, p_d(\mathbf{y}|k)}{\sum_{k'} p(k')\, p_d(\mathbf{y}|k') + \sum_{k'} p(k')\, p_G(\mathbf{y}|k')}, \quad (25)$$
$$p_A^*(K+k|\mathbf{y}) = \frac{p(k)\, p_G(\mathbf{y}|k)}{\sum_{k'} p(k')\, p_d(\mathbf{y}|k') + \sum_{k'} p(k')\, p_G(\mathbf{y}|k')}, \quad (26)$$

for $k = 1, \ldots, K$.
Proof: By differentiating the Lagrangian

$$L(A, \gamma) = \mathcal{L}_{\mathrm{adv}}^{A}(A) + \int \gamma(\mathbf{y}) \left( \sum_{k=1}^{2K} p_A(k|\mathbf{y}) - 1 \right) d\mathbf{y} \quad (27)$$

with respect to $p_A(k|\mathbf{y})$,

$$\frac{\partial L(A, \gamma)}{\partial p_A(k|\mathbf{y})} =
\begin{cases}
-\dfrac{p(k)\, p_d(\mathbf{y}|k)}{p_A(k|\mathbf{y})} + \gamma(\mathbf{y}) & (1 \le k \le K) \\[6pt]
-\dfrac{p(k-K)\, p_G(\mathbf{y}|k-K)}{p_A(k|\mathbf{y})} + \gamma(\mathbf{y}) & (K+1 \le k \le 2K)
\end{cases}$$

and setting the result to zero, we obtain

$$p_A(k|\mathbf{y}) =
\begin{cases}
p(k)\, p_d(\mathbf{y}|k)/\gamma(\mathbf{y}) & (1 \le k \le K) \\
p(k-K)\, p_G(\mathbf{y}|k-K)/\gamma(\mathbf{y}) & (K+1 \le k \le 2K).
\end{cases} \quad (28)$$

Since $p_A(k|\mathbf{y})$ must sum to unity, the multiplier $\gamma$ must be

$$\gamma(\mathbf{y}) = \sum_{k=1}^{K} p(k)\, p_d(\mathbf{y}|k) + \sum_{k=1}^{K} p(k)\, p_G(\mathbf{y}|k). \quad (29)$$

Substituting (29) into (28) concludes the proof.
Theorem 1. The global optimum of the minimax game is achieved when $p_d(\mathbf{y}|k) = p_G(\mathbf{y}|k)$ for $k = 1, \ldots, K$.
Proof: By substituting (25) and (26) into $\mathcal{L}_{\mathrm{adv}}^{G}(G)$, we can describe it as a function of $G$ only:

$$\mathcal{L}_{\mathrm{adv}}^{G}(G) = -\mathbb{E}_{k \sim p(k), \mathbf{y} \sim p_G(\mathbf{y}|k)}\left[\log \frac{p_A^*(k|\mathbf{y})}{p_A^*(K+k|\mathbf{y})}\right] = \mathbb{E}_{k \sim p(k), \mathbf{y} \sim p_G(\mathbf{y}|k)}\left[\log \frac{p_G(\mathbf{y}|k)}{p_d(\mathbf{y}|k)}\right] = \mathbb{E}_{k \sim p(k)} \mathrm{KL}[p_G(\mathbf{y}|k) \,\|\, p_d(\mathbf{y}|k)], \quad (30)$$

where $\mathrm{KL}[\cdot\|\cdot]$ denotes the Kullback-Leibler (KL) divergence. Obviously, $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ becomes 0 if and only if $p_d(\mathbf{y}|k) = p_G(\mathbf{y}|k)$ for $k = 1, \ldots, K$, thus concluding the proof.

We must also consider incorporating the cycle-consistency and identity mapping losses to encourage $G(\mathbf{x}, k)$ to preserve the linguistic information in the input $\mathbf{x}$, as with the first two formulations. Overall, the training objectives to be minimized with respect to $G$ and $A$ become

$$\mathcal{I}_{G}(G) = \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}}^{G}(G) + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}}(G) + \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}}(G), \quad (31)$$
$$\mathcal{I}_{A}(A) = \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}}^{A}(A). \quad (32)$$

We refer to this formulation as the “augmented multiclass classifier StarGAN (A-StarGAN)” (Fig. 4).

A comparative look at the C-StarGAN [46] and A-StarGAN formulations may provide intuitive insights into the behavior of the A-StarGAN training. Although not explicitly stated, with C-StarGAN, the minimax game played by $G$ and $D$ using (9) and (10) only is shown to correspond to minimizing the JS divergence between $p_G(\mathbf{y}|k)$ and $p_d(\mathbf{y}|k)$. While this minimax game only cares whether $G(\mathbf{x}, k)$ resembles real samples in domain $k$ and does not concern whether $G(\mathbf{x}, k)$ is likely to belong to a different domain $k' \neq k$, A-StarGAN is designed to require $G(\mathbf{x}, k)$ to keep away from all the domains except $k$ by explicitly penalizing $G(\mathbf{x}, k)$ for resembling real samples in domain $k' \neq k$. We expect that this particular mechanism can contribute to enhancing the conversion effect. The domain classification loss given as (12) in C-StarGAN is expected to play this role; however, its effect can be limited for the reason already mentioned. With A-StarGAN, the classifier augmented with the fake classes creates additional decision boundaries, each of which is expected to partition the region of each domain into in-distribution and out-of-distribution regions thanks to the adversarial learning and thus encourage the generator to generate samples that resemble real in-distribution samples only. It should also be noted that in C-StarGAN, when the domain classification loss comes into play, the training objective does not allow for an interpretation of the optimization process as distribution fitting, unlike A-StarGAN. This is also true for the W-StarGAN formulation.

From the above discussion, we can also think of another version of the A-StarGAN formulation, in which the $K$ fake classes are merged into a single fake class (so the classifier $A$ now produces only $K+1$ probabilities) and the adversarial losses for the classifier $A$ and generator $G$ are defined as

$$\mathcal{L}_{\mathrm{adv}}^{A}(A) = -\mathbb{E}_{k \sim p(k), \mathbf{y} \sim p_d(\mathbf{y}|k)}[\log p_A(k|\mathbf{y})] - \mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log p_A(K+1|G(\mathbf{x}, k))], \quad (33)$$
$$\mathcal{L}_{\mathrm{adv}}^{G}(G) = -\mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log p_A(k|G(\mathbf{x}, k))] + \mathbb{E}_{k \sim p(k), \mathbf{x} \sim p_d(\mathbf{x})}[\log p_A(K+1|G(\mathbf{x}, k))]. \quad (34)$$

It should be noted that the minimax game using these losses no longer leads to the minimization of the KL divergence between $p_G(\mathbf{y}|k)$ and $p_d(\mathbf{y}|k)$. However, we still believe it can work reasonably well if the augmented classifier really behaves in the way discussed above.
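Under the same placeholder assumptions as before, the single-fake-class losses (33) and (34) (A-StarGAN2 in the experiments below) differ from the 2K-class sketch only in the index of the fake class:

```python
import torch
import torch.nn.functional as F

def a_stargan2_adv_losses(G, A, x, y, k, num_domains):
    """Adversarial losses of Eqs. (33) and (34): A outputs K+1 logits,
    where index K (0-based) is the single shared fake class."""
    fake_cls = num_domains                      # index of the fake class
    k_onehot = F.one_hot(k, num_domains).float()
    y_fake = G(x, k_onehot)
    idx = torch.arange(k.size(0))

    logp_fake_d = F.log_softmax(A(y_fake.detach()), dim=1)
    loss_A = -(F.log_softmax(A(y), dim=1)[idx, k].mean()
               + logp_fake_d[:, fake_cls].mean())

    logp_fake = F.log_softmax(A(y_fake), dim=1)
    loss_G = -logp_fake[idx, k].mean() + logp_fake[:, fake_cls].mean()
    return loss_G, loss_A
```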
D. Acoustic feature

In this paper, we choose to use mel-cepstral coefficients (MCCs) computed from a spectral envelope obtained using WORLD [60], [61] as the acoustic feature to be converted. Although it would also be interesting to consider directly converting time-domain signals (for example, as in [62]), given the recent significant advances in high-quality neural vocoder systems [49], [63]–[72], we would expect to generate high-quality signals by using a neural vocoder once we could obtain a sufficient set of acoustic features. Such systems can be advantageous in that the model size for the generator can be made small enough to allow the system to run in real time and work well even when a limited amount of training data is available.

At training time, we normalize each element $x_{q,n}$ of the MCC sequence $\mathbf{x}$ to $x_{q,n} \leftarrow (x_{q,n} - \psi_q)/\zeta_q$, where $q$ denotes the dimension index of the MCC sequence, $n$ denotes the frame index, and $\psi_q$ and $\zeta_q$ denote the mean and standard deviation of the $q$-th MCC sequence within all the voiced segments of the training samples of the same speaker.
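A small NumPy sketch of this per-dimension normalization; the voiced-frame masks and the list of a speaker's training MCC sequences are assumed to be available from the analysis stage.

```python
import numpy as np

def fit_mcc_stats(mcc_list, voiced_list):
    """Estimate per-dimension mean/std (psi_q, zeta_q) of a speaker's MCCs,
    using only voiced frames.  mcc_list: list of (Q, N) arrays,
    voiced_list: list of (N,) boolean masks."""
    voiced = np.concatenate(
        [m[:, v] for m, v in zip(mcc_list, voiced_list)], axis=1)
    psi = voiced.mean(axis=1, keepdims=True)      # (Q, 1)
    zeta = voiced.std(axis=1, keepdims=True)      # (Q, 1)
    return psi, zeta

def normalize_mcc(mcc, psi, zeta):
    """Apply x_{q,n} <- (x_{q,n} - psi_q) / zeta_q."""
    return (mcc - psi) / zeta
```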
E. Conversion process

After training $G$, we can convert the acoustic feature sequence $\mathbf{x}$ of an input utterance with

$$\hat{\mathbf{y}} = G(\mathbf{x}, k), \quad (35)$$

where $k$ denotes the target domain. Once $\hat{\mathbf{y}}$ has been obtained, we adjust the mean and variance of the generated feature sequence so that they match the pretrained mean and variance of the feature vectors of the target speaker. We can then generate a time-domain signal using the WORLD vocoder or any recently developed neural vocoder [49], [63]–[74].
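A sketch of this conversion step, including the mean and variance matching of the generated MCCs to the target speaker's pretrained statistics; the trained generator, the speaker statistics, and the synthesis back end are placeholders, and the exact normalization scheme shown here is our own assumption.

```python
import numpy as np
import torch

def convert_mcc(G, mcc_src, src_stats, trg_stats, k, num_domains):
    """Convert a (Q, N) MCC sequence to target domain k with Eq. (35),
    then match the output statistics to the target speaker's."""
    psi_s, zeta_s = src_stats
    psi_t, zeta_t = trg_stats

    x = torch.from_numpy((mcc_src - psi_s) / zeta_s).float().unsqueeze(0)
    k_onehot = torch.nn.functional.one_hot(
        torch.tensor([k]), num_domains).float()
    with torch.no_grad():
        y_hat = G(x, k_onehot).squeeze(0).numpy()

    # Mean/variance adjustment towards the target speaker's statistics.
    y_hat = (y_hat - y_hat.mean(axis=1, keepdims=True)) \
            / (y_hat.std(axis=1, keepdims=True) + 1e-8)
    return y_hat * zeta_t + psi_t
```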
F. Network architectures

The architectures of all the networks are detailed in Figs. 5–9. As detailed below, $G$ is designed to take an acoustic feature sequence as an input and output an acoustic feature sequence of the same length so as to learn conversion rules that capture time dependencies. Similarly, $D$, $C$, and $A$ are designed to take acoustic feature sequences as inputs and generate sequences of probabilities. There are two ways to incorporate the class index $k$ into $G$ or $D$. One is to simply represent it as a one-hot vector and append it to the input of each layer. The other is to retrieve a continuous vector given $k$ from a dictionary of embeddings and append it to each layer input, as in our previous work [52], [54]. In the following, we adopt the former, though both performed almost the same. As detailed in Figs. 5–9, all the networks are designed as fully convolutional architectures using gated linear units (GLUs) [75]. The output of a GLU is defined as $\mathrm{GLU}(\mathbf{X}) = \mathbf{X}_1 \odot \mathrm{sigmoid}(\mathbf{X}_2)$, where $\mathbf{X}$ is the input, $\mathbf{X}_1$ and $\mathbf{X}_2$ are equally sized arrays into which $\mathbf{X}$ is split along the channel dimension, and $\mathrm{sigmoid}$ is a sigmoid gate function. Similar to long short-term memory units, GLUs can reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities.
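As a small illustration, a GLU over the channel dimension can be written as follows (PyTorch also provides this operation as torch.nn.functional.glu); the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    """1D convolution followed by a gated linear unit:
    the conv outputs 2*C channels, which are split into X1 and X2,
    and the layer returns X1 * sigmoid(X2)."""
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                # x: (batch, in_ch, length)
        x1, x2 = self.conv(x).chunk(2, dim=1)
        return x1 * torch.sigmoid(x2)    # equivalently F.glu(..., dim=1)
```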
Generator: As described in Figs. 5 and 6, we use a 2D CNN or a 1D CNN that takes an acoustic feature sequence $\mathbf{x}$ as an input to design $G$, where $\mathbf{x}$ is treated as an image of size $Q \times N$ with 1 channel in the 2D case or as a signal sequence of length $N$ with $Q$ channels in the 1D case.
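The following sketch illustrates the one-hot conditioning described above for the 1D case: the class vector is repeated along the time axis and concatenated to the layer input along the channel axis. It is an illustrative layer, not the exact architecture of Figs. 5 and 6.

```python
import torch
import torch.nn as nn

class ConditionalGLUConv1d(nn.Module):
    """GLU convolution whose input is augmented with a repeated
    one-hot domain code along the channel dimension."""
    def __init__(self, in_ch, out_ch, num_domains, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch + num_domains, 2 * out_ch,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x, k_onehot):
        # x: (batch, in_ch, N), k_onehot: (batch, num_domains)
        cond = k_onehot.unsqueeze(2).expand(-1, -1, x.size(2))
        h = self.conv(torch.cat([x, cond], dim=1))
        h1, h2 = h.chunk(2, dim=1)
        return h1 * torch.sigmoid(h2)
```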
Real/Fake Discriminator: We leverage the idea of PatchGANs [76] to design a real/fake discriminator or a Lipschitz-continuous function $D$, which assigns a probability or a score to each local segment of an input feature sequence, indicating whether it is real or fake. More specifically, $D$ takes an acoustic feature sequence $\mathbf{y}$ as an input and produces a sequence of probabilities (with C-StarGAN) or scores (with W-StarGAN) that measure how likely each segment of $\mathbf{y}$ is to be real speech features. With C-StarGAN, the final output of $D$ is given by the product of all these probabilities, and with W-StarGAN, the final output of $D$ is given by the sum of all these scores.
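One way to realize this aggregation is sketched below; computing the product in the log domain is our own choice, made purely for numerical stability.

```python
import torch

def aggregate_patch_outputs(patch_out, mode):
    """Combine per-segment discriminator outputs into one value per example:
    the product of probabilities (C-StarGAN) or the sum of scores
    (W-StarGAN).  patch_out: tensor of shape (batch, num_segments)."""
    if mode == "product":   # C-StarGAN: per-segment probabilities in (0, 1)
        return patch_out.clamp_min(1e-12).log().sum(dim=1).exp()
    return patch_out.sum(dim=1)   # W-StarGAN: unbounded per-segment scores
```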
Domain Classifier/Augmented Classifier: We also design the domain classifier $C$ and the augmented classifier $A$ so that each of them takes an acoustic feature sequence $\mathbf{y}$ as an input and produces a sequence of class probability distributions that measure how likely each segment of $\mathbf{y}$ is to belong to domain $k$. The final output of $p_C(k|\mathbf{y})$ or $p_A(k|\mathbf{y})$ is given by the product of all these distributions.

V. EXPERIMENTS
A. Datasets
To confirm the effects of the proposed StarGAN formulations, we conducted objective and subjective evaluation experiments involving a non-parallel speaker identity conversion task. For the experiments, we used two datasets: One is the CMU ARCTIC database [77], which consists of recordings of two female US English speakers (‘clb’ and ‘slt’) and two male US English speakers (‘bdl’ and ‘rms’) sampled at 16,000 Hz. The other is the Voice Conversion Challenge (VCC) 2018 dataset [78], which consists of recordings of six female and six male US English speakers sampled at 22,050 Hz. From the VCC2018 dataset, we selected two female speakers (‘SF1’ and ‘SF2’) and two male speakers (‘SM1’ and ‘SM2’). Thus, for each dataset, there were $K = 4$ speakers, and so in total there were twelve different combinations of source and target speakers.
1) The CMU ARCTIC Dataset:
The CMU ARCTIC dataset consisted of four speakers, each reading the same 1,132 short sentences. For each speaker, we used the first 1,000 and the latter 132 sentences for training and evaluation, respectively. To simulate a non-parallel training scenario, we divided the first 1,000 sentences equally into four groups and used only the first, second, third, and fourth groups for speakers clb, bdl, slt, and rms, respectively, so as not to use the same sentences between different speakers. The training utterances of speakers clb, bdl, slt, and rms were about 12, 11, 11, and 14 minutes long in total, respectively. For each utterance, we extracted a spectral envelope, a logarithmic fundamental frequency (log $F_0$), and aperiodicities (APs) every 8 ms using the WORLD analyzer [60], [61]. We then extracted $Q = 28$ MCCs from each spectral envelope using the Speech Processing Toolkit (SPTK) [79].
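As an illustration of this analysis step, the following sketch uses the pyworld and pysptk Python bindings (our assumption; the paper only specifies WORLD and SPTK) to extract log F0, the spectral envelope, aperiodicities, and MCCs from a waveform; the warping parameter alpha is an assumed value for 16 kHz speech.

```python
import numpy as np
import pyworld   # assumed WORLD Python binding
import pysptk    # assumed SPTK Python binding

def extract_features(wav, fs=16000, frame_period=8.0, mcc_order=27, alpha=0.455):
    """Extract log F0, spectral envelope, aperiodicities, and MCCs.
    mcc_order=27 gives Q = 28 coefficients (0th included)."""
    wav = wav.astype(np.float64)
    f0, t = pyworld.harvest(wav, fs, frame_period=frame_period)
    sp = pyworld.cheaptrick(wav, f0, t, fs)       # spectral envelope
    ap = pyworld.d4c(wav, f0, t, fs)              # aperiodicities
    mcc = pysptk.sp2mc(sp, mcc_order, alpha)      # (N, Q) mel-cepstra
    log_f0 = np.log(np.where(f0 > 0, f0, 1.0))    # log F0 on voiced frames
    return mcc.T, log_f0, ap, (f0 > 0)            # MCCs as (Q, N) + voiced mask
```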
2) The VCC2018 Dataset:
The subset of the VCC2018 dataset consisted of four speakers, each reading the same 116 short sentences (about 7 minutes long in total). For each speaker, we used the first 81 and the latter 35 sentences (about 5 and 2 minutes long in total) for training and evaluation, respectively. Although we could actually construct a parallel corpus using this dataset, we took care not to take advantage of it, to simulate a non-parallel training scenario. For each utterance, we extracted a spectral envelope, a log $F_0$, APs, and $Q = 36$ MCCs every 5 ms using the WORLD analyzer [60], [61] and the SPTK [79] in the same manner.

For both datasets, the $F_0$ contours were converted using the logarithm Gaussian normalized transformation described in [80]. The APs were used directly without modification. The signals of the converted speech were obtained using the methods described in Section IV-E.
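The logarithm Gaussian normalized transformation of [80] linearly matches the log F0 statistics of the source speaker to those of the target; a sketch, where the speaker means and standard deviations are assumed to have been estimated on voiced frames of the training data.

```python
import numpy as np

def convert_f0(f0_src, mean_src, std_src, mean_trg, std_trg):
    """Logarithm Gaussian normalized F0 transformation:
    log F0 is shifted/scaled so that its mean and variance match the
    target speaker's; unvoiced frames (f0 == 0) are left untouched."""
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_conv[voiced] = np.exp(
        (log_f0 - mean_src) / std_src * std_trg + mean_trg)
    return f0_conv
```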
B. Baseline Methods

We chose the VAE-based [22] and VAEGAN-based [23] non-parallel VC methods and our previously proposed CycleGAN-VC [26] for comparison. In CycleGAN-VC, we used the same network architectures shown in Figs. 5–7 to design the generator and discriminator. To clarify how close the proposed method can get to the performance achieved by one of the best-performing parallel VC methods, we also chose a GMM-based open-source method called “sprocket” [81] for comparison. This method was used as a baseline in the VCC2018 [78]. Note that since sprocket is a parallel VC method, we tested it only on the VCC2018 dataset. To run these methods, we used the source code provided by the authors [82]–[84].
C. Hyperparameter Settings
In the following, we use the abbreviations A-StarGAN1 and A-StarGAN2 to indicate the A-StarGAN formulations using (23) and (24) and using (33) and (34) as the adversarial losses, respectively. Hence, there were four different versions of the StarGAN formulations (namely, C-StarGAN, W-StarGAN, A-StarGAN1, and A-StarGAN2) considered for comparison.

All the networks were trained simultaneously with random initialization. Adam optimization [85] was used for model training, where the mini-batch size was 16. The settings of the regularization parameters $\lambda_{\mathrm{adv}}$, $\lambda_{\mathrm{cls}}$, $\lambda_{\mathrm{cyc}}$, $\lambda_{\mathrm{id}}$, and $\lambda_{\mathrm{gp}}$, the learning rates $\alpha_G$ and $\alpha_{D/C}$ for the generator and the discriminator/classifier, and the iteration number $I$ are listed in Tab. I. For CycleGAN and all the StarGAN versions, the exponential decay rate for the first moment was set at 0.9 for the generator and at 0.5 for the discriminator and classifier. Fig. 10 shows the learning curves of the four StarGAN versions under the above settings. We performed batch normalization in training mode also at test time.

Fig. 5. Network architectures of the generator designed using 2D convolution layers. Here, the input and output of each layer are interpreted as images, where “h”, “w”, and “c” denote the height, width, and channel number, respectively. “Conv2d”, “BatchNorm”, “GLU”, and “Deconv2d” denote 2D convolution, batch normalization, gated linear unit, and 2D transposed convolution layers, respectively. Batch normalization is applied per channel and per height of the input. “k”, “c”, and “s” denote the kernel size, output channel number, and stride size of a convolution layer, respectively. The class index, represented as a one-hot vector, is concatenated to the input of each convolution layer along the channel direction after being repeated along the height and width directions so that it has a shape compatible with the input.

Fig. 6. Network architectures of the generator designed using 1D convolution layers. Here, the input and output of the generator are interpreted as signal sequences, where “l” and “c” denote the length and channel number, respectively. “Conv1d”, “BatchNorm”, “GLU”, and “Deconv1d” denote 1D convolution, batch normalization, gated linear unit, and 1D transposed convolution layers, respectively. Batch normalization is applied per channel of the input. The class index vector is concatenated to the input of each convolution layer after being repeated along the time direction.

Fig. 7. Network architectures of the conditional discriminator in C-StarGAN. “Sigmoid” denotes an element-wise sigmoid function.

Fig. 8. Network architectures of the multi-task classifier in W-StarGAN. “Softmax” denotes a softmax function applied to the channel dimension.

Fig. 9. Network architectures of the classifier in C-StarGAN and A-StarGAN. The output channel number L is set to K for the domain classifier in C-StarGAN, 2K for the augmented classifier in A-StarGAN1, and K+1 for the augmented classifier in A-StarGAN2, respectively. The channel number M in the intermediate layers is set to 16 for the domain classifier in C-StarGAN and 64 for the augmented classifier in A-StarGAN1 and A-StarGAN2, respectively.
D. Objective Performance Measure
In each dataset, the test set consists of speech samples of each speaker reading the same sentences. Here, we used the average of the mel-cepstral distortions (MCDs) taken along the dynamic time warping (DTW) path between the converted and target feature sequences as the objective performance measure for each test utterance.
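For reference, a self-contained sketch of this measure: the mel-cepstral distortion between two aligned frames is (10/ln 10)·sqrt(2·Σ_d Δmc_d²), computed here over coefficients 1 to Q−1 (excluding the 0th, energy-related coefficient, a common convention that we assume) and averaged along a DTW path.

```python
import numpy as np

def frame_mcd(a, b):
    """MCD (dB) between two MCC frames, excluding the 0th coefficient."""
    diff = a[1:] - b[1:]
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))

def mcd_dtw(conv, target):
    """Average MCD along the DTW path between converted and target
    MCC sequences, each of shape (Q, N)."""
    n, m = conv.shape[1], target.shape[1]
    d = np.array([[frame_mcd(conv[:, i], target[:, j]) for j in range(m)]
                  for i in range(n)])
    # Standard DTW accumulation.
    acc = np.full((n, m), np.inf)
    acc[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if (i > 0 and j > 0) else np.inf)
            acc[i, j] = d[i, j] + prev
    # Backtrack to recover the optimal path length for averaging.
    i, j, steps = n - 1, m - 1, 1
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        steps += 1
    return acc[n - 1, m - 1] / steps
```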
TABLE I. Hyperparameter settings.

Fig. 10. Training loss curves of (a) C-StarGAN, (b) W-StarGAN, (c) A-StarGAN1, and (d) A-StarGAN2.

TABLE II. MCD comparisons of different network configurations (1D vs. 2D CNN) of G on the CMU ARCTIC dataset.

TABLE III. MCD comparisons with baseline methods (VAE, VAEGAN, CycleGAN) on the CMU ARCTIC dataset.

TABLE IV. MCD comparisons with baseline methods (non-parallel: VAE, VAEGAN, CycleGAN; parallel: sprocket) on the VCC2018 dataset.

E. Objective Evaluations
First, we evaluated the performance of each StarGAN version with different network configurations of $G$. The detailed settings for these configurations are shown in Figs. 5 and 6. The network architectures of the conditional discriminator and domain classifier in C-StarGAN, the multi-task classifier in W-StarGAN, and the augmented classifier in A-StarGAN are shown in Figs. 7–9. Tab. II shows the average MCDs with 95% confidence intervals obtained with these network configurations. The results show that the CycleGAN, C-StarGAN, W-StarGAN, A-StarGAN1, and A-StarGAN2 methods performed better with $G$ designed using 1D-, 2D-, 2D-, 1D-, and 1D-CNNs, respectively. In the following, we only present the results obtained with these configurations.

Tabs. III and IV show the MCDs obtained with the proposed and baseline methods. As the results show, W-StarGAN and A-StarGAN1 performed best and next best on the CMU ARCTIC dataset, and A-StarGAN2 and A-StarGAN1 performed best and next best of all the non-parallel methods on the VCC2018 dataset. All the StarGAN versions performed consistently better than CycleGAN. Since both CycleGAN and C-StarGAN use the cross-entropy measure to define the adversarial losses, the superiority of C-StarGAN over CycleGAN reflects the effect of the many-to-many extension. Now, let us turn to the comparisons of the four StarGAN versions. From the results, we can see that W-StarGAN performed better than C-StarGAN on both datasets, revealing the advantage of the training objective defined using the Wasserstein distance with the gradient penalty. We also confirmed that A-StarGAN1&2 performed even better than W-StarGAN on the VCC2018 dataset, though they performed slightly worse on the CMU ARCTIC dataset. We also confirmed that none of the StarGAN versions could yield higher performance than sprocket. Given the fact that sprocket had the advantage of using parallel data for model training, we consider the current result to be promising, since the proposed methods are already advantageous in that they can be applied in non-parallel training scenarios.

Balancing the learning of the players in a minimax game is essential in the GAN framework. The probabilities of the feature sequence converted from each test sample being real and being produced by the target speaker may provide an indication of how successfully the generator, discriminator, and classifier have been trained in a balanced manner. Tabs. V and VI show the mean outputs of the discriminator and classifier of C-StarGAN, W-StarGAN, and A-StarGAN1&2 at test time. Note that since the discriminator in W-StarGAN produces scores (instead of probabilities), which are not straightforward to interpret, we have omitted them in Tab. V. As for the augmented classifier in A-StarGAN1&2, if we use $p_{k,n}$ to denote an element of the classifier output corresponding to the probability of the classifier input belonging to class $k$ at segment $n$, the values $\sum_{k=1}^{K} p_{k,n}$ and $p_{k,n}/\sum_{k'=1}^{K} p_{k',n}$ correspond to the probabilities of the classifier input being real and being produced by speaker $k$, respectively, at that segment. The means of these values are shown in Tabs. V and VI. As Tabs. V and VI indicate, the generators in all the StarGAN versions were successful in confusing the discriminator and making the classifier believe that the feature sequences converted from the test samples were produced by the target speaker.

The modulation spectra of MCC sequences are known to be quantities that are closely related to the perceived quality and naturalness of speech [86].
Fig. 11 shows an example of the average modulation spectra of the converted MCC sequences obtained with the proposed and baseline methods, along with those of the real speech of the target speaker. As this example shows, the modulation spectra obtained with the CycleGAN-based method and all the StarGAN-based methods were relatively closer to those of real speech than the VAE-based and VAEGAN-based methods over the entire frequency range, thanks to the adversarial training strategy.

F. Subjective Listening Tests
We conducted mean opinion score (MOS) tests to compare the speech quality and speaker similarity of the converted speech samples obtained with the proposed and baseline methods. For these tests, we used the CMU ARCTIC dataset. 24 listeners (including 22 native Japanese speakers) participated in both tests. The tests were conducted online, where each participant was asked to use headphones in a quiet environment.

In the speech quality test, we included speech samples synthesized in the same way as with the proposed and baseline methods (namely, with the WORLD synthesizer) using the acoustic features directly extracted from real speech samples. Hence, the scores of these samples are expected to show the upper limit of the performance. Speech samples were presented in random orders to eliminate bias as regards the order of the stimuli. Each listener was asked to evaluate the naturalness by selecting 5: Excellent, 4: Good, 3: Fair, 2: Poor, or 1: Bad for each utterance. The results are shown in Fig. 12. As the results show, A-StarGAN1 performed slightly better than W-StarGAN and A-StarGAN2 (although the differences were not significant) and significantly better than C-StarGAN and the VAE and VAEGAN methods. However, it also became clear that the speech quality obtained with all the methods tested here was still perceptually distinguishable from real speech samples.

In the speaker similarity test, each listener was given a converted speech sample and a real speech sample of the corresponding target speaker and was asked to evaluate how likely they were to have been produced by the same speaker by selecting 5: Definitely, 4: Likely, 3: Fair, 2: Not very likely, or 1: Unlikely. The results are shown in Fig. 13. As can be seen from the results, the W-StarGAN and A-StarGAN formulations performed comparably to each other and showed significantly better conversion ability than the remaining four methods.

TABLE V. REAL/FAKE DISCRIMINATION ACCURACY (%)

Source  Target  C-StarGAN  A-StarGAN1  A-StarGAN2
clb     bdl     63.19      48.10       41.80
clb     slt     76.56      47.42       28.43
clb     rms     16.36      51.51       40.43
bdl     clb     17.73      46.77       20.49
bdl     slt     87.83      47.64       29.46
bdl     rms      3.93      51.00       38.14
slt     clb     22.56      47.82       22.37
slt     bdl     69.25      48.42       40.71
slt     rms      9.56      50.53       33.65
rms     clb     28.81      48.48       25.74
rms     bdl     60.52      48.60       40.41
rms     slt     72.18      47.53       29.19
All pairs       44.04      48.65       32.56

TABLE VI. SPEAKER CLASSIFICATION ACCURACY (%)

Source  Target  C-StarGAN  W-StarGAN  A-StarGAN1  A-StarGAN2
clb     bdl     96.07      99.99      99.83       98.58
clb     slt     96.29      99.70      96.92       80.19
clb     rms     93.29      99.97      99.97       92.34
bdl     clb     94.23      99.87      99.38       87.90
bdl     slt     96.22      98.63      99.48       86.98
bdl     rms     90.91      99.98      99.93       93.43
slt     clb     80.78      99.38      98.78       93.66
slt     bdl     96.15      99.99      99.72       93.94
slt     rms     92.24      99.99      99.90       90.95
rms     clb     86.04      98.55      99.42       92.81
rms     bdl     95.65      99.99      98.51       95.74
rms     slt     99.98      99.99      99.36       89.11
All pairs       93.15      99.67      99.27       91.30

Fig. 11. Average modulation spectra of the 5th, 10th, and 20th dimensions of the converted MCC sequences obtained with the baseline methods and the StarGAN-based methods.

Fig. 12. Results of the MOS test for speech quality.

Fig. 13. Results of the MOS test for speaker similarity.
VI. CONCLUSIONS
In this paper, we proposed a method that allows non-parallel multi-domain VC based on StarGAN. We described three formulations of StarGAN and compared them and several baseline methods in a non-parallel speaker identity conversion task. Through objective evaluations, we confirmed that our method was able to convert speaker identities reasonably well using only several minutes of training examples. Interested readers are referred to [87], [88] for our investigations of other network architecture designs and improved techniques for CycleGAN-VC and StarGAN-VC.
One limitation of the proposed method is that it can only convert input speech to the voice of a speaker seen in a given training set. This owes to the fact that the one-hot encoding (or a simple embedding) used for speaker conditioning does not generalize to unseen speakers. An interesting topic for future work includes developing a zero-shot VC system that can convert input speech to the voice of an unseen speaker by looking at only a few of his/her utterances. As in the recent work [89], one possible way to achieve this involves using a speaker embedding pretrained based on a metric learning framework for speaker conditioning.
ACKNOWLEDGMENT
This work was supported by JSPS KAKENHI 17H01763 and JST CREST Grant Number JPMJCR19A3, Japan.
REFERENCES

[1] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1998, pp. 285–288.
[2] A. B. Kain, J.-P. Hosom, X. Niu, J. P. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Communication, vol. 49, no. 9, pp. 743–759, 2007.
[3] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Communication, vol. 54, no. 1, pp. 134–146, 2012.
[4] Z. Inanoglu and S. Young, “Data-driven emotion conversion in spoken English,” Speech Communication, vol. 51, no. 3, pp. 268–283, 2009.
[5] O. Türk and M. Schröder, “Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 965–973, 2010.
[6] T. Toda, M. Nakagiri, and K. Shikano, “Statistical voice conversion techniques for body-conducted unvoiced speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, pp. 2505–2517, 2012.
[7] P. Jax and P. Vary, “Artificial bandwidth extension of speech signals using MMSE estimation based on a hidden Markov model,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003, pp. 680–683.
[8] D. Felps, H. Bortfeld, and R. Gutierrez-Osuna, “Foreign accent conversion in computer assisted pronunciation training,” Speech Communication, vol. 51, no. 10, pp. 920–932, 2009.
[9] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[10] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[11] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, “Voice conversion using partial least squares regression,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 912–921, 2010.
[12] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[13] S. H. Mohammadi and A. Kain, “Voice conversion using deep neural networks with speaker-independent pre-training,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 19–23.
[14] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4869–4873.
[15] Y. Saito, S. Takamichi, and H. Saruwatari, “Voice conversion using input-to-output highway networks,” IEICE Transactions on Information and Systems, vol. E100-D, no. 8, pp. 1925–1928, 2017.
[16] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 1283–1287.
[17] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[18] T. Nakashika, T. Takiguchi, and Y. Ariki, “Voice conversion based on speaker-dependent restricted Boltzmann machines,” IEICE Transactions on Information and Systems, vol. 97, no. 6, pp. 1403–1410, 2014.
[19] T. Nakashika, T. Takiguchi, and Y. Ariki, “High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2014, pp. 2278–2282.
[20] T. Nakashika, T. Takiguchi, and Y. Ariki, “Parallel-data-free, many-to-many voice conversion using an adaptive restricted Boltzmann machine,” in Proc. MLSP, 2015.
[21] M. Blaauw and J. Bonada, “Modeling and transforming speech using variational autoencoders,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2016, pp. 1770–1774.
[22] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2016, pp. 1–6.
[23] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 3364–3368.
[24] F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN-based approach to voice conversion without parallel training sentences,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2016, pp. 287–291.
[25] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 5535–5539.
[26] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv:1711.11293 [stat.ML], Nov. 2017.
[27] A. van den Oord and O. Vinyals, “Neural discrete representation learning,” in Adv. Neural Information Processing Systems (NIPS), 2017, pp. 6309–6318.
[28] T. Hashimoto, H. Uchida, D. Saito, and N. Minematsu, “Parallel-data-free many-to-many voice conversion based on DNN integrated with eigenspace using a non-parallel speech corpus,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017.
[29] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5274–5278.
[30] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, 2019.
[31] R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Transactions on Information and Systems, vol. E96-A, no. 10, pp. 1946–1953, 2013.
[32] Z. Wu, T. Virtanen, E. S. Chng, and H. Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1506–1521, 2014.
[33] B. Sisman, M. Zhang, and H. Li, “Group sparse representation with WaveNet vocoder adaptation for spectrum and prosody conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 6, pp. 1085–1097, 2019.
[34] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[35] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in
Proc. International Conference on Learning Representations (ICLR) ,2014.[36] D. P. Kingma, D. J. Rezendey, S. Mohamedy, and M. Welling, “Semi-supervised learning with deep generative models,” in
Adv. NeuralInformation Processing Systems (NIPS) , 2014, pp. 3581–3589.[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Adv. Neural Information Processing Systems (NIPS) , 2014, pp. 2672–2680.[38] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, andK. Kashino, “Generative adversarial network-based postfilter for sta-tistical parametric speech synthesis,” in
Proc. International Conferenceon Acoustics, Speech, and Signal Processing (ICASSP) , 2017, pp. 4910–4914.[39] Y. Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speechsynthesis incorporating generative adversarial networks,”
IEEE/ACMTransactions on Audio, Speech, and Language Processing , vol. 26, no. 1,pp. 84–96, Jan. 2018.[40] S. Pascual, A. Bonafonte, and J. Serr´a, “SEGAN: Speech enhancementgenerative adversarial network,” arXiv:1703.09452 [cs.LG] , Mar. 2017. [41] T. Kaneko, S. Takaki, H. Kameoka, and J. Yamagishi, “Generativeadversarial network-based postfilter for STFT spectrograms,” in Proc.Annual Conference of the International Speech Communication Associ-ation (Interspeech) , 2017, pp. 3389–3393.[42] K. Oyamada, H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and H. Ando,“Generative adversarial network-based approach to signal reconstructionfrom magnitude spectrograms,” arXiv:1804.02181 [eess.SP] , Apr. 2018.[43] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in
Proc.International Conference on Computer Vision (ICCV) , 2017, pp. 2223–2232.[44] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discovercross-domain relations with generative adversarial networks,” in
Proc.International Conference on Machine Learning (ICML) , 2017, pp. 1857–1865.[45] Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsuperviseddual learning for image-to-image translation,” in
Proc. InternationalConference on Computer Vision (ICCV) , 2017, pp. 2849–2857.[46] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarialnetworks,” in
Proc. IEEE Spoken Language Technology Workshop (SLT) ,2018, pp. 266–273.[47] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN:Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv:1711.09020 [cs.CV] , Nov. 2017.[48] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther,“Autoencoding beyond pixels using a learned similarity metric,” arXiv:1512.09300 [cs.LG] , Dec. 2015.[49] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet:A generative model for raw audio,” arXiv:1609.03499 [cs.SD] , Sep.2016.[50] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals,K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo,F. Stimberg, N. Casagrande, D. G., S. Noury, S. Dieleman, E. Elsen,N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, andD. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” arXiv:1711.10433 [cs.LG] , Nov. 2017.[51] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “AttS2S-VC:Sequence-to-sequence voice conversion with attention and contextpreservation mechanisms,” in
Proc. International Conference on Acous-tics, Speech, and Signal Processing (ICASSP) , 2019, pp. 6805–6809.[52] H. Kameoka, K. Tanaka, D. Kwa´sny, and N. Hojo, “ConvS2S-VC:Fully convolutional sequence-to-sequence voice conversion,”
IEEE/ACMTransactions on Audio, Speech, and Language Processing , vol. 28, pp.1849–1863, 2020.[53] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda,“Voice transformer network: Sequence-to-sequence voice conversionusing transformer with text-to-speech pretraining,” arXiv:1912.06813[eess.AS] , Dec. 2019.[54] H. Kameoka, W.-C. Huang, K. Tanaka, T. Kaneko, N. Hojo, andT. Toda, “Many-to-many voice transformer network,” arXiv:2005.08445[eess.AS] , 2020.[55] J.-X. Zhang, Z.-H. Ling, and L.-R. Dai, “Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speakerrepresentations,” arXiv:1906.10508 [eess.AS] , 2019.[56] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda,“Non-parallel voice conversion with cyclic variational autoencoder,” in
Proc. Annual Conference of the International Speech CommunicationAssociation (Interspeech) , 2019, pp. 674–678.[57] M. Arjovsky and L. Bottou, “Towards principled methods for traininggenerative adversarial networks,” in
Proc. International Conference onLearning Representations (ICLR) , 2017.[58] C. Villani,
Optimal Transport: Old and New , ser. Grundlehren der mathe-matischen Wissenschaften. Springer Berlin Heidelberg, 2008. [Online].Available: https://books.google.co.jp/books?id=hV8o5R7 5tkC[59] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville,“Improved training of wasserstein gans,” in
Adv. Neural InformationProcessing Systems (NIPS) , 2017, pp. 5769–5779.[60] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-basedhigh-quality speech synthesis system for real-time applications,”
IEICETransactions on Information and Systems , vol. E99-D, no. 7, pp. 1877–1884, 2016.[61] https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder.[62] J. Serr`a, S. Pascual, and C. Segura, “Blow: A single-scale hy-perconditioned flow for non-parallel raw-audio voice conversion,” arXiv:1906.00794 [cs.LG] , Jun. 2019. [63] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda,“Speaker-dependent WaveNet vocoder,” in
Proc. Annual Conferenceof the International Speech Communication Association (Interspeech) ,2017, pp. 1118–1122.[64] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande,E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, andK. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv:1802.08435[cs.SD] , Feb. 2018.[65] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo,A. Courville, and Y. Bengio, “SampleRNN: An unconditional end-to-end neural audio generation model,” arXiv:1612.07837 [cs.SD] , Dec.2016.[66] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “FFTNet: A real-timespeaker-dependent neural vocoder,” in
Proc. International Conferenceon Acoustics, Speech, and Signal Processing (ICASSP) , 2018, pp. 2251–2255.[67] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals,K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo,F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen,N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, andD. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” arXiv:1711.10433 [cs.LG] , Nov. 2017.[68] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation inend-to-end text-to-speech,” arXiv:1807.07281 [cs.CL] , Feb. 2019.[69] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-basedgenerative network for speech synthesis,” arXiv:1811.00002 [cs.SD] ,Oct. 2018.[70] S. Kim, S. Lee, J. Song, and S. Yoon, “FloWaveNet: A generative flowfor raw audio,” arXiv:1811.02155 [cs.SD] , Nov. 2018.[71] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” arXiv:1810.11946 [eess.AS] , Oct. 2018.[72] K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, “Synthetic-to-natural speech waveform conversion using cycle-consistent adversarialnetworks,” in
Proc. IEEE Spoken Language Technology Workshop (SLT) ,2018, pp. 632–639.[73] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo,A. de Brebisson, Y. Bengio, and A. Courville, “MelGAN: Generativeadversarial networks for conditional waveform synthesis,” in
Adv. NeuralInformation Processing Systems (NeurIPS) , 2019, pp. 14 910–14 921.[74] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fastwaveform generation model based on generative adversarial networkswith multi-resolution spectrogram,” in
Proc. International Conferenceon Acoustics, Speech, and Signal Processing (ICASSP) , 2020, pp. 6199–6203.[75] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modelingwith gated convolutional networks,” in
Proc. International Conferenceon Machine Learning (ICML) , 2017, pp. 933–941.[76] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translationwith conditional adversarial networks,” in
Proc. CVPR , 2017.[77] J. Kominek and A. W. Black, “The CMU Arctic speech databases,” in
Proc. ISCA Speech Synthesis Workshop (SSW) , 2004, pp. 223–224.[78] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicen-cio, T. Kinnunen, and Z. Ling, “The voice conversion challenge2018: Promoting development of parallel and nonparallel methods,” arXiv:1804.04262 [eess.AS] , Apr. 2018.[79] https://github.com/r9y9/pysptk.[80] K. Liu, J. Zhang, and Y. Yan, “High quality voice conversion throughphoneme-based linear mapping functions with STRAIGHT for man-darin,” in
Proc. International Conference on Fuzzy Systems and Knowl-edge Discovery (FSKD) , 2007, pp. 410–414.[81] K. Kobayashi and T. Toda, “sprocket: Open-source voice conversionsoftware,” in
Proc. Odyssey , 2018, pp. 203–210.[82] https://github.com/JeremyCCHsu/vae-npvc, (Accessed on 01/25/2019).[83] https://github.com/JeremyCCHsu/vae-npvc/tree/vawgan, (Accessed on01/25/2019).[84] https://github.com/k2kobayashi/sprocket, (Accessed on 01/28/2019).[85] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,”in
Proc. International Conference on Learning Representations (ICLR) ,2015.[86] S. Takamichi, T. Toda, A. W. Black, G. Neubig, S. Sakti, and S. Naka-mura, “Post-filters to modify the modulation spectrum for statisti-cal parametric speech synthesis,”
IEEE/ACM Transactions on Audio,Speech, and Language Processing , vol. 24, no. 4, pp. 755–767, 2016.[87] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “CycleGAN-VC2:Improved cyclegan-based non-parallel voice conversion,” in
Proc. In- ternational Conference on Acoustics, Speech, and Signal Processing(ICASSP) , 2019, pp. 6820–6824.[88] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “StarGAN-VC2:Rethinking conditional methods for stargan-based voice conversion,” in Proc. Annual Conference of the International Speech CommunicationAssociation (Interspeech) , 2019, pp. 679–683.[89] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson,“AutoVC: Zero-shot voice style transfer with only autoencoder loss,”in