Many-to-Many Voice Conversion using Cycle-Consistent Variational Autoencoder with Multiple Decoders
Keonnyeong Lee, In-Chul Yoo, and Dongsuk Yook
Artificial Intelligence Laboratory, Department of Computer Science and Engineering Korea University, Republic of Korea {gnl0813, icyoo, yook}@ai.korea.ac.kr
ABSTRACT
One of the obstacles in many-to-many voice conversion is the requirement of parallel training data, which contain pairs of utterances with the same linguistic content spoken by different speakers. Since collecting such parallel data is highly expensive, many studies have attempted to use non-parallel training data for many-to-many voice conversion. One such approach uses the variational autoencoder (VAE). Though it can handle many-to-many voice conversion without parallel training data, VAE based voice conversion methods suffer from low sound quality of the converted speech. One of the major reasons is that the VAE learns only the self-reconstruction path; the conversion path is not trained at all. In this paper, we propose a cycle consistency loss for the VAE to explicitly learn the conversion path. In addition, we propose to use multiple decoders to further improve the sound quality of the conventional VAE based voice conversion methods. The effectiveness of the proposed methods is validated using objective and subjective evaluations.
1. INTRODUCTION

Voice conversion (VC) is the task of converting the speaker-related voice characteristics of an utterance while maintaining its linguistic information. Conventional VC methods require parallel speech data for model training. The parallel speech data contain pairs of utterances that have the same linguistic contents spoken by different speakers. However, such parallel speech data are so expensive to collect that they restrict the use of VC in many applications. Therefore, many recent VC approaches attempt to use non-parallel training data. Early works using non-parallel training data adopt Gaussian mixture models (GMM) [1, 2, 3]. Recently, deep learning based VC approaches that have shown promising results use cycle-consistent adversarial networks (CycleGAN) [4, 5, 6, 7, 8], variational autoencoders (VAE) [9, 10, 11, 12], and VAE with generative adversarial networks (GAN) [13, 14].

In the CycleGAN [15] based VC approaches, the speech features of a source speaker are converted to match the characteristics of a target speaker using a GAN [16], and the converted speech features are converted back through another GAN to match the original speech features of the source speaker. By using the cycle-consistency loss [17], the linguistic contents are forced to be retained in the converted speech. However, the CycleGAN can learn only a one-to-one mapping between two speakers. To achieve complete mapping among $N$ speakers, $N(N-1)/2$ CycleGAN models must be trained separately, which increases the training time and the memory space prohibitively. Though extensions of the CycleGAN for many-to-many VC have been proposed [6, 7, 8], they do not scale well as the number of speakers increases. For example, the number of speakers used in the experiments of [6, 7, 8] was at most four.

The VAE based VC approaches, on the other hand, can perform many-to-many VC for hundreds of speakers using non-parallel training data. A VAE [18] is composed of an encoder and a decoder. In the VC task, the encoder transforms the input speech features into latent vectors containing the linguistic information of the input speech. Then, the latent vectors, together with a target speaker identity vector, which is typically represented as a one-hot vector, are fed into the decoder to generate the converted speech features of the target speaker. Since the decoder is conditioned on a speaker identity vector, this model is sometimes called the conditional VAE. Though the VAE models can be trained quickly, the sound quality of the converted speech is usually low. To improve the sound quality, a VAE and Wasserstein generative adversarial network (WGAN) [19] hybrid, called the variational autoencoding Wasserstein generative adversarial network (VAEWGAN) [13], was proposed. In this method, the decoder of the VAE is treated as the generator of a WGAN in order to train the decoder better. Though VAEWGAN based VC reduces some of the muffled sound, the quality of the converted speech is still unsatisfactory. One of the major drawbacks of the VAE based VC approaches is that the VAE models are not explicitly trained to convert the speech of a source speaker to that of a target speaker. Rather, they are trained to recover the input speech of the source speaker from the latent vectors and the source speaker identity vector.
In this paper, we propose to utilize a cycle consistency loss for the VAE to explicitly learn the mapping from a source speaker to a target speaker. To further improve the sound quality, we also propose a multi-decoder VAE which has a separate decoder for each target speaker. The cycle consistency loss and the multiple decoders can be incorporated into the VAEWGAN as well [20].

The rest of the paper is organized as follows. In Section 2, we describe the proposed methods in detail. Section 3 analyzes the experimental results, and Section 4 concludes the paper.

2. CYCLE-CONSISTENT VAE AND VAEWGAN

2.1. Variational Autoencoder
The loss function of the VAE is defined as follows:
\mathcal{L}_{\mathrm{VAE}}(\phi, \theta; x, c) = D_{\mathrm{KL}}\big( q_\phi(z|x) \,\|\, p(z) \big) - \mathbb{E}_{z \sim q_\phi(z|x)}\big[ \log p_\theta(x|z, c) \big],  (1)

where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence, $q_\phi(z|x)$ is an encoding model with parameter $\phi$ that infers the linguistic information of input speech $x$, $p(z)$ is a prior distribution for latent vector $z$, and $p_\theta(x|z, c)$ is a decoding model with parameter $\theta$ that generates the reconstructed speech using $z$ and source speaker identity vector $c$. To convert the speech from a source speaker to a target speaker, the source speaker identity vector $c$ is replaced with the target speaker identity vector $c'$. By minimizing equation (1), the VAE is trained to reconstruct the input speech from the latent vector $z$ and the source speaker identity vector $c$. Due to the absence of explicit model training for the conversion between the source speaker and the target speaker (i.e., only self-reconstruction training), the VAE based VC methods generally produce converted speech with low sound quality.
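Under the usual diagonal-Gaussian encoder and fixed-variance Gaussian decoder assumptions (not spelled out in the text), the two terms of equation (1) admit simple closed forms. A minimal NumPy sketch, with illustrative function names:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL term of equation (1) for a diagonal-Gaussian
    encoder q_phi(z|x) = N(mu, diag(exp(log_var))) and a standard-normal
    prior p(z) = N(0, I)."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def gaussian_nll(x, x_recon):
    """Reconstruction term -log p_theta(x|z, c) under a unit-variance
    Gaussian decoder, up to an additive constant."""
    return 0.5 * np.sum((x - x_recon) ** 2)

def vae_loss(x, x_recon, mu, log_var):
    # Equation (1): KL regularizer plus reconstruction error.
    return kl_to_standard_normal(mu, log_var) + gaussian_nll(x, x_recon)

# An encoder output that matches the prior contributes zero KL,
# and a perfect reconstruction contributes zero reconstruction error.
mu, log_var = np.zeros(16), np.zeros(16)
x = np.ones(16)
total = vae_loss(x, x, mu, log_var)
```

Minimizing this quantity over $\phi$ and $\theta$ is the self-reconstruction training described above.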
2.2. Variational Autoencoder with Wasserstein Generative Adversarial Network

The VAEWGAN has been proposed to improve the sound quality of the VAE based VC method. In this approach, the decoder of the VAE is the generator of the WGAN. The loss function of the WGAN is defined as follows:

\mathcal{L}_{\mathrm{WGAN}}(\psi, \phi; y, x, c') = \mathbb{E}_{y}\big[ D_\psi(y) \big] - \mathbb{E}_{z \sim q_\phi(z|x)}\big[ D_\psi(G_\theta(z)) \big],  (2)

where $G_\theta$ is the generator with parameter $\theta$, $D_\psi$ is the discriminator with parameter $\psi$, and $y$ is the speech from the target speaker represented by speaker identity vector $c'$. Since the decoder of the VAE is the generator of the WGAN, $G_\theta$ is $p_\theta$. Now, the loss function of the VAEWGAN is defined as follows:

\mathcal{L}_{\mathrm{VAEWGAN}}(\phi, \theta, \psi; x, c) = \mathcal{L}_{\mathrm{VAE}}(\phi, \theta; x, c) + \alpha \, \mathcal{L}_{\mathrm{WGAN}}(\psi, \phi; y, x, c'),  (3)

where $\alpha$ is the weight of the WGAN loss. Equation (3) is minimized with respect to the VAE and the generator, and maximized with respect to the discriminator. First, the VAE is trained in the same way as in Section 2.1. Second, the VAE and the WGAN are jointly trained such that the VAE receives an additional error signal from the discriminator of the WGAN. Though the VAEWGAN produces somewhat higher sound quality than the VAE, it can handle only one-to-one voice conversion. In the following sections, we propose extensions of the VAE and the VAEWGAN, called the cycle-consistent VAE (CycleVAE) and the cycle-consistent VAEWGAN (CycleVAEWGAN), respectively, which improve the performance of many-to-many voice conversion by using multiple decoders and explicitly learning the many-to-many mapping functions.
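Equations (2) and (3) can be sketched as follows, assuming the critic (discriminator) scores on real and generated speech have already been computed; the function names are illustrative, not from the paper:

```python
import numpy as np

def wgan_objective(d_real, d_fake):
    """Equation (2): E[D(y)] - E[D(G(z))], estimated from batches of
    critic scores. The discriminator maximizes this quantity; the
    generator (the VAE decoder) minimizes the negated second term."""
    return np.mean(d_real) - np.mean(d_fake)

def vaewgan_loss(vae_loss_value, d_real, d_fake, alpha=1.0):
    """Equation (3): VAE loss plus the alpha-weighted WGAN term."""
    return vae_loss_value + alpha * wgan_objective(d_real, d_fake)

# Example: a critic that scores real speech 1 and generated speech 0
# yields a WGAN term of 1, added on top of the VAE loss.
combined = vaewgan_loss(2.0, np.array([1.0, 1.0]), np.array([0.0, 0.0]), alpha=0.5)
```

In practice the two players are updated alternately on this same quantity, with opposite signs.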
2.3. Cycle-Consistent Variational Autoencoder (CycleVAE)

In order to improve the sound quality of the VAE based VC, we propose to use a separate decoder for each speaker instead of a single decoder for all speakers. We also propose to use the cycle consistency loss for explicit conversion path training. The speaker identity vectors are not needed for the multi-decoder VAE since each speaker has an independent decoder. The sound quality can be expected to improve because each decoder learns its corresponding speaker's voice characteristics through the additional conversion path training, whereas the conventional VAE must cover multiple speakers with only a single decoder trained by self-reconstruction alone.

Fig. 1 shows the concept of the CycleVAE for two speakers. When the speech $x$ from speaker $X$ is fed into the network, it passes through the encoder and is compressed into the latent vector $z$. The reconstruction error is computed using the reconstructed speech $x'_{X \to X}$ produced by the speaker $X$ decoder model $p_{\theta_X}$. Up to this point, the loss function is similar to that of the vanilla VAE, except that it does not require the speaker identity vectors:

\mathcal{L}_{\mathrm{VAE}'}(\phi, \theta; x, X) = D_{\mathrm{KL}}\big( q_\phi(z|x) \,\|\, p(z) \big) - \mathbb{E}_{z \sim q_\phi(z|x)}\big[ \log p_{\theta_X}(x|z) \big].  (4)

The same input speech $x$ from speaker $X$ also goes through the encoder and the speaker $Y$ decoder model $p_{\theta_Y}$ to generate the converted speech $x'_{X \to Y}$, which has the same linguistic contents as $x$ but in speaker $Y$'s voice. Then, the converted speech $x'_{X \to Y}$ goes through the encoder and $p_{\theta_X}$ to generate the converted-back speech $x''_{X \to Y \to X}$, which should recover the original speech $x$. This cyclic conversion encourages the explicit training of voice conversion from $X$ to $Y$ and from $Y$ to $X$.
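The self-reconstruction, conversion, and converted-back paths above can be sketched with toy linear stand-ins for the shared encoder and the two speaker-specific decoders (hypothetical dimensions and squared-error terms for illustration; the paper uses CNNs and log-likelihoods):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 16-dim "speech" frames, an 8-dim latent z, and one
# decoder matrix per speaker (no speaker identity vector is needed).
W_enc = rng.standard_normal((8, 16))
W_dec = {"X": rng.standard_normal((16, 8)),   # decoder p_theta_X
         "Y": rng.standard_normal((16, 8))}   # decoder p_theta_Y

def encode(x):
    return W_enc @ x                # deterministic posterior mean, for brevity

def decode(z, speaker):
    return W_dec[speaker] @ z       # speaker-specific decoder

def cycle_vae_terms(x, src, tgt):
    """Squared-error stand-ins for the reconstruction term of equation (4)
    and the cycle term of equation (5)."""
    z = encode(x)
    x_self = decode(z, src)                 # x'_{X->X}, self-reconstruction
    x_conv = decode(z, tgt)                 # x'_{X->Y}, conversion
    x_back = decode(encode(x_conv), src)    # x''_{X->Y->X}, converted back
    return np.sum((x - x_self) ** 2), np.sum((x - x_back) ** 2)

x = rng.standard_normal(16)
recon, cycle = cycle_vae_terms(x, "X", "Y")
lam = 1.0
loss_two_speakers = recon + lam * cycle     # equation (6) analogue
# With more speakers, sum the pairwise loss over every target speaker
# for the current source speaker (equation (7) analogue).
loss_many = sum(sum(cycle_vae_terms(x, "X", t)) for t in W_dec)
```

Only the converted-back speech is compared with the original $x$; the intermediate conversion $x'_{X \to Y}$ has no parallel reference, which is what makes the scheme non-parallel.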
The cycle consistency loss of the multi-decoder VAE is defined as follows:

\mathcal{L}_{\mathrm{cycle}}(\phi, \theta; x, X, Y) = D_{\mathrm{KL}}\big( q_\phi(z|x'_{X \to Y}) \,\|\, p(z) \big) - \mathbb{E}_{z \sim q_\phi(z|x'_{X \to Y})}\big[ \log p_{\theta_X}(x|z) \big].  (5)

Now, given the input speech $x$ from speaker $X$, the loss function of the CycleVAE for two speakers can be defined as follows:

\mathcal{L}_{\mathrm{CycleVAE}}(\phi, \theta; x, X, Y) = \mathcal{L}_{\mathrm{VAE}'}(\phi, \theta; x, X) + \lambda \, \mathcal{L}_{\mathrm{cycle}}(\phi, \theta; x, X, Y),  (6)

where $\lambda$ is the weight of the cycle consistency loss. It can easily be extended to more than two speakers by summing over all pairs of the training speakers. The loss function of the CycleVAE for more than two speakers is computed as follows for the input speech $x$ from speaker $X$:
\sum_{Y} \mathcal{L}_{\mathrm{CycleVAE}}(\phi, \theta; x, X, Y).  (7)

Figure 1. CycleVAE.
2.4. Cycle-Consistent Variational Autoencoder with Wasserstein Generative Adversarial Network (CycleVAEWGAN)

The CycleVAE can be extended to utilize the WGAN, as in the VAEWGAN case. In the CycleVAEWGAN, the decoders of the CycleVAE are shared with the generators of the WGANs; each decoder has its own WGAN. Fig. 2 shows the concept of the CycleVAEWGAN for two speakers. Since there are multiple WGANs, equation (2) is modified as follows:

\mathcal{L}_{\mathrm{WGAN}'}(\psi, \phi; y, x, Y) = \mathbb{E}_{y}\big[ D_{\psi_Y}(y) \big] - \mathbb{E}_{z \sim q_\phi(z|x)}\big[ D_{\psi_Y}(G_{\theta_Y}(z)) \big],  (8)

where $G_{\theta_Y}$ is the generator with parameter $\theta_Y$ for speaker $Y$, $D_{\psi_Y}$ is the discriminator with parameter $\psi_Y$ for speaker $Y$, and $y$ is the speech from target speaker $Y$. Since the decoders of the CycleVAE are the generators of the WGANs, $G_{\theta_Y}$ is $p_{\theta_Y}$. Now, the loss function of the CycleVAEWGAN is defined as follows:

\mathcal{L}_{\mathrm{CycleVAEWGAN}}(\phi, \theta, \psi; x, X, Y) = \mathcal{L}_{\mathrm{CycleVAE}}(\phi, \theta; x, X, Y) + \alpha \, \mathcal{L}_{\mathrm{WGAN}'}(\psi, \phi; x, x, X) + \alpha \, \mathcal{L}_{\mathrm{WGAN}'}(\psi, \phi; y, x, Y).  (9)

Note that $\mathcal{L}_{\mathrm{WGAN}'}$ appears twice in the equation: once for the self-reconstruction path and once for the conversion path. Equation (9) is minimized with respect to the CycleVAE and the generators, and maximized with respect to the discriminators. The first stage of the CycleVAEWGAN training is identical to the training procedure of the CycleVAE. In the second stage, the CycleVAE and the WGANs are jointly optimized, with the CycleVAE receiving additional error signals from the WGANs. The model can also easily be extended to more than two speakers by summing over all pairs of the training speakers. The loss function of the CycleVAEWGAN for more than two speakers is computed as follows for the input speech $x$ from speaker $X$:

\sum_{Y} \mathcal{L}_{\mathrm{CycleVAEWGAN}}(\phi, \theta, \psi; x, X, Y).  (10)

The proposed CycleVAEWGAN differs from [12] in that it can utilize multiple decoders and WGANs.
3. EXPERIMENTS

For the experiments, two male speakers and two female speakers (SF1, SF2, TM1, and TM2) from the VCC2018 dataset [21] were used. The numbers of training and testing utterances per speaker were 81 and 35, respectively. The speech was down-sampled to 22.05 kHz, and 36-dimensional Mel-frequency cepstral coefficients (MFCC), aperiodicities (AP), and fundamental frequency (F0) were extracted using the WORLD speech analyzer [22]. The encoders, the decoders, and the discriminators used gated linear units (GLU) [23], and batch normalization [24] was applied to each convolutional neural network (CNN) [25] layer. We built our models based on [11]. Fig. 3 shows the details of the encoder, decoder, and discriminator. We used the Adam optimizer [26] with a batch size of 8. The weights $\alpha$ and $\lambda$ were set to 0 and 1, respectively, when training the CycleVAE; for the training of the CycleVAEWGAN, both $\alpha$ and $\lambda$ were set to 1. All experiments were repeated 5 times starting from randomly initialized weights.
Figure 3. The architectures of the encoder, decoder, and discriminator used in the experiments. The target speaker identity vectors (Target ID) are not used for the multi-decoder CycleVAE and CycleVAEWGAN.
Encoder: input (H:36, W:128, C:1); CNN + GLU + batch normalization blocks with (K:3x9, S:1x1, C:5), (K:4x8, S:2x2, C:10), (K:4x8, S:2x2, C:10), (K:9x5, S:9x1, C:16); output (H:1, W:32, C:16); source ID (C:4).
Decoder: input (H:1, W:32, C:8); DeCNN + GLU + batch normalization blocks with (K:9x5, S:9x1, C:10), (K:4x8, S:2x2, C:10), (K:4x8, S:2x2, C:5), (K:3x9, S:1x1, C:2); output (H:36, W:128, C:2); target ID (C:4).
Discriminator: input (H:36, W:128, C:1); CNN + GLU + batch normalization blocks with (K:4x4, S:2x2, C:8), (K:4x4, S:2x2, C:16), (K:4x4, S:2x2, C:32), (K:3x4, S:1x2, C:16), (K:1x1, S:1x1, C:1), followed by a fully connected layer; output (H:1, W:1, C:1); target ID (C:4).

Figure 2. CycleVAEWGAN.

3.1. Objective Evaluations

One of the drawbacks of the VAE based approaches is the over-smoothing of the generated data [27]. The global variance (GV) of MFCCs can be used to measure the degree of over-smoothing, as high GV values correlate with the sharpness of the spectra. We computed the GV for each of the MFCC indices. Fig. 4 shows the average GV over all evaluation utterances for the real speech and for the speech converted by the conventional VAEWGAN and the proposed CycleVAE. The average GVs over all indices and all evaluation utterances were 0.247, 0.200, and 0.210 for the real speech and the converted speech by the VAEWGAN and the CycleVAE, respectively.

If the original and the converted speech utterances contain the same linguistic information, the difference between the MFCCs of the two utterances should be small. We used two metrics to measure this difference: the Mel-cepstral distortion (MCD) [27] and the modulation spectral distance (MSD) [28]. Tables 1 and 2 show the MCD and the MSD, respectively, for various VC methods. First, by comparing the baseline VAE and VAEWGAN columns of the tables, we confirmed that the VAEWGAN outperforms the VAE [13]. Second, to measure the effectiveness of the cycle consistency loss alone, we built the CycleVAE and the CycleVAEWGAN with a single common decoder (i.e., $p_{\theta_X}$ and $p_{\theta_Y}$ are shared in Fig. 1). The fourth and fifth columns of the tables show these results. By comparing the second and the fourth columns (or the third and the fifth columns) of the tables, we confirmed the effectiveness of the cycle consistency loss [12]. Finally, the results of the proposed multi-decoder approaches with the cycle consistency loss are shown in the last two columns of the tables. It can be seen that the multi-decoder approaches improve the performance further. It is interesting to note that, unlike for the VAE or the CycleVAE with a single common decoder, adding the WGANs to the CycleVAE with multiple decoders does not improve the performance further.
It is believed that the multi-decoder cycle consistency loss is effective enough at learning the conversion path explicitly that the additional WGANs for conversion path learning may not be necessary.
3.2. Subjective Evaluations

We also conducted two subjective evaluations: a naturalness test and a similarity test. A set of 16 utterances was selected randomly such that four utterances were assigned to each of the F to F, M to F, F to M, and M to M conversions, where F and M represent female and male, respectively. A total of 48 utterances (16 target speakers' utterances, 16 utterances converted by the conventional VAEWGAN, and 16 utterances converted by the proposed CycleVAE with multiple decoders) were played to the 10 listeners who participated in the subjective evaluations. The mean opinion score (MOS) was used for the naturalness test. The listeners rated the naturalness of the speech on a scale of 1 (bad) to 5 (excellent) as the utterances were played in random order. Table 3 shows that the proposed CycleVAE based VC generally exhibits higher naturalness scores than the conventional VAEWGAN based VC. In the similarity test, a target speaker's utterance was played first, and then the pair of utterances converted by the two methods was played in random order. The listeners were asked to select
Figure 4. Global variance of MFCCs for real speech utterances and the utterances converted by the VAEWGAN and the CycleVAE.
Table 1. MCD with standard deviation.

          VAE          VAEWGAN      CycleVAE     CycleVAEWGAN  CycleVAE     CycleVAEWGAN
                                    (single)     (single)      (multi)      (multi)
F to F    7.31 ± 0.41  7.33 ± 0.38  7.20 ± 0.42  7.11 ± 0.43   — ± 0.43     7.13 ± 0.42
Table 2. MSD with standard deviation.

          VAE          VAEWGAN      CycleVAE     CycleVAEWGAN  CycleVAE     CycleVAEWGAN
                                    (single)     (single)      (multi)      (multi)
F to F    1.87 ± 0.16  1.85 ± 0.16  1.86 ± 0.16  — ± 0.16      1.85 ± 0.15  1.85 ± 0.16
M to F    1.85 ± 0.14  1.83 ± 0.14  1.84 ± 0.14  1.83 ± 0.14   — ± 0.13     — ± 0.14
F to M    1.84 ± 0.17  1.83 ± 0.17  1.82 ± 0.16  — ± 0.16      1.82 ± 0.17  1.83 ± 0.17
M to M    1.85 ± 0.16  1.84 ± 0.17  1.83 ± 0.17  — ± 0.17      — ± 0.16     — ± 0.17
Average   1.86 ± 0.16  1.84 ± 0.16  1.84 ± 0.16  — ± 0.16      1.83 ± 0.15  1.83 ± 0.16

the more similar utterance to the target speaker's speech, or "fair" if they could not tell the difference. Table 4 shows that the proposed CycleVAE based VC outperforms the conventional VAEWGAN based VC significantly.
4. CONCLUSION

In this paper, we proposed new many-to-many voice conversion methods based on the VAE. The proposed methods use multiple decoders and explicitly learn the conversion path for many-to-many voice conversion. The effectiveness of the proposed methods was validated using objective and subjective evaluations. The proposed methods can be further extended by utilizing multiple encoders, i.e., one encoder for each source speaker. Also, replacing the vocoder with powerful neural vocoders such as WaveNet [29] or WaveRNN [30] is another future research direction.
ACKNOWLEDGEMENTS

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF-2017R1E1A1A01078157). It was also partly supported by the MSIT (Ministry of Science and ICT) under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by an IITP grant funded by the Korean government (MSIT) (No. 2018-0-00269).

REFERENCES

[1] Yannis Stylianou, Olivier Cappé, and Eric Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[2] Tomoki Toda, Alan Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[3] Elina Helander, Tuomas Virtanen, Jani Nurminen, and Moncef Gabbouj, "Voice conversion using partial least squares regression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 912–921, 2010.
[4] Takuhiro Kaneko and Hirokazu Kameoka, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," European Signal Processing Conference, pp. 2114–2118, 2018.
[5] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 6820–6824, 2019.
[6] Dongsuk Yook, In-Chul Yoo, and Seungho Yoo, "Voice conversion using conditional CycleGAN," International Conference on Computational Science and Computational Intelligence, pp. 1460–1461, 2018.
[7] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks," IEEE Spoken Language Technology Workshop, pp. 266–273, 2018.
[8] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," Interspeech, pp. 679–683, 2019.
[9] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from non-parallel corpora using variational autoencoder," Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1–6, 2016.
[10] Aaron van den Oord and Oriol Vinyals, "Neural discrete representation learning," Advances in Neural Information Processing Systems, pp. 6309–6318, 2017.
[11] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, 2019.
[12] Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, and Tomoki Toda, "Non-parallel voice conversion with cyclic variational autoencoder," Interspeech, pp. 674–678, 2019.
Table 3. Sound quality test (MOS and standard deviation).

          VAEWGAN      CycleVAE     Target Voice
F to F    2.83 ± 0.86  — ± 0.91     4.89 ± 0.39
M to F    2.15 ± 0.76  — ± 0.91     —
F to M    2.48 ± 0.92  — ± 0.94     4.88 ± 0.40
M to M    2.38 ± 0.73  — ± 0.87     —
Average   2.46 ± 0.86  — ± 0.94     4.88 ± 0.39

Table 4. Similarity test (%).

          VAEWGAN  Fair  CycleVAE
F to F    15.0     47.5  37.5
M to F    15.0     25.0  60.0
F to M    2.5      45.0  52.5
M to M    15.0     37.5  47.5
Average   12.0     39.0  49.0
[13] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," Interspeech, pp. 3364–3368, 2017.
[14] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," Interspeech, pp. 501–505, 2018.
[15] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[17] Tinghui Zhou, Philipp Krahenbuhl, Mathieu Aubry, Qixing Huang, and Alexei Efros, "Learning dense correspondence via 3D-guided cycle consistency," IEEE Conference on Computer Vision and Pattern Recognition, pp. 117–126, 2016.
[18] Diederik Kingma and Max Welling, "Auto-encoding variational Bayes," arXiv:1312.6114, 2013.
[19] Martin Arjovsky, Soumith Chintala, and Leon Bottou, "Wasserstein generative adversarial networks," International Conference on Machine Learning, pp. 214–223, 2017.
[20] Keonnyeong Lee, In-Chul Yoo, and Dongsuk Yook, "Voice conversion using cycle-consistent variational autoencoder," arXiv:1909.06805, 2019.
[21] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," The Speaker and Language Recognition Workshop, pp. 195–202, 2018.
[22] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[23] Yann Dauphin, Angela Fan, Michael Auli, and David Grangier, "Language modeling with gated convolutional networks," International Conference on Machine Learning, pp. 933–941, 2017.
[24] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," International Conference on Machine Learning, pp. 448–456, 2015.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[26] Diederik Kingma and Jimmy Lei Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations, 2015.
[27] Tomoki Toda, Alan Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[28] Shinnosuke Takamichi, Tomoki Toda, Alan Black, Graham Neubig, Sakriani Sakti, and Satoshi Nakamura, "Postfilters to modify the modulation spectrum for statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 755–767, 2016.