Towards Low-Resource StarGAN Voice Conversion using Weight Adaptive Instance Normalization
Mingjie Chen, Yanpei Shi, Thomas Hain
Department of Computer Science, University of Sheffield
{mchen33, yshi30, t.hain}@sheffield.ac.uk
ABSTRACT
Many-to-many voice conversion with non-parallel training data has seen significant progress in recent years, and StarGAN-based models have been of particular interest. However, most StarGAN-based methods only considered voice conversion settings where the number of speakers was small and the amount of training data was large. In this work, we aim at improving the data efficiency of the model and achieving many-to-many non-parallel StarGAN-based voice conversion for a relatively large number of speakers with limited training samples. In order to improve data efficiency, the proposed model uses a speaker encoder for extracting speaker embeddings and conducts adaptive instance normalization (AdaIN) on convolutional weights. Experiments are conducted with 109 speakers under two low-resource situations, where the number of training samples is 20 and 5 per speaker. An objective evaluation shows that the proposed model is better than the baseline methods. Furthermore, a subjective evaluation shows that, for both naturalness and similarity, the proposed model outperforms the baseline method.
Index Terms — Voice Conversion, Generative Adversarial Networks, Low-resource
1. INTRODUCTION
Given one voice sample of a source speaker and one voice sample of a target speaker, voice conversion aims at generating a voice sample that contains the speech content of the source sample and the speaker properties of the target speaker. Statistical models such as Gaussian mixture models (GMMs) [1, 2] have been used for voice conversion, and deep neural networks (DNNs) [3, 4] have also been popular. However, both the GMM-based and the DNN-based models required aligned parallel data for training, where the source sample and the target sample contain the same speech content. Obtaining aligned parallel data is not easy and requires time-consuming human work. More recently, generative models such as the variational auto-encoder (VAE) [5, 6, 7] and the generative adversarial network (GAN) [8] have gained attention for non-parallel voice conversion.

In terms of GAN-based models for non-parallel voice conversion, CycleGAN-VC [9] used the CycleGAN [10] model, with a cycle-consistency loss to avoid the need for aligned parallel data. StarGAN-VC [11] proposed to use the StarGAN [12] model for voice conversion, with a domain classifier module to enhance the similarity of converted samples. However, StarGAN-VC suffered from a partial conversion issue, meaning the converted voices sounded neutral, and the domain classifier module hurt voice quality. StarGAN-VC2 [13] and [14] were proposed to improve the performance of StarGAN-based voice conversion by removing the domain classifier module. StarGAN-VC2 proposed to use conditional instance normalization [15] to improve the speaker adaptation ability of the model. However, feature-based normalization layers [16, 15, 17] have been found to cause information loss [18], which can lead to low data efficiency.

Most of the StarGAN-based voice conversion research mentioned above used a relatively small number of speakers. For example, StarGAN-VC and StarGAN-VC2 used only 4 speakers, and the amount of training data per speaker was 5 minutes. [19] trained the StarGAN-VC model with 37 speakers, but the training data per speaker was 30 minutes on average. It is unclear whether StarGAN-based models can keep their performance when the number of speakers increases and the number of training samples decreases.

This work aims at improving the data efficiency of the StarGAN-based model and exploring voice conversion under low-resource situations. We propose a weight adaptive instance normalization StarGAN-VC (WAStarGAN-VC) model. Two approaches are used to improve the data efficiency of the model: (1) unlike StarGAN-VC and StarGAN-VC2, which only use the speaker identity as target speaker information, we use a speaker encoder to extract speaker embeddings from target speech; (2) instead of normalizing features, we follow the idea from StyleGAN2 [18] and conduct adaptive instance normalization on the convolutional weights, to avoid the information loss caused by normalization layers. The voice conversion experiments are conducted with 109 speakers under two low-resource situations. We use speaker identification and verification for the objective evaluation. For the subjective evaluation, we evaluate the proposed model using an ABX test (similarity) and an AB test (naturalness). The evaluation results show that WAStarGAN-VC outperforms the baseline models.
2. STARGAN-BASED VOICE CONVERSION
This section reviews two previous StarGAN-based voice conversion models: the StarGAN-VC [11] model and the StarGAN-VC2 [13] model.
StarGAN-VC [11] adapted the StarGAN [12] model for voice conversion. The model is composed of three modules: a generator G(), a discriminator D() and a domain classifier C(). Given real data x ~ p(x), where p(x) is the real data distribution, and a target speaker identity s_y, the generator converts the data x to data y:

y = G(x, s_y)    (1)

As shown in Equation 2, the discriminator takes in data x^* and a speaker identity s^*, where (x^*, s^*) can be real source data with the source speaker identity (x, s_x), or converted data with the target speaker identity (y, s_y):

o = D(x^*, s^*),    (2)

where o, the output of the discriminator, is the probability that the input x^* belongs to the real data distribution, and s_x is the source speaker identity.

The loss function of StarGAN-VC is composed of four parts:

\mathcal{L}^{G}_{StarGAN\text{-}VC} = \mathcal{L}^{G}_{adv} + \mathcal{L}^{G}_{cyc} + \mathcal{L}^{G}_{id} + \mathcal{L}^{G}_{domain}    (3)
\mathcal{L}^{D}_{StarGAN\text{-}VC} = \mathcal{L}^{D}_{adv}    (4)

The adversarial losses are defined as:

\mathcal{L}^{G}_{adv} = -\mathbb{E}_{x, s_y}[D(G(x, s_y), s_y)]    (5)
\mathcal{L}^{D}_{adv} = -\mathbb{E}_{x, s_x}[D(x, s_x)] - \mathbb{E}_{x, s_y}[1 - D(G(x, s_y), s_y)]    (6)

Besides, StarGAN-VC also used the identity loss \mathcal{L}_{id} and the cycle-consistency loss \mathcal{L}_{cyc}:

\mathcal{L}^{G}_{id} = \mathbb{E}_{x, s_x}[\| x - G(x, s_x) \|]    (7)
\mathcal{L}^{G}_{cyc} = \mathbb{E}_{x, s_y, s_x}[\| x - G(G(x, s_y), s_x) \|]    (8)

The domain classifier is used to force the generated data y to be similar to the target speaker s_y:

\mathcal{L}^{C}_{domain} = -\mathbb{E}_{x, s_x}[p_C(s_x | x)]    (9)
\mathcal{L}^{G}_{domain} = -\mathbb{E}_{x, s_y}[p_C(s_y | G(x, s_y))]    (10)

One limitation of the StarGAN-VC model is that the domain classifier loss hurts voice quality [13]. Additionally, using only the target speaker identity s_y in the generator and the discriminator causes the partial conversion issue [13]. To address the voice quality issue, the StarGAN-VC2 model removed the domain classifier module. Besides, to improve similarity, StarGAN-VC2 used the concatenation of the source speaker embedding e_x and the target speaker embedding e_y as the speaker condition input to the generator and the discriminator:

e_{xy} = concat([e_x, e_y]),    (11)

where concat is the concatenation function, and the speaker embeddings e_x and e_y are obtained through the speaker identities s_x and s_y.

StarGAN-VC2 incorporated conditional instance normalization [15] (CIN) in the generator. In the StarGAN-VC2 model, CIN normalizes the feature f across time and conducts an affine transformation given the speaker condition e_{xy}:

CIN(f) = \gamma(e_{xy}) \cdot \frac{f - \mu}{\sigma} + \beta(e_{xy}),    (12)

where CIN(f) is the output of CIN, \gamma() and \beta() are linear functions, and \mu and \sigma are the mean and the standard deviation of the feature f over time.

The training objective of StarGAN-VC2 is similar to that of StarGAN-VC, including the adversarial loss, the identity loss and the cycle-consistency loss, but without the domain classifier loss.

Fig. 1. Model architecture of the proposed WAStarGAN-VC model; spk emb denotes speaker embedding.
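To make Equation 12 concrete, the following is a minimal PyTorch sketch of a CIN layer. This is a sketch under our assumptions: the module and parameter names are hypothetical, and the paper does not state that \gamma() and \beta() are single linear layers, only that they are linear functions.

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm(nn.Module):
    """Sketch of CIN (Eq. 12): normalize the feature over time,
    then apply a speaker-conditioned affine transform."""

    def __init__(self, channels, cond_dim, eps=1e-8):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, channels)  # gamma(e_xy)
        self.to_beta = nn.Linear(cond_dim, channels)   # beta(e_xy)
        self.eps = eps

    def forward(self, f, e_xy):
        # f: [B, C, T] feature; e_xy: [B, cond_dim] speaker condition
        mu = f.mean(dim=-1, keepdim=True)              # mean over time
        sigma = f.std(dim=-1, keepdim=True) + self.eps # std over time
        gamma = self.to_gamma(e_xy).unsqueeze(-1)      # [B, C, 1]
        beta = self.to_beta(e_xy).unsqueeze(-1)        # [B, C, 1]
        return gamma * (f - mu) / sigma + beta
```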
Fig. 2. Module details of the proposed WAStarGAN-VC: spk id denotes speaker identity, spk emb denotes speaker embedding.
3. STARGAN VOICE CONVERSION WITH WEIGHT ADAPTIVE INSTANCE NORMALIZATION
Given source data x_s ~ p(x) and target data x_t ~ p(x), the proposed WAStarGAN-VC model is expected to generate data y_t that contains the speech content of x_s and the speaker properties of x_t. WAStarGAN-VC is composed of three modules: a generator G(), a discriminator D() and a speaker encoder E().

Both StarGAN-VC and StarGAN-VC2 used the speaker identity as the target speaker information input. In contrast, in order to improve the data efficiency of the model, WAStarGAN-VC uses a speaker encoder to extract speaker embeddings from the target data; by doing this, the model is expected to learn speaker embeddings more efficiently. On the other hand, it has been found that normalization layers such as instance normalization [16] can cause information loss [18]. WAStarGAN-VC therefore normalizes and transforms the convolutional weights, as in StyleGAN2 [18], to improve the data efficiency of the model.

Fig. 3. Weight adaptive instance normalization: spk emb denotes speaker embedding, gamma and beta are affine parameters, 'GLU' denotes the activation function. B, I, J, K are the batch size, outcoming channels, incoming channels and kernel size respectively.
WAStarGAN-VC uses a 2-1-2 model architecture for the generator, similar to CycleGAN-VC2 [20] and StarGAN-VC2 [13]. The generator contains three parts: the downsampling blocks, the bottleneck blocks and the upsampling blocks. As shown in Figure 2, the downsampling and upsampling blocks use 2D-convolutional layers and instance normalization [16] (IN); a minimal sketch of such a block is shown below. There are 9 bottleneck blocks, each containing a 1D-convolutional layer with weight adaptive instance normalization (W-AdaIN). Gated linear units (GLU) are used as the activation function.
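The sketch below shows one down/upsampling block, assuming a standard conv + IN + GLU ordering; the channel sizes, kernel and stride are illustrative assumptions, since the paper only specifies the block-level design.

```python
import torch
import torch.nn as nn

class GLUConvBlock(nn.Module):
    """Sketch of a 2D conv + instance norm + GLU block, as used in the
    down/upsampling parts of the generator (hyperparameters assumed)."""

    def __init__(self, in_ch, out_ch, kernel=3, stride=2):
        super().__init__()
        # twice the output channels: one half is the signal, one the gate
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel, stride,
                              padding=kernel // 2)
        self.norm = nn.InstanceNorm2d(2 * out_ch)

    def forward(self, x):
        h = self.norm(self.conv(x))
        a, b = h.chunk(2, dim=1)
        return a * torch.sigmoid(b)  # GLU: signal gated by sigmoid
```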
Adaptive instance normalization [17] (AdaIN) was initially proposed for image style transfer tasks. Building on CIN (Equation 12), AdaIN uses a speaker encoder to extract the speaker embedding e_y = E(y):

AdaIN(f) = \gamma(e_y) \cdot \frac{f - \mu}{\sigma} + \beta(e_y),    (13)

where AdaIN(f) is the output of AdaIN, e_y is the speaker embedding, y is the target data, f is the feature, \mu and \sigma are the mean and the standard deviation of the feature f across time, and \gamma() and \beta() are linear functions.

This work improves the data efficiency of the model by using the W-AdaIN module in the bottleneck blocks of the generator. In WAStarGAN-VC, as shown in Figure 3, the 1D-convolutional weight w has the shape [I, J, K], where I is the outcoming channel dimensionality of the convolutional layer, J is the incoming channel dimensionality, and K is the kernel size.

The target speaker data x_t is fed into the speaker encoder to get the speaker embedding e_t = E(x_t). e_t is fed into linear functions to get the affine parameters \gamma and \beta, which have the shape [B, J], where B is the batch size. They are then expanded on the second and the fourth dimension:

\gamma_{b,\cdot,j,\cdot}, \beta_{b,\cdot,j,\cdot} = \gamma_{b,j}, \beta_{b,j}

The weight w is expanded on the first dimension, where w_{i,j,k} is an element of w:

w_{\cdot,i,j,k} = w_{i,j,k}

Next, the expanded weight w_{\cdot,i,j,k} is transformed by \gamma_{b,\cdot,j,\cdot} and \beta_{b,\cdot,j,\cdot}:

w^{*}_{b,i,j,k} = \gamma_{b,\cdot,j,\cdot} \cdot w_{\cdot,i,j,k} + \beta_{b,\cdot,j,\cdot}    (14)

The transformed weights w^{*}_{b,i,j,k} are normalized across the outcoming dimension I:

w^{**}_{b,i,j,k} = \frac{w^{*}_{b,i,j,k} - \mu_{b,\cdot,j,k}}{\sigma_{b,\cdot,j,k}},    (15)

where w^{**}_{b,i,j,k} is the output of the W-AdaIN module, and \mu_{b,\cdot,j,k} and \sigma_{b,\cdot,j,k} are the statistics of w^{*}_{b,i,j,k} across the outcoming dimension I. Finally, the convolution is conducted on the feature using the adapted weight w^{**}.

To get a speaker-conditioned discriminator output, as in [14] and StarGAN-v2 [21], the discriminator uses N parallel speaker-conditioned output layers, where N is the number of speakers in the training dataset. As shown in Figure 2, the first 4 layers of the discriminator are shared across the N speakers. For one input sample, a switch selects one of the speaker-conditioned output layers according to the input speaker identity; hence the output of the discriminator is conditioned on the speaker identity. The speaker encoder also uses speaker-conditioned parallel output layers. Moreover, the speaker encoder uses a statistics pooling layer as in the x-vector model [22].

In WAStarGAN-VC, the training objective includes three parts: the adversarial loss, the cycle-consistency loss and the speaker embedding reconstruction loss. For the adversarial loss, the least-squares loss [23] is used, as in StarGAN-VC2 [13]. The cycle-consistency loss is the same as in Equation 8. The speaker embedding reconstruction loss \mathcal{L}_{spk} tries to reconstruct the target speaker embedding e_t from the converted data y_t:

\mathcal{L}_{spk} = \mathbb{E}_{x_s, x_t}[\| E(x_t) - E(G(x_s, E(x_t))) \|]    (16)
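The following PyTorch sketch illustrates the W-AdaIN mechanism of Equations 14 and 15: the speaker embedding predicts per-incoming-channel affine parameters that modulate the convolutional weight, the modulated weight is normalized across the outcoming dimension I, and a grouped convolution applies a different adapted weight to each batch item. All names and the initialization are our assumptions; this is a sketch of the mechanism, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WAdaINConv1d(nn.Module):
    """Sketch of a 1D convolution with weight adaptive instance
    normalization (W-AdaIN), following Eqs. 14 and 15."""

    def __init__(self, in_ch, out_ch, kernel_size, emb_dim, eps=1e-8):
        super().__init__()
        # base weight w with shape [I, J, K]
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size) * 0.02)
        self.to_gamma = nn.Linear(emb_dim, in_ch)  # gamma: [B, J]
        self.to_beta = nn.Linear(emb_dim, in_ch)   # beta:  [B, J]
        self.padding = kernel_size // 2
        self.eps = eps

    def forward(self, f, spk_emb):
        # f: [B, J, T] feature; spk_emb: [B, emb_dim] speaker embedding
        B, J, T = f.shape
        I, _, K = self.weight.shape
        # expand affine parameters to [B, 1, J, 1]
        gamma = self.to_gamma(spk_emb).view(B, 1, J, 1)
        beta = self.to_beta(spk_emb).view(B, 1, J, 1)
        w = self.weight.unsqueeze(0)            # expand w to [1, I, J, K]
        w = gamma * w + beta                    # Eq. 14: [B, I, J, K]
        mu = w.mean(dim=1, keepdim=True)        # statistics across I
        sigma = w.std(dim=1, keepdim=True) + self.eps
        w = (w - mu) / sigma                    # Eq. 15
        # grouped conv applies a different adapted weight per batch item
        f = f.reshape(1, B * J, T)
        w = w.reshape(B * I, J, K)
        out = F.conv1d(f, w, padding=self.padding, groups=B)
        return out.reshape(B, I, -1)            # [B, I, T]
```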
4. EXPERIMENT IMPLEMENTATION
The experiments use the VCTK [24] dataset, which contains English studio speech recordings of 109 speakers; the average number of speech samples per speaker is 400.

The experiments are split into three situations according to the number of speakers and the number of training samples: (1) in the first situation, 10 speakers with the full set of training samples are used; (2) in the second situation, 109 speakers with 20 samples per speaker are used; (3) in the third situation, 109 speakers with 5 samples per speaker are used. StarGAN-VC and StarGAN-VC2 are used as baseline methods. Since there are no official open-source implementations of the StarGAN-VC and StarGAN-VC2 models, we implemented the two baseline models ourselves. Source code is available at https://github.com/MingjieChen/LowResourceVC; voice samples are available at https://minidemo.dcs.shef.ac.uk/wastarganvc/.

The waveform data is downsampled to 22.05 kHz. Mel-cepstral coefficients (MCEPs) are extracted using the PyWorld [25] toolkit. The StarGAN-based models only convert the MCEPs. As in [11] and [13], the logarithmic fundamental frequencies (F0s) are transformed linearly. The WORLD [25] vocoder generates the waveform from the converted MCEPs, the transformed F0s and the aperiodicities (APs). Finally, the loudness of the generated waveform is normalized using the PyLoudNorm [26] toolkit. A sketch of this pipeline is given after the table below.

Table 1. Objective evaluation results: ACC (%) denotes speaker identification accuracy, EER (%) denotes speaker verification equal error rate. N is the number of speakers, M is the number of training samples per speaker, S is the number of converted samples used for evaluation.

              N=10, M=Full, S=900   N=109, M=20, S=5400   N=109, M=5, S=5400
Model         ACC     EER           ACC     EER           ACC     EER
StarGAN-VC    64.4    14.88         54.4    21.96         -       -
StarGAN-VC2   91.5    2.99          79.6    4.61          62.6    8.27
Ours          97.0    0.66          95.9    1.77          92.5    3.56
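For reference, here is a minimal sketch of the described analysis-synthesis pipeline using the PyWorld and PyLoudNorm toolkits. The MCEP dimensionality (36) and the loudness target (-23 LUFS) are assumptions not stated in the paper, and the per-speaker log-F0 statistics would be estimated from each speaker's training data.

```python
import numpy as np
import librosa
import pyworld as pw
import pyloudnorm as pyln

FS = 22050          # sampling rate used in the paper
MCEP_DIM = 36       # assumed MCEP dimensionality

def analyze(path):
    """Extract F0, MCEPs and aperiodicities with WORLD."""
    x, _ = librosa.load(path, sr=FS)
    x = x.astype(np.float64)            # WORLD expects float64
    f0, t = pw.harvest(x, FS)           # F0 contour
    sp = pw.cheaptrick(x, f0, t, FS)    # spectral envelope
    ap = pw.d4c(x, f0, t, FS)           # aperiodicity
    mcep = pw.code_spectral_envelope(sp, FS, MCEP_DIM)
    return f0, mcep, ap

def convert_f0(f0, src_stats, tgt_stats):
    """Linear transformation of log F0 on voiced frames, using
    per-speaker (mean, std) statistics of log F0."""
    m_s, s_s = src_stats
    m_t, s_t = tgt_stats
    lf0 = np.zeros_like(f0)
    voiced = f0 > 0
    lf0[voiced] = (np.log(f0[voiced]) - m_s) / s_s * s_t + m_t
    return np.where(voiced, np.exp(lf0), 0.0)

def synthesize(f0, mcep, ap):
    """WORLD synthesis from converted MCEPs, followed by loudness
    normalization (the -23 LUFS target is an assumption)."""
    fft_size = pw.get_cheaptrick_fft_size(FS)
    sp = pw.decode_spectral_envelope(np.ascontiguousarray(mcep), FS, fft_size)
    y = pw.synthesize(f0, sp, ap, FS)
    meter = pyln.Meter(FS)
    return pyln.normalize.loudness(y, meter.integrated_loudness(y), -23.0)
```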
The proposed WAStarGAN-VC model is implemented using the PyTorch [27] toolkit. The optimizer is Adam [28], with learning rates of 2e-4 for the generator and 1e-4 for the discriminator. The MCEPs are randomly cropped into 256-frame segments during training. The batch size is 8, and the training process takes 250k iterations over 30 hours on a single GPU.
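A minimal sketch of this optimization setup follows; generator and discriminator are hypothetical module handles, and the Adam beta values are assumed (the paper only specifies the learning rates).

```python
import torch

def make_optimizers(generator, discriminator):
    # learning rates follow the paper; betas are assumed GAN defaults
    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999))
    return g_opt, d_opt

def random_crop(mcep, seg_len=256):
    """Randomly crop a [C, T] MCEP sequence to a 256-frame segment."""
    t0 = torch.randint(0, mcep.shape[-1] - seg_len + 1, (1,)).item()
    return mcep[..., t0:t0 + seg_len]
```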
5. EXPERIMENT RESULTS
The evaluation includes an objective evaluation and a subjective evaluation. For the objective evaluation, we evaluate the models in all three situations. For the subjective evaluation, we only evaluate the StarGAN-VC2 model and the WAStarGAN-VC model in the second situation.
For the objective evaluation, as in [29], speaker identification accuracy (ACC) and speaker verification equal error rate (EER) are used to measure the quality of the converted samples. In this work, an x-vector [22] model is pretrained on the VCTK dataset for all 109 speakers, and the ACC and EER of the converted samples are used as evaluation metrics. For the third situation, where the number of training samples is 5, the StarGAN-VC model failed to generate sensible voices.

As shown in Table 1, in all three situations the proposed model yields the best ACC and EER results. For the first situation, WAStarGAN-VC gets 97.0% ACC and 0.66% EER; StarGAN-VC2 is slightly worse (91.5% ACC, 2.99% EER), and StarGAN-VC is much worse (64.4% ACC, 14.88% EER). For the second situation, WAStarGAN-VC gets 95.9% ACC and 1.77% EER, StarGAN-VC2 gets 79.6% ACC and 4.61% EER, and StarGAN-VC gets 54.4% ACC and 21.96% EER; both baseline models are much worse in this situation. For the third situation, WAStarGAN-VC gets 92.5% ACC and 3.56% EER, while StarGAN-VC2 gets 62.6% ACC and 8.27% EER, much worse than the proposed model.

The objective results show that our proposed model is slightly better than StarGAN-VC2 when using the full set of training samples for 10 speakers. However, in the low-resource situations, our proposed model is much better than both StarGAN-VC and StarGAN-VC2. This may be because the proposed model has better data efficiency, which enables it to keep its performance under the low-resource situations. A sketch of how the two metrics could be computed is given below.
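The sketch below shows how ACC and EER could be computed from a pretrained x-vector model. Scoring verification trials by similarity and deriving the EER from an ROC curve are our assumptions about the evaluation protocol; the function names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

def speaker_id_accuracy(logits, target_ids):
    """ACC: fraction of converted samples classified as the intended
    target speaker. logits: [S, N] classifier outputs over N speakers;
    target_ids: [S] intended target speaker indices."""
    return float(np.mean(np.argmax(logits, axis=1) == target_ids))

def equal_error_rate(scores, labels):
    """EER from verification trial scores (label 1 = same speaker),
    e.g. cosine similarities between x-vectors of converted samples
    and real target-speaker audio."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FPR = FNR
    return (fpr[idx] + fnr[idx]) / 2.0
```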
Fig. 4. Subjective evaluation results.
To assess naturalness and similarity, we conducted listening tests comparing WAStarGAN-VC and StarGAN-VC2. The two models were trained under the second situation, where the number of speakers is 109 and the number of training samples per speaker is 20. AB tests are used for the naturalness evaluation: evaluators choose, from two samples generated by the two models, the one with better naturalness. For the similarity evaluation, ABX tests are used: evaluators choose, from two samples, the one that is more similar to the real target sample. A subset of 10 speakers is randomly selected (5 male and 5 female). In total, 90 samples (10 x 9 = 90, all conversion directions) are evaluated for each model.

As shown in Figure 4, for both naturalness and similarity, the proposed model obtains the most choices: WAStarGAN-VC gets 68.4% of the choices for naturalness and 46.6% for similarity. The large gap between WAStarGAN-VC and StarGAN-VC2 on naturalness might be because the W-AdaIN module in WAStarGAN-VC alleviates information loss, thereby improving naturalness.

However, for similarity, 36.1% of the choices are for the option 'cannot be decided'. We computed the correlations of the three choices between naturalness and similarity: when the naturalness choice is 'cannot be decided', the probability of the similarity choice also being 'cannot be decided' is 81.5%. This suggests that naturalness is correlated with similarity when naturalness is low.
6. CONCLUSION
In this work, we proposed the WAStarGAN-VC model to achieve StarGAN-based voice conversion under low-resource situations. The subjective and objective evaluation results show that our proposed model performs better than the baseline models on both naturalness and similarity. Future work could explore one-shot voice conversion using StarGAN-based models.

7. REFERENCES

[1] Alexander Kain and Michael W Macon, "Spectral voice conversion for text-to-speech synthesis," in Proceedings of ICASSP. IEEE, 1998, vol. 1, pp. 285–288.
[2] Yannis Stylianou, Olivier Cappé, and Eric Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[3] Srinivas Desai, Alan W Black, B Yegnanarayana, and Kishore Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[4] Elias Azarov, Maxim Vashkevich, Denis Likhachov, and Alexander A Petrovsky, "Real-time voice conversion using artificial neural networks with rectified linear units," in INTERSPEECH, 2013, pp. 1032–1036.
[5] Romain Lopez, Jeffrey Regier, Michael I Jordan, and Nir Yosef, "Information constraints on auto-encoding variational bayes," in NIPS, 2018, pp. 6114–6125.
[6] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in APSIPA. IEEE, 2016, pp. 1–6.
[7] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," arXiv preprint arXiv:1804.02812, 2018.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in NIPS, 2014, pp. 2672–2680.
[9] Takuhiro Kaneko and Hirokazu Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint arXiv:1711.11293, 2017.
[10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of ICCV, 2017, pp. 2223–2232.
[11] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in SLT. IEEE, 2018, pp. 266–273.
[12] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proceedings of CVPR, 2018, pp. 8789–8797.
[13] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," arXiv preprint arXiv:1907.12279, 2019.
[14] Shindong Lee, BongGu Ko, Keonnyeong Lee, In-Chul Yoo, and Dongsuk Yook, "Many-to-many voice conversion using conditional cycle-consistent adversarial networks," in ICASSP. IEEE, 2020, pp. 6279–6283.
[15] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur, "A learned representation for artistic style," arXiv preprint arXiv:1610.07629, 2016.
[16] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[17] Xun Huang and Serge Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Proceedings of ICCV, 2017, pp. 1501–1510.
[18] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila, "Analyzing and improving the image quality of StyleGAN," in Proceedings of CVPR, 2020, pp. 8110–8119.
[19] Ruobai Wang, Yu Ding, Lincheng Li, and Changjie Fan, "One-shot voice conversion using StarGAN," in Proceedings of ICASSP. IEEE, 2020, pp. 7729–7733.
[20] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," in Proceedings of ICASSP, 2019.
[21] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha, "StarGAN v2: Diverse image synthesis for multiple domains," in Proceedings of CVPR, 2020, pp. 8188–8197.
[22] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.
[23] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley, "Least squares generative adversarial networks," in Proceedings of ICCV, 2017, pp. 2794–2802.
[24] Junichi Yamagishi, Christophe Veaux, Kirsten MacDonald, et al., "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," 2019.
[25] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[26] Christian Steinmetz, "csteinmetz1/pyloudnorm: 0.1.0 (version v0.1.0)," 2019.
[27] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," in NIPS, 2019, pp. 8026–8037.
[28] Diederik P Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[29] Sercan Arik, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou, "Neural voice cloning with a few samples," in NIPS, 2018.