Adversarially learning disentangled speech representations for robust multi-factor voice conversion
Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu*, Helen Meng

Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR, China
{jie-wang19, lijb19, zxt20}@mails.tsinghua.edu.cn, {zywu, hmmeng}@se.cuhk.edu.hk

ABSTRACT
Factorizing speech into disentangled speech representations is vital to achieve highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC only factorize speech into speaker and content, lacking controllability over other prosody-related factors. State-of-the-art speech representation learning methods for more speech factors use primary disentanglement algorithms such as random resampling and ad-hoc bottleneck layer size adjustment, which however can hardly ensure robust speech representation disentanglement. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled by an adversarial network inspired by BERT. The adversarial network is used to minimize the correlations between the speech representations by randomly masking one of the representations and predicting it from the others. A word prediction network is also adopted to learn a more informative content representation. Experimental results show that the proposed speech representation learning framework significantly improves the robustness of VC on multiple factors, increasing the conversion rate from 48.2% to 57.1% and exceeding the state-of-the-art method by 31.2% in ABX preference.
Index Terms — disentangled speech representation learning, multi-factor voice conversion, prosody control in voice conversion, adversarial learning, gradient reversal layer
1. INTRODUCTION
Voice conversion (VC) aims at converting the input speech of a source speaker to sound as if uttered by a target speaker without altering the linguistic content [1]. Besides the conversion of timbre, conversions can also be conducted in various domains such as prosody, pitch, rhythm or other non-linguistic domains. Representation learning methods for these speech factors have already been proposed and applied in many fields of speech processing [2, 3, 4, 5, 6, 7]. However, directly applying the speech representations extracted by these methods in VC may cause unexpected conversions of other speech factors, as the representations are not necessarily orthogonal. Therefore, disentangling the representations of the various intermingled informative factors in the speech signal is crucial to achieve highly controllable VC [8].

* Corresponding author.
Conventionally, only speaker and content information are factorized in VC. The auto-encoder, composed of an encoder and a decoder, has been proposed and widely used for VC [9, 10, 11]. During training, the decoder reconstructs the speech from the speaker and content representations extracted by the encoder or by other pre-trained extractors. Variational autoencoder based methods [12, 13] model the latent space of content information as Gaussian distributions to pursue the regularization property. Vector quantization based methods [14] further model content information as discrete distributions, which are more related to the distribution of phonetic information. An auxiliary adversarial speaker classifier has been adopted [15] to encourage the encoder to cast away speaker information from content information by minimizing the mutual information between their representations [16].

To overcome the problem that prosody is also converted when the speaker representation is replaced in conventional VC, different information bottlenecks have been applied to decompose the speaker information into timbre and other prosody-related factors such as rhythm and pitch [17]. To improve disentanglement, restricted bottleneck layer sizes encourage the encoders to discard the information which can be learnt from other bottlenecks. Random resampling is also used in the information bottlenecks to remove rhythm information from the content and pitch representations.

However, without explicit disentanglement modeling, random resampling [18] and restricting the sizes of the bottleneck layers can only achieve limited disentanglement of the speech representations. Random resampling, which is usually implemented by dividing the speech into segments and resampling each segment with linear interpolation along the time dimension, can only remove time-related information such as rhythm. Moreover, random resampling has been proved to be a partial disentanglement algorithm that can only contaminate a random portion of the rhythm information [17]. Besides, the sizes of the bottleneck layers need to be carefully designed to extract disentangled speech representations, which is ad-hoc and may not be suitable for other datasets. Furthermore, the content encoder is actually a residual encoder, which cannot ensure that the content information is modeled only in the content representation.

Fig. 1. Architecture of the proposed multi-factor voice conversion system with adversarially disentangled speech representation learning: (a) overall architecture; (b) prediction head layer.

In this paper, to achieve robust and highly controllable style transfer for multi-factor VC, we propose a disentangled speech representation learning framework based on adversarial learning. The proposed framework explicitly removes the correlations between the speech representations characterizing different factors of speech with an adversarial network inspired by BERT [19]. The speech is first decomposed into four speech representations which represent content, timbre and two other prosody-related factors, rhythm and pitch. During training, one of the speech representations is randomly masked and inferred from the remaining representations by the adversarial mask-and-predict (MAP) network. The MAP network is trained to maximize the correlations between the masked and the remaining representations, while the speech representation encoders are trained to minimize the correlations by taking the reversed gradient of the MAP network.
In this way, the representation learning framework is trained in an adversarial manner, with the speech representation encoders trying to disentangle the representations while the MAP network tries to maximize the representation correlations. A word prediction network is employed to predict a word existence vector from the content representation, which indicates whether each vocabulary word exists in the reference speech. The decoder reconstructs the speech from the representations during training and achieves VC on multiple factors by replacing the corresponding speech representations.

Experimental results show that the proposed speech representation learning framework significantly improves the robustness of VC on multiple factors, increasing the conversion rate from 48.2% to 57.1% and exceeding the state-of-the-art speech representation learning method for multiple factors by 31.2% in ABX preference. Furthermore, the proposed framework eschews the laborious manual effort of sophisticated bottleneck tuning.
2. METHODOLOGY
Our proposed disentangled speech representation learning framework, shown in Figure 1, is composed of three sub-networks: (i) multiple speech representation encoders which encode speech into different speech representations characterizing content, timbre, rhythm and pitch; (ii) an adversarial MAP network that is trained to capture the correlations between the different speech representations based on mask-and-predict operations; (iii) an auxiliary word prediction network which predicts a binary word existence vector indicating whether the content representation contains the corresponding vocabulary words. Finally, a decoder is employed to synthesize speech from these disentangled speech representations.
Three encoders in SpeechFlow [17] are fine-tuned to extract the rhythm, pitch and content representations from the reference speech at frame level. One-hot speaker labels (IDs) are embedded at utterance level and used as the timbre representations.
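How the four representations could be assembled is sketched below in PyTorch; the encoder interfaces, input features and embedding size are illustrative assumptions rather than the exact SpeechFlow implementation.

```python
import torch
import torch.nn as nn

class RepresentationExtractor(nn.Module):
    """Wraps the three frame-level encoders and the utterance-level timbre embedding."""
    def __init__(self, rhythm_enc, content_enc, pitch_enc, n_speakers, d_timbre=82):
        super().__init__()
        self.rhythm_enc = rhythm_enc      # fine-tuned SpeechFlow rhythm encoder
        self.content_enc = content_enc    # fine-tuned SpeechFlow content encoder
        self.pitch_enc = pitch_enc        # fine-tuned SpeechFlow pitch encoder
        self.spk_embedding = nn.Embedding(n_speakers, d_timbre)  # one-hot speaker ID -> timbre

    def forward(self, mel, pitch_contour, speaker_id):
        z_r = self.rhythm_enc(mel)            # (B, T, d_r) rhythm
        z_c = self.content_enc(mel)           # (B, T, d_c) content
        z_f = self.pitch_enc(pitch_contour)   # (B, T, d_f) pitch
        z_u = self.spk_embedding(speaker_id)  # (B, d_u)    timbre, utterance level
        # Broadcast the utterance-level timbre vector over time so that all four
        # representations can be concatenated along the feature axis.
        z_u = z_u.unsqueeze(1).expand(-1, z_c.size(1), -1)
        return torch.cat([z_r, z_c, z_f, z_u], dim=-1)
```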
An adversarial MAP network inspired by BERT [19] is designed to explicitly disentangle the extracted speech representations. During training, one of the four speech representations is randomly masked and the adversarial network infers the masked representation from the other representations. The adversarial network is composed of a gradient reversal layer [20] and a stack of prediction head layers [21], which have also been used in masked acoustic modeling. Each prediction head layer is composed of a fully-connected layer, GeLU activation [22], layer normalization [23] and another fully-connected layer, as illustrated in Figure 1(b). The gradient of the adversarial network is reversed by the gradient reversal layer [20] before being back-propagated to the speech representation encoders. The adversarial loss is measured as demonstrated in the following equations:

Z = (Z_r, Z_c, Z_f, Z_u)  (1)

M ∈ {(0, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 0)}  (2)

L_adversarial = ‖(1 − M) ⊙ (Z − MAP(M ⊙ Z))‖  (3)

where ⊙ is the element-wise product, L_adversarial is the adversarial loss, Z is the concatenation of Z_r, Z_c, Z_f and Z_u denoting the rhythm, content, pitch and timbre representations respectively, and M is a randomly selected binary mask whose value is 0 wherever a representation is dropped and 1 for the unmasked representations.

The MAP network is trained to predict the masked representation as accurately as possible by minimizing the adversarial loss, while in the backward propagation the gradient is reversed, which encourages the representations learned by the encoders to contain as little mutual information about each other as possible.

To avoid the content information being encoded into other representations, an auxiliary word prediction network is designed to predict the existence of each vocabulary word from the content representation. The word prediction network is a stack of prediction head layers which produces a binary vocabulary-size vector, where each dimension indicates whether the corresponding vocabulary word exists in the sentence. The word existence vector is denoted as V_word = [v_1, v_2, ..., v_n], where v_i = 1 if word i is in the speech and v_i = 0 otherwise. A cross entropy loss is applied to force the content prediction to be as accurate as possible:

L_word = −(1/n) Σ_{i=1}^{n} [ v_i log v'_i + (1 − v_i) log(1 − v'_i) ]  (4)

where v'_i is the predicted existence indicator of word i (its predicted probability of being present) and n is the size of the vocabulary. This loss is designed to make the content representation more informative and to avoid content information leaking into other representations. A similar content-preservation strategy has been used in voice conversion and text-to-speech systems and has been proved effective in boosting performance [24, 25].

The decoder in SpeechFlow [17] is employed to generate the mel spectrogram from the disentangled speech representations. During training, the four speech representations are extracted from the same utterance and the decoder is trained to reconstruct the mel spectrogram with the following loss:

L_reconstruct = ‖S − Ŝ‖  (5)

where S and Ŝ are the mel spectrograms of the input and reconstructed speech respectively. The entire model is trained with the total loss

Loss = α · L_adversarial + β · L_word + γ · L_reconstruct  (6)

where α, β and γ are the loss weights for the adversarial loss, word prediction loss and reconstruction loss respectively. To improve the robustness of the proposed framework, the weight of the reconstruction loss is designed to decay exponentially.
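The following PyTorch sketch illustrates the gradient reversal layer and the mask-and-predict objective of Eqs. (1)–(3); the number of prediction head layers, the dimension bookkeeping and the use of an L1-style distance are illustrative assumptions, not the exact implementation.

```python
import random
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class PredictionHead(nn.Module):
    """Fully-connected -> GeLU -> LayerNorm -> fully-connected, as in Fig. 1(b)."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.GELU(),
                                 nn.LayerNorm(d), nn.Linear(d, d))

    def forward(self, x):
        return self.net(x)

class MAPNetwork(nn.Module):
    def __init__(self, dims, n_layers=3):
        super().__init__()
        self.dims = dims  # (d_r, d_c, d_f, d_u), the order of the blocks in Z
        self.heads = nn.Sequential(*[PredictionHead(sum(dims)) for _ in range(n_layers)])

    def forward(self, z):
        # z: (B, T, d_r + d_c + d_f + d_u), the concatenated representations Z.
        # Reverse the gradient so the encoders learn to *remove* whatever the
        # MAP network manages to predict.
        z = GradReverse.apply(z, 1.0)
        # Eq. (2): build a block mask M that zeroes one randomly chosen
        # representation (one choice per batch here, for simplicity).
        mask = torch.ones_like(z)
        start, dropped = 0, random.randrange(len(self.dims))
        for i, d in enumerate(self.dims):
            if i == dropped:
                mask[..., start:start + d] = 0.0
            start += d
        # Eq. (3): predict the masked block from the unmasked ones and score
        # only the masked entries (an L1-style distance is assumed here).
        pred = self.heads(mask * z)
        return torch.mean(torch.abs((1.0 - mask) * (z - pred)))
```

Because the gradient reversal layer only flips the gradient flowing back into Z, the prediction heads are still optimized to minimize this loss, while the upstream encoders receive the negated gradient and are pushed to make the masked block unpredictable; in the full model this term is combined with the word prediction and reconstruction losses as in Eq. (6).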
3. EXPERIMENT

3.1. Training setup
The experiments are performed on the CSTR VCTK corpus, which contains audio data produced by 109 English speakers. We randomly select a subset of 10 female and 10 male speakers. After pre-processing, the corpus for the experiments contains 6471 sentences in total, with 5176 sentences for training, 647 for validation and 285 for testing.

All audio is down-sampled to 16000 Hz. Mel spectrograms are computed through a short-time Fourier transform (STFT) using a 50 ms frame size, a 12.5 ms frame hop and a Hann window function. We transform the STFT magnitude to the mel scale using an 80-channel mel filterbank spanning 125 Hz to 7.6 kHz, followed by log dynamic range compression. The filterbank output magnitudes are clipped to a minimum value of 0.01. The weights of the adversarial loss and the word prediction loss are fixed throughout training, while the weight of the reconstruction loss γ starts from an initial value and decays by a factor of 0.9 every 200,000 steps. We train a vanilla SpeechFlow [17] as the baseline approach on the same training and validation sets.

All neural networks used in the experiments are implemented on top of an open-source PyTorch implementation of SpeechFlow [17]. We train all models with a batch size of 16 for 500,000 steps using the ADAM optimizer with a fixed learning rate on an NVIDIA 2080 Ti GPU. A WaveNet vocoder pretrained on the VCTK corpus [26] is used to synthesize the audio from the spectrograms. The demo is available at https://thuhcsi.github.io/icassp2021-multi-factor-vc/.

Mel-cepstral distortion (MCD) is calculated on a subset of the testing set which consists of 300 parallel conversion pairs over 155 sentences, including inter-gender and intra-gender conversions. The audios in the test set are perceptually distinct in pitch and rhythm. MCD is defined here as the Euclidean distance between the predicted mel spectrogram and that of the target speech. The MCD comparison is shown in Table 1: the proposed voice conversion system outperforms the baseline, decreasing the MCD from 4.00 to 3.94.
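A minimal sketch of the feature extraction and of the spectral distance described above, using librosa; the exact normalization of the original implementation may differ, and the distance assumes time-aligned parallel utterances (no DTW).

```python
import librosa
import numpy as np

def mel_spectrogram(wav_path):
    y, _ = librosa.load(wav_path, sr=16000)             # down-sample to 16 kHz
    stft = librosa.stft(y, n_fft=800, hop_length=200,   # 50 ms frame, 12.5 ms hop
                        win_length=800, window="hann")
    mel_fb = librosa.filters.mel(sr=16000, n_fft=800, n_mels=80,
                                 fmin=125, fmax=7600)   # 80-channel filterbank
    mel = np.maximum(mel_fb @ np.abs(stft), 0.01)       # clip to a floor of 0.01
    return np.log(mel)                                  # log dynamic range compression

def mel_distance(pred_mel, target_mel):
    """Mean per-frame Euclidean distance between two (80, T) mel spectrograms."""
    T = min(pred_mel.shape[1], target_mel.shape[1])
    return float(np.mean(np.linalg.norm(pred_mel[:, :T] - target_mel[:, :T], axis=0)))
```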
Table 1. MCD comparison between different approaches.

        Baseline   Proposed
MCD     4.00       3.94
Table 2. ABX comparison between the proposed and baseline approaches. PR refers to preference rate.

        Baseline   Proposed   Neutral
PR      20.6%      51.8%      27.6%
We perform an ABX test on 20 utterances selected from the testing set in terms of the similarity between the converted and reference speech when different factors of speech are converted. The listeners are presented with the target utterance and told which factors are converted, and are asked to select the most similar speech from the ones synthesized by the different systems, presented in random order. As shown in Table 2, our proposed model outperforms the baseline by 31.2% on average, which means that when converting the same aspects, the proposed framework endows the voice conversion system with stronger disentanglement and conversion ability. It also improves interpretability, as the results show distinctly better conversion.

We conduct another subjective evaluation to measure the conversion rate of the different approaches. The listeners are presented with the source and target utterances in random order together with a converted speech. For each speech factor converted in the synthesized speech, the listeners are asked to choose whether the converted speech is more similar to the source or the target utterance, so the conversion rates of the different speech factors are evaluated independently and do not influence each other. The conversion rate is defined as the percentage of answers that choose the target utterance [17]. As shown in Table 3, our proposed model outperforms the baseline in most conversion conditions, which indicates highly controllable voice conversion.

Table 3. Conversion rate comparison between the proposed and baseline approaches.
                  Baseline   Proposed
Conversion rate   48.2%      57.1%
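A minimal sketch of how the conversion rate is tallied from listener responses, following the definition above; the response format is an illustrative assumption.

```python
def conversion_rate(answers):
    """answers: 'target' / 'source' choices collected for one converted factor."""
    return 100.0 * sum(a == "target" for a in answers) / len(answers)

# Example: 3 of 5 listeners judged the converted speech closer to the target.
print(conversion_rate(["target", "target", "target", "source", "source"]))  # 60.0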
Fig. 2. Reconstructed mel spectrograms of the sentence "I must do something about it." from the baseline system when one component is removed: (a) content removed; (b) rhythm removed; (c) pitch removed; (d) timbre removed.
To further show the disentanglement performance of our proposed framework, we generate mel spectrograms with each of the four speech factors removed by setting the corresponding input to zero [17], as shown in Figure 2 and Figure 3. Figure 2 shows the reconstructed mel spectrograms of the baseline system and Figure 3 shows those of the proposed system. Take content removal as an example, as shown in Figure 2(a) and 3(a): after the content information is removed, the spectrogram of the proposed system contains more uninformative blanks. It can be observed that the proposed system removes the content information more thoroughly than the baseline, which means that in the proposed system the amount of content information leaking into the other encoders is smaller than in the baseline system. The pitch information is also preserved better in the proposed system, as its contour is less flat than that of the baseline approach, as annotated in Figures 2 and 3. When the rhythm is removed, the reconstructed mel spectrograms of both systems are blank, except for a bright line in Figure 2(b) indicating that partial rhythm information is encoded by the other encoders of the baseline. When the pitch is removed, the pitch contour of the reconstructed speech generated by the proposed system retains its curve but is flatter than that of the baseline. When the timbre is removed, the formant positions shift in both systems, indicating that the speaker identity changes. When one of the four speech factors is set to zero, the proposed system not only removes the corresponding information more thoroughly but also keeps the other information undamaged, which shows that the proposed system achieves better disentanglement.
Fig. 3. Reconstructed mel spectrograms of the sentence "I must do something about it." from the proposed system when one component is removed: (a) content removed; (b) rhythm removed; (c) pitch removed; (d) timbre removed.
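The factor-removal probe itself is simple to reproduce; a minimal sketch is given below, where the decoder interface is an illustrative assumption.

```python
import torch

@torch.no_grad()
def reconstruct_without(decoder, z_r, z_c, z_f, z_u, removed="content"):
    """Zero one representation before decoding, following the probe in [17]."""
    reps = {"rhythm": z_r, "content": z_c, "pitch": z_f, "timbre": z_u}
    reps[removed] = torch.zeros_like(reps[removed])  # remove one factor
    return decoder(reps["rhythm"], reps["content"], reps["pitch"], reps["timbre"])
```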
Ablation studies are conducted to validate the effectiveness of the word prediction network. For this investigation, we train the proposed model without the word prediction network. As shown in Table 4, the reconstruction loss decreases from 21.5 to 12.8 and the adversarial loss decreases from 0.016 to 0.015 on the training set after the word prediction network is applied.
Table 4. Effect of the word prediction network on the reconstruction loss and adversarial loss.

                      Proposed with       Proposed without
                      word prediction     word prediction
Reconstruction loss   12.8                21.5
Adversarial loss      0.015               0.016
4. CONCLUSION
In order to increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. We extract four speech representations characterizing content, timbre, rhythm and pitch, and employ an adversarial network inspired by BERT to further disentangle the speech representations. We also employ a word prediction network to learn a more informative content representation. Experimental results show that the proposed speech representation learning framework significantly improves the robustness of VC on multiple factors. Different masking strategies will be explored in future work.

5. REFERENCES

[1] Jing-Xuan Zhang, Zhen-Hua Ling, Yuan Jiang, Li-Juan Liu, Chen Liang, and Li-Rong Dai, "Improving sequence-to-sequence voice conversion by adding text-supervision," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6785–6789.
[2] Zhen Zeng, Jianzong Wang, Ning Cheng, and Jing Xiao, "Prosody learning mechanism for speech synthesis system without text length limit," arXiv preprint arXiv:2008.05656, 2020.
[3] Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, and Oldřich Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[4] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP 2014. IEEE, 2014, pp. 4052–4056.
[5] M. W. Mak, "Lecture notes on factor analysis and i-vectors," Dept. Electron. Inf. Eng., The Hong Kong Polytechnic Univ., Hong Kong, 2016.
[6] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP 2018. IEEE, 2018, pp. 5329–5333.
[7] Moataz El Ayadi, Mohamed S. Kamel, and Fakhri Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
[8] Lantian Li, Dong Wang, Yixiang Chen, Ying Shi, Zhiyuan Tang, and Thomas Fang Zheng, "Deep factorization for speech signal," in ICASSP 2018. IEEE, 2018, pp. 5094–5098.
[9] Dongsuk Yook, Seong-Gyun Leem, Keonnyeong Lee, and In-Chul Yoo, "Many-to-many voice conversion using cycle-consistent variational autoencoder with multiple decoders," in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 215–221.
[10] Wen-Chin Huang, Hao Luo, Hsin-Te Hwang, Chen-Chou Lo, Yu-Huai Peng, Yu Tsao, and Hsin-Min Wang, "Unsupervised representation disentanglement using cross domain features and adversarial learning in variational autoencoder based voice conversion," IEEE Transactions on Emerging Topics in Computational Intelligence, 2020.
[11] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," IEEE, 2016, pp. 1–6.
[12] Wen-Chin Huang, Hsin-Te Hwang, Yu-Huai Peng, Yu Tsao, and Hsin-Min Wang, "Voice conversion based on cross-domain features using variational auto encoders," IEEE, 2018, pp. 51–55.
[13] Mohamed Elgaar, Jungbae Park, and Sang Wan Lee, "Multi-speaker and multi-domain emotional voice conversion using factorized hierarchical variational autoencoder," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7769–7773.
[14] Da-Yi Wu and Hung-yi Lee, "One-shot voice conversion by vector quantization," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7734–7738.
[15] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," arXiv preprint arXiv:1804.02812, 2018.
[16] Orhan Ocal, Oguz H. Elibol, Gokce Keskin, Cory Stephenson, Anil Thomas, and Kannan Ramchandran, "Adversarially trained autoencoders for parallel-data-free voice conversion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2777–2781.
[17] Kaizhi Qian, Yang Zhang, Shiyu Chang, David Cox, and Mark Hasegawa-Johnson, "Unsupervised speech decomposition via triple information bottleneck," arXiv preprint arXiv:2004.11284, 2020.
[18] Adam Polyak and Lior Wolf, "Attention-based WaveNet autoencoder for universal voice conversion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6800–6804.
[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[20] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[21] Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6419–6423.
[22] Dan Hendrycks and Kevin Gimpel, "Gaussian error linear units (GELUs)," arXiv preprint arXiv:1606.08415, 2016.
[23] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[24] Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, and Ye Jia, "Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation," arXiv preprint arXiv:1904.04169, 2019.
[25] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo, "AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6805–6809.
[26] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," arXiv preprint arXiv:1905.05879, 2019.