EmoCat: Language-agnostic Emotional Voice Conversion
Bastian Schnell‡∗, Goeric Huybrechts†, Bartek Perz†, Thomas Drugman†, Jaime Lorenzo-Trueba†

‡ Idiap Research Institute, Martigny, Switzerland
† Amazon, TTS Research, Cambridge, United Kingdom
[email protected], {huybrech,perzbart,drugman,truebaj}@amazon.com

ABSTRACT
Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and allow generating large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with less than 45 minutes of German emotional recordings by exploiting large amounts of emotional data in US English. EmoCat is an encoder-decoder model based on CopyCat, a voice conversion system which transfers prosody. We use adversarial training to remove emotion leakage from the encoder to the decoder. The adversarial training is improved by a novel contribution to gradient reversal that truly reverses gradients. This allows removing only the leaking information and converging to better optima with higher conversion performance. Evaluations show that EmoCat can convert to different emotions but falls short of the emotion intensity of the recordings, especially for very expressive emotions. EmoCat achieves audio quality on par with the recordings for five out of six tested emotion intensities.

Index Terms — Voice Conversion, Emotional Speech, Speech Synthesis, Expressive TTS, Text-to-Speech
1. INTRODUCTION
Neural Text-to-Speech (TTS) has greatly supported the advent of artificial voice assistants like Amazon Alexa, Google Assistant, or Siri. These systems are trained on tens of hours of data [1] and produce high-quality speech with close to perfect intelligibility [2]. However, their speech is mostly neutral, which prevents natural conversations and closer bonds with the user. Creating voices in more expressive speaking styles usually requires recording similarly large amounts of speech for the desired style. This is very time-consuming and costly. An alternative is the generation of synthetic data to satisfy the high data needs. The conversion of speech is generally assumed to be easier than TTS and thus has lower data needs.

Emotional voice conversion (EVC) is a subfield of voice conversion (VC) which studies the transformation of a source audio signal into a different emotion while maintaining its linguistic content and speaker identity. Techniques applied in EVC are similar to VC and differ mostly in their feature selection [3, 4]. EVC techniques working without hand-crafted features are applicable to other speaking styles as well. EVC is also applied to other tasks like film dubbing.

In this work, we aim to convert neutral to emotional speech in German. As we have only a limited amount of emotional German data available, we exploit emotional recordings in US English. We propose EmoCat, a language-agnostic EVC model trained jointly on German and US English which works directly on mel-spectrograms. Compared to other works we use mel-spectrograms to leverage our high-quality universal vocoder [5] and keep a high bar on segmental quality. Our model adapts the CopyCat model [6] (which is based on AutoVC [7]) for intra-speaker emotion conversion. CopyCat is a VC model which converts the speech of unseen speakers to a set of target speakers. In contrast to the global speaker identity, emotion is a continuous component of speech. We use adversarial training to explicitly remove emotion leakage from the encoder, which encodes the neutral source spectrogram, to the decoder, which generates the converted emotional spectrogram. We propose a novel improvement to gradient reversal [8] to stabilise its gradients. We further investigate fine-tuning to improve naturalness. In an ablation study, we assess the effectiveness of each of these techniques. The proposed model is able to convert neutral German to two different emotions in three intensities with the support of less than 45 minutes of German emotional data. To the best of our knowledge, no prior work exists on EVC with multi-lingual data or mel-spectrograms.

∗ Work performed while an intern at Amazon. Audio samples will be released after paper acceptance.
2. RELATED WORK
Emotional voice conversion methods are generally split into two categories: those using parallel and those using non-parallel training data. In the parallel data scenario, the database contains the same utterance spoken by the same speaker in the different target emotions. In the non-parallel data scenario, the utterances for each emotion differ, meaning that the content can better match the emotion. This allows a wider variety of utterances and also simplifies acting for the voice talents. Without parallel data, a model cannot be trained to do the conversion directly, as the ground-truth target is not available, and the training can only be guided in an unsupervised way. Generative adversarial networks (GANs) and cycle consistency losses are commonly used techniques here [9, 4, 10].

In [9] an encoder-decoder structure with a content and a style encoder is used to convert mel-cepstra (MCEP) extracted by WORLD [11]. The model is trained with three losses. First, the cepstrogram is auto-encoded and an L1 reconstruction loss applied. Second, a semi-cycle consistency L1 loss forces the encoder embeddings to match before and after conversion. Third, a GAN loss tries to discriminate generated from recorded samples. F0 is converted by a linear transform to match the statistics of the target emotion domain. The band aperiodicities remain unchanged.

StarGAN is used in [4] on WORLD features with a reconstruction loss, an L1 cycle consistency loss, and a real/fake GAN loss. The model architecture is the same as in the original StarGAN-VC paper [3]. An emotion recognition model (a variant of [12]) was trained with the generated samples and evaluations show that its accuracy improved.

Fig. 1. Structure of the encoder-decoder EmoCat model with a gradient inverter block followed by an emotion classifier to remove emotion information in the bottleneck embeddings. The plus sign denotes a concatenation.

CycleGAN has also been used for emotion conversion [10]. It is trained with three losses: 1) a reconstruction loss, 2) a cycle-consistency loss on a sample converted to another emotion and then back to the source emotion, and 3) the GAN loss for real/fake discrimination. The experiments show that separate CycleGANs for F0 and MCEP outperform a joint model. [13] follows a very similar approach but uses an additional emotion classification loss and no reconstruction loss.

A different approach is the variational auto-encoding Wasserstein GAN (VAW-GAN) for emotion conversion [14] (originally proposed for VC in [15]). It consists of a variational auto-encoder (VAE) structure where the decoder is conditioned on an emotion embedding. The latent dimension is chosen to be small enough so that it will not contain emotion information. The model is trained with a reconstruction loss, the standard Kullback-Leibler (KL) divergence on the VAE latent space, and a Wasserstein GAN loss. Instead of using a binary cross-entropy loss for the real/fake prediction of the discriminator, the Wasserstein distance is used.

Our approach is closest to the VAW-GAN in [14] as it employs a similar encoder-decoder structure with a VAE encoder. However, our bottleneck is temporal and drastically smaller, and we condition the decoder on the linguistic content. In contrast to all related work above, we operate on mel-spectrograms and train with multi-lingual data.
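For concreteness, the non-parallel EVC recipes above share a recurring objective built from reconstruction, cycle-consistency, and adversarial terms. The following minimal PyTorch sketch illustrates that composition; the function names, the loss weighting, and the interfaces of `G` and `D` are illustrative assumptions, not the implementation of any of the cited systems.

```python
import torch
import torch.nn.functional as F

def cyclegan_evc_losses(G, D, x_src, src_emo, tgt_emo, lambda_cyc=10.0):
    """Illustrative composition of the three losses recurring in non-parallel EVC.
    G(x, emo) converts features x towards an emotion; D(x) outputs realness logits."""
    # 1) Reconstruction: converting to the source emotion should act as an identity.
    loss_rec = F.l1_loss(G(x_src, src_emo), x_src)

    # 2) Cycle consistency: convert to the target emotion and back to the source.
    x_conv = G(x_src, tgt_emo)
    loss_cyc = F.l1_loss(G(x_conv, src_emo), x_src)

    # 3) Adversarial: the converted sample should fool the real/fake discriminator.
    logits = D(x_conv)
    loss_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    return loss_rec + lambda_cyc * loss_cyc + loss_adv
```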
3. MODEL DESCRIPTION
In this section we introduce EmoCat, a language-agnostic intra-speaker emotion conversion model. It aims to convert neutral speech to emotional speech of the same speaker (we have informally verified that it also allows conversion between emotions, but this lies outside the main scope of this paper). EmoCat is based on CopyCat [6] and inherits its structure and hyper-parameters except for four differences:

1. It uses 64-dim emotion embeddings instead of 128-dim speaker embeddings (see Section 3.1).
2. It uses a gradient inverter block to remove emotion leakage from the bottleneck embeddings (see Section 3.2).
3. It operates on multi-lingual data (see Section 4.1).
4. It does not pass the phoneme embeddings to the VAE reference encoder.

Figure 1 shows the network structure. The VAE reference encoder encodes the mel-spectrogram together with its utterance-level emotion embedding. A dimensional and temporal bottleneck is applied by only selecting every N-th frame [7]. Each selected frame is copied N times to restore the sequence length (see the sketch below). The bottleneck embeddings should contain as much information as possible to generate high-quality speech, but no emotion information. This is ensured by passing them through the gradient inverter to the emotion classifier, which removes any leaking emotion information. Force-aligned upsampled phonemes (procedure described in [16]) are encoded by the phoneme encoder to produce phoneme embeddings. During inference, the bottleneck and phoneme embeddings are stacked with the target-emotion embedding centroid and consumed by a parallel decoder to produce the converted mel-spectrograms. During training, the oracle utterance-level emotion embedding is used on the encoder and decoder side, and source and target spectrograms are the same. The parallel decoder consists of a stack of three convolutional layers followed by a uni-directional long short-term memory (LSTM). The model is trained with an L1 reconstruction loss and the KL loss on the VAE latent space. For the detailed architecture, please refer to the original CopyCat paper [6].
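The bottleneck mechanism can be sketched in a few lines of PyTorch: downsample the encoder output by keeping every N-th frame and restore its length by frame repetition. The function name, tensor shapes, and the downsampling factor used here are illustrative assumptions, not the exact values used in EmoCat.

```python
import torch

def temporal_bottleneck(z, n=4):
    """Keep every n-th frame of the encoder output and repeat each kept frame
    n times so the original sequence length is restored before decoding.
    z: encoder output of shape (batch, time, dim)."""
    kept = z[:, ::n, :]                          # select every n-th frame
    restored = kept.repeat_interleave(n, dim=1)  # copy each kept frame n times
    return restored[:, : z.size(1), :]           # trim back to the original length

x = torch.randn(2, 100, 64)
print(temporal_bottleneck(x, n=4).shape)  # torch.Size([2, 100, 64])
```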
3.1. Emotion embeddings

During training, utterance-level emotion embeddings are fed to the VAE reference encoder and the parallel decoder. The emotion embeddings need to be organised language-independently by their style and other latent information to be beneficial to the model. This excludes simple per-class emotion embeddings and suggests a learnable approach. We therefore obtain them from a Tacotron-like TTS model [17] with the addition of two VAE reference encoders [18]. One reference encoder captures the speaker information while the other captures the emotion. We use intercross training [19] to guide each encoder to encode only the speaker/emotion information and to be language-independent. We use the predicted embeddings from the emotion reference encoder as utterance-level emotion embeddings for the EmoCat training. We could learn the emotion embeddings in a similar fashion on-the-fly within the EmoCat model, but this would increase its training time, which is not desirable during research. We could also obtain them from a simple emotion recognition model, but we hypothesised that such embeddings might be more suited for recognition than for generation.

For the CopyCat model, robust speaker embeddings from a pre-trained speaker identification system are necessary, because the model also has to convert from unseen speakers. This is not the case for the EmoCat model, which only converts between seen emotions, so it requires less sophisticated emotion embeddings.

During inference, the utterance-level emotion embedding of the converted spectrogram is unknown. Instead we compute the centroid for each emotion over all emotion embeddings extracted from the training set and feed it to the decoder. The VAE reference encoder still uses the utterance-level emotion embedding of the input audio.
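A minimal sketch of the inference-time centroid computation is given below; the array shapes, label names, and usage are hypothetical, only the 64-dim embedding size comes from the text above.

```python
import numpy as np

def emotion_centroids(embeddings, labels):
    """Average the utterance-level emotion embeddings of the training set per
    emotion class; the centroid then conditions the decoder at inference time.
    embeddings: (num_utterances, emb_dim) array; labels: list of emotion names."""
    centroids = {}
    for emotion in set(labels):
        mask = np.array([l == emotion for l in labels])
        centroids[emotion] = embeddings[mask].mean(axis=0)
    return centroids

# Hypothetical usage with 64-dim embeddings.
emb = np.random.randn(1000, 64)
lab = np.random.choice(["neutral", "excited", "disappointed"], size=1000).tolist()
target_embedding = emotion_centroids(emb, lab)["excited"]
```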
3.2. Gradient inverter

As emotion is a continuous and integral part of speech, it is necessary to explicitly prevent it from leaking from the encoder to the decoder side. On a pre-trained EmoCat with frozen weights we trained independent gated recurrent unit (GRU) emotion classifiers to predict the source emotion from the bottleneck embeddings; the best one achieved 64% overall accuracy. We found that heavy leakage resulted in low emotion intensity during conversion. Decreasing the bottleneck (as described in AutoVC [7]) led to heavy degradation in signal quality and intelligibility. With the reconstruction loss alone, we could not force the bottleneck embeddings to discard the undesired emotion information while keeping the information needed for high signal quality.

Instead we used a gradient reversal block before the emotion classifier during training to actively remove emotion leakage from the bottleneck embeddings. The idea of gradient reversal is to reverse the gradients during back-propagation to remove any activation in the input that helps the following classifier. Gradient reversal achieves this by swapping the sign of the gradient $\Delta$ (Equation 1). It also applies a weight $\lambda$ to control the impact of the gradient on the preceding layers. The choice of this weight greatly influences the performance of the final model.

$\Delta' = -\lambda \Delta$  (1)

We experimented with a feed-forward and a GRU-based emotion classifier. Interestingly, EmoCat converged to a better model in terms of conversion ability with the feed-forward classifier than with the GRU one. This suggests that with gradient reversal even a weak classifier gives sufficient gradients to lead to a better convergence point.

We again trained the same emotion classifier as above on the bottleneck embeddings of the model with gradient reversal. The classifier mainly predicted the majority class (95% of the time), showing that most of the emotion leakage was removed. Informal listening verified that the conversion ability of the model improved.

We argue that a simple swap of the sign (Equation 1) fulfils only half of the reversal purpose. Consider the following two scenarios:

1. Imagine there is no leakage in the input. As the classifier cannot rely on any information in the input, its prediction is random and the cross-entropy loss on its predictions is high. Thus the back-propagated gradients are large as well. Even though there is no leakage, the preceding network receives a large reversed gradient.
2. Imagine there is significant leakage in the input and the classifier is already properly trained. Then its prediction is good, the cross-entropy loss is low, and the back-propagated gradients are small. Even though there is significant leakage, the preceding network receives only a small reversed gradient.

The desired effect on the preceding network in both scenarios should be swapped: without any leakage the received gradients should be small, while with significant leakage the gradients should be large. To address this issue, we present the gradient inverter block. Instead of only swapping the sign of the gradient, it performs a proper inversion by also converting small gradients to large ones and vice versa. We have experimented with two gradient inverter functions:

$\Delta' = -\lambda \dfrac{\Delta}{\lVert\Delta\rVert^2}$  (inverse square norm)  (2)

$\Delta' = -\lambda \dfrac{\Delta}{\exp(\lVert\Delta\rVert^2)}$  (inverse exp square norm)  (3)

Equation 2 implements directly what we want to achieve by scaling the gradient by its squared norm: gradients with a norm smaller than one become larger than one and vice versa. However, it might lead to unstable behaviour, as gradients with a norm close to zero are scaled towards infinity. Equation 3 prevents this because its denominator is never smaller than one, so the scaling factor is bounded by one. In this variant, gradients with a small norm remain almost unchanged while large gradients are quickly faded out. We found that, depending on the target emotion, one of the proposed inverter functions performs better than the other.
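To make the mechanism concrete, here is a minimal PyTorch sketch of a gradient reversal/inverter layer implementing Equations 1-3. This is an illustrative reimplementation, not the authors' code; in particular, computing the norm over the whole gradient tensor and the small clamping constant are assumptions.

```python
import torch

class GradientInverter(torch.autograd.Function):
    """Identity in the forward pass; sign-flipped (and optionally norm-rescaled)
    gradient in the backward pass, following Equations 1-3."""

    @staticmethod
    def forward(ctx, x, lam=1.0, mode="inv_exp_sq_norm"):
        ctx.lam, ctx.mode = lam, mode
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        sq_norm = grad.pow(2).sum()  # squared norm over the whole gradient (assumption)
        if ctx.mode == "reversal":          # Eq. (1): plain gradient reversal
            out = -ctx.lam * grad
        elif ctx.mode == "inv_sq_norm":     # Eq. (2): inverse square norm
            out = -ctx.lam * grad / sq_norm.clamp_min(1e-8)
        else:                               # Eq. (3): inverse exp square norm
            out = -ctx.lam * grad / torch.exp(sq_norm)
        return out, None, None              # no gradients w.r.t. lam and mode

# Usage: the bottleneck embeddings pass through the inverter before the adversarial
# emotion classifier, so the encoder receives the inverted gradient.
z = torch.randn(8, 25, 64, requires_grad=True)   # hypothetical bottleneck embeddings
z_adv = GradientInverter.apply(z, 1.0, "inv_exp_sq_norm")
```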
3.3. Fine-tuning

While the EmoCat model with the proposed gradient inverter achieved high emotion intensities, its signal quality left room for improvement. We therefore investigated fine-tuning on a subset of the training data. First the model was trained with all data until convergence. Then we continued training on the emotional data together with a similar amount of neutral data. This should compensate for the averaging effect in the decoder introduced by the huge amount of neutral training data. We did not change any hyper-parameters, learning rates, or losses compared to the first training step. This approach outperformed a GAN-like loss (the same as used for CopyCat [6]), which strives for the generated spectrogram to be indistinguishable from the recordings.
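The data selection for this fine-tuning step could look roughly as follows. The utterance schema and the speaker label are hypothetical, and the exact balancing used in the paper is not specified beyond "a similar amount of neutral data"; this is only a sketch under those assumptions.

```python
import random

def finetune_subset(utterances, target_speaker="de_female", seed=0):
    """Keep all emotional utterances of the target speaker and add a comparable
    amount of her neutral utterances; training then continues with unchanged
    hyper-parameters. 'utterances' is a list of dicts with 'speaker' and
    'emotion' keys (hypothetical schema)."""
    random.seed(seed)
    emotional = [u for u in utterances
                 if u["speaker"] == target_speaker and u["emotion"] != "neutral"]
    neutral = [u for u in utterances
               if u["speaker"] == target_speaker and u["emotion"] == "neutral"]
    emotions = {u["emotion"] for u in emotional}
    per_emotion = len(emotional) // max(1, len(emotions))  # rough per-emotion amount
    return emotional + random.sample(neutral, min(per_emotion, len(neutral)))
```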
4. EXPERIMENTS
We aim at generating emotional German samples by converting from neutral speech, using a model trained with a limited amount of emotional German data. We focus on two emotions, excited and disappointed, each in three intensities: low, medium, and high.
4.1. Data

We use two internal databases. For German, we use more than 20 h of neutral and 45 min of emotional single-speaker recordings of a female voice. 20 neutral samples are set aside as the test set. We do not use a development set to guide the training because the L1 reconstruction loss does not match human perception. The 45 min of emotional data are split equally into excited and disappointed; 25% is low, 50% medium, and 25% high intensity. Excluding the test set, we have around 5 min for the most challenging intensity, high. As we do not have access to more emotional German data, we use recordings of a female US English voice as a supporting speaker. From this speaker, we use more than 20 h of neutral and more than 10 h of emotional recordings of the same emotion categories. We found that including US English data greatly improved the conversion abilities of our model, despite the differences in language. All recordings are sampled at 24 kHz. We trim all silences to a maximum of 100 ms and extract 80-dim mel-spectrograms. We use fully disjoint phoneme sets for English and German, thus the speaker identity can be directly inferred and explicit speaker embeddings are unnecessary.
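A possible preprocessing routine matching this description is sketched below. The librosa-based silence detection, the top_db threshold, and the STFT settings are assumptions; only the 24 kHz sampling rate, the 100 ms silence cap, and the 80 mel bands come from the text above.

```python
import librosa
import numpy as np

def preprocess(path, sr=24000, n_mels=80, max_sil_ms=100, top_db=40):
    """Load 24 kHz audio, shorten every silence to at most 100 ms, and extract an
    80-dim log-mel-spectrogram (illustrative STFT settings)."""
    y, _ = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)  # non-silent [start, end] samples
    max_sil = int(sr * max_sil_ms / 1000)
    pieces, prev_end = [], 0
    for start, end in intervals:
        gap = min(start - prev_end, max_sil)   # keep at most 100 ms of preceding silence
        pieces.append(y[start - gap:end])
        prev_end = end
    y = np.concatenate(pieces) if pieces else y
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256,
                                         n_mels=n_mels)
    return np.log(mel + 1e-5)  # shape (80, frames)
```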
We conduct an ablation study across three models, listed below. Each is trained for 100k steps on the combined two databases. The mel-spectrograms are synthesised with our universal vocoder [5].
1. Grad. reversal - This model uses the vanilla gradient reversal block (Equation 1) to remove leaking emotion information. In contrast to the following two models, we used a weighted cross-entropy loss for the adversarial emotion classifier to compensate for the huge class imbalance in the training data. We chose the weights inversely proportional to the amount of each emotion in the total training data. We found that this improved the grad. reversal model.
2. Grad. inverter - This model replaces the gradient reversal block of model 1 with the improved gradient inverter block (see Section 3.2). We use two separate models for the conversion. The model converting to the three excited intensities uses the inverse exp square norm function (Equation 3), while the one converting to disappointed uses the inverse square norm function (Equation 2). This was selected based on a clear performance difference in informal listening.
3. Fine-tuning - This is model 2 fine-tuned for 2k steps as described in Section 3.3. The best results were obtained by fine-tuning on the emotional data of the target speaker together with a similar amount of neutral data per emotion. The neutral data requirement is probably due to the adversarial training. This simple fine-tuning outperforms GAN fine-tuning.

Fig. 2. System descriptions: blue: grad. reversal, orange: grad. inverter, green: grad. inverter fine-tuned, red: neutral baseline, purple: recordings. Black horizontal bars connecting systems denote no statistically significant difference between them (two-tailed t-test).

We wanted to include a state-of-the-art baseline; however, we did not find any prior work on emotion conversion from spectrograms. We adapted the work of [4], based on their StarGAN implementation (https://github.com/glam-imperial/EmotionalConversionStarGAN), to use mel-spectrograms instead of WORLD vocoder features, but the quality of the synthesized speech was very low. It is likely that major adaptations to the model architecture are necessary to achieve competitive results. Creating such a baseline system is out of scope for this work, so we could not include a competitive state-of-the-art baseline model in our benchmark.

We randomly selected 10 neutral German samples from the held-out test set and converted them to each of the six emotion intensities. 24 native German listeners rated the samples in terms of emotion intensity and audio quality in a MUSHRA [20] test on a scale from 0 to 100.
We asked listeners to rate the emotion intensity, providing another neutral recording (a different sentence) as a reference of 0. We also included another recording of the same emotion by the target speaker as an upper anchor, as well as the utterance generated by a neutral baseline system. The top row of Figure 2 shows that our gradient inverter model outperforms vanilla gradient reversal for medium excited and is similar for high excited (no statistically significant difference according to a two-tailed t-test, denoted as a horizontal bar in the plots), while it is significantly worse for low excited. The exp square norm function (Equation 3) only scales large gradients down, which does not seem to be optimal for the excited intensities. For disappointed, the gradient inverter model scores more than 20 MUSHRA points higher across all intensities, demonstrating the improvement brought by the gradient inverter function. We either have not yet found a gradient inverter function which generalises across emotions, or the function should be chosen depending on the use case. Fine-tuning lowers the emotion intensity for the medium and high emotions, which shows an averaging effect of the neutral and low-intensity data. It should also be noted that we see a clear ascent from the low to the high intensity, but do not yet reach the emotion intensity of the recordings except for low disappointed. We were only able to partially address the averaging effect in the decoder, which might reveal a general shortcoming of current decoder architectures. Highly expressive data in another language seems to improve the system only up to a certain point. More highly expressive German recordings, even from other speakers, might push the emotion intensity further.

We compared the same systems as above, but without a reference sample, and asked the listeners to rate the audio quality (Figure 2, bottom row). We do not see a statistically significant difference between any of the systems for medium and high excited. Vanilla gradient reversal outperforms both other techniques for low excited and all disappointed intensities, but they are still on par with the recordings. We see a trade-off between emotion intensity and audio quality here: we usually found that higher emotion intensities suffer from reduced signal quality, most likely because low intensities are close to the neutral samples for which we have a lot of training data. This leads back to the averaging effect in the decoder. We suggest exploring different decoder architectures more suitable for highly expressive speaking styles. While we are not yet able to reach the emotion intensity of the recordings, we achieve high audio quality at a generally lower intensity level. Fine-tuning did not achieve the desired improvement in audio quality. Even though it increased the MUSHRA score for five out of six emotions, the difference is only statistically significant for low disappointed. The increase in audio quality might be a consequence of the lower emotion intensity rather than of the fine-tuning itself. However, for low disappointed, fine-tuning increased audio quality without reducing emotion intensity.
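The statistical analysis behind the horizontal bars could be performed as in the following sketch. Treating the scores as paired per stimulus and using alpha = 0.05 are assumptions, as the text above only states that a two-tailed t-test was used.

```python
import numpy as np
from scipy import stats

def no_significant_difference(scores_a, scores_b, alpha=0.05):
    """Pairwise comparison of two systems' listener scores with a two-tailed t-test
    (paired per stimulus and alpha=0.05 are assumptions)."""
    result = stats.ttest_rel(scores_a, scores_b)   # two-tailed by default
    return result.pvalue >= alpha                  # True -> systems get a connecting bar

# Hypothetical MUSHRA ratings: 24 listeners x 10 utterances, flattened per system.
rng = np.random.default_rng(0)
grad_inverter = rng.integers(40, 90, size=240)
grad_reversal = rng.integers(40, 90, size=240)
print(no_significant_difference(grad_inverter, grad_reversal))
```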
5. CONCLUSION
We proposed EmoCat, a novel language-agnostic EVC model based on CopyCat, which operates directly on mel-spectrograms. It can convert neutral to emotional samples in German with less than 45 minutes of German emotional recordings. It achieves this by leveraging large amounts of emotional English data with the same emotions. Even though the model is able to generate expressive speech at different intensities, we are not yet matching the expressiveness of the recordings. Moreover, we presented the gradient inverter block, an improvement to gradient reversal. It showed statistically significant improvements in emotion intensity for four out of six emotions in subjective listening tests. We also found minor improvements in audio quality, at the cost of emotion intensity, through fine-tuning on the target emotional data. Future work is required to investigate the influence of increasing the amount of emotional German data and further improvements to the gradient inverter functions.

6. REFERENCES

[1] Vatsal Aggarwal, Marius Cotescu, Nishant Prateek, Jaime Lorenzo-Trueba, and Roberto Barra-Chicote, "Using VAEs and normalizing flows for one-shot text-to-speech synthesis of expressive speech," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6179–6183.

[2] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.

[3] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 266–273.

[4] Georgios Rizos, Alice Baird, Max Elliott, and Björn Schuller, "StarGAN for emotional speech conversion: Validated by data augmentation of end-to-end emotion recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 3502–3506.

[5] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal, "Towards achieving robust universal neural vocoding," Proc. Interspeech 2019, pp. 181–185, 2019.

[6] Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, and Thomas Drugman, "CopyCat: Many-to-many fine-grained prosody transfer for neural text-to-speech," arXiv preprint arXiv:2004.14617, 2020.

[7] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in Proc. International Conference on Machine Learning, Long Beach, California, USA, 09–15 Jun 2019, vol. 97 of Proceedings of Machine Learning Research, pp. 5210–5219, PMLR.

[8] Yaroslav Ganin and Victor Lempitsky, "Unsupervised domain adaptation by backpropagation," in International Conference on Machine Learning. PMLR, 2015, pp. 1180–1189.

[9] Jian Gao, Deep Chakraborty, Hamidou Tembine, and Olaitan Olaleye, "Nonparallel emotional speech conversion," Proc. Interspeech 2019, pp. 2858–2862, 2019.

[10] Kun Zhou, Berrak Sisman, and Haizhou Li, "Transforming spectrum and prosody for emotional voice conversion with non-parallel training data," in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 230–237.

[11] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.

[12] Zixing Zhang, Bingwen Wu, and Björn Schuller, "Attention-augmented end-to-end multi-task learning for emotion prediction from speech," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6705–6709.

[13] Songxiang Liu, Yuewen Cao, and Helen Meng, "Emotional voice conversion with cycle-consistent adversarial network," arXiv preprint arXiv:2004.03781, 2020.

[14] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, "Converting anyone's emotion: Towards speaker-independent emotional voice conversion," arXiv preprint arXiv:2005.07025, 2020.

[15] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," arXiv preprint arXiv:1704.00849, 2017.

[16] Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, and Thomas Drugman, "Fine-grained robust prosody transfer for single-speaker neural text-to-speech," Proc. Interspeech 2019, pp. 4440–4444, 2019.

[17] Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, and Viacheslav Klimkov, "Effect of data reduction on sequence-to-sequence neural TTS," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7075–7079.

[18] Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, and Jaime Lorenzo-Trueba, "Dynamic prosody generation for speech synthesis using linguistics-driven acoustic embedding selection," arXiv preprint arXiv:1912.00955, 2019.

[19] Yanyao Bian, Changbin Chen, Yongguo Kang, and Zhenglin Pan, "Multi-reference Tacotron by intercross training for style disentangling, transfer and control in speech synthesis," arXiv preprint arXiv:1904.02373, 2019.

[20] B Series, "Method for the subjective assessment of intermediate quality level of audio systems," Recommendation ITU-R BS.1534.