Nonparallel Emotional Speech Conversion
Jian Gao, Deep Chakraborty, Hamidou Tembine, Olaitan Olaleye
Department of Computer Science and Engineering, New York University, USA
College of Information and Computer Sciences, University of Massachusetts Amherst, USA
Signify (formerly Philips Lighting) Research, North America, USA
{jg4631, tembine}@nyu.edu, [email protected], [email protected]

Abstract
We propose a nonparallel data-driven emotional speech conversion method. It enables the transfer of emotion-related characteristics of a speech signal while preserving the speaker's identity and linguistic content. Most existing approaches require parallel data and time alignment, which is not available in many real applications. We achieve nonparallel training based on an unsupervised style transfer technique, which learns a translation model between two distributions instead of a deterministic one-to-one mapping between paired examples. The conversion model consists of an encoder and a decoder for each emotion domain. We assume that the speech signal can be decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion. We tested our method on a nonparallel corpus with four emotions. The evaluation results show the effectiveness of our approach.
Index Terms: Emotional Speech Conversion, Non-parallel Training, Style Transfer, Autoencoder, GANs
1. Introduction
Voice transformation (VT) is a technique to modify some properties of human speech while preserving its linguistic information. VT can be applied to change the speaker identity, i.e., voice conversion (VC) [1], or to transform the speaking style of a speaker, such as emotion and accent conversion [2]. In this work, we focus on emotion voice transformation. The goal is to change emotion-related characteristics of a speech signal while preserving its linguistic content and speaker identity. Emotion conversion techniques can be applied to various tasks, such as enhancing computer-generated speech, hiding negative emotions, helping film dubbing, and creating more expressive voice messages on social media.

Traditional VC approaches cannot be applied directly because they change speaker identity by assuming pronunciation and intonation to be a part of the speaker-independent information. Since the speaker's emotion is mainly conveyed by prosodic aspects, some studies have focused on modelling prosodic features such as pitch, tempo, and volume [3, 4]. In [5], a rule-based emotional voice conversion system was proposed. It modifies prosody-related acoustic features of neutral speech to generate different types of emotions. The speech analysis-synthesis tool STRAIGHT [6] was used to extract the fundamental frequency (F0) and power envelope from raw audio. These features were parameterized and modified based on the Fujisaki model [7] and a target prediction model [8]. The converted features were then fed back into STRAIGHT to re-synthesize speech waveforms with the desired emotions. However, this method requires temporally aligned parallel data, which is difficult to obtain in real applications, and accurate time alignment requires manual segmentation of the speech signal at the phoneme level, which is very time consuming.

To address these issues, we propose a nonparallel training method. Instead of learning a one-to-one mapping between paired emotional utterances (x_1, x_2), we train a conversion model between two emotional domains (X_1, X_2). Inspired by disentangled representation learning in image style transfer [9, 10], we assume that each speech signal x_i ∈ X_i can be decomposed into a content code c ∈ C that represents emotion-invariant information and a style code s_i ∈ S_i that represents emotion-dependent information. C is shared across domains and contains the information we want to preserve. S_i is domain-specific and contains the information we want to change. In the conversion stage, we extract the content code of the source speech and recombine it with the style code of the target emotion. A generative adversarial network (GAN) [11] is added to improve the quality of the converted speech. Our approach is nonparallel, text-independent, and does not rely on any manual operation.

We evaluated our approach on IEMOCAP [12] for four emotions: angry, happy, neutral, and sad, which are widely studied in the emotional speech recognition literature [13]. To our knowledge, this is the first attempt at nonparallel emotion conversion on this dataset, though synthetic feature representations of emotional speech were proposed in [14]. We evaluate the model's conversion ability by the percentage change from the source emotion to the target emotion. A subjective evaluation on Amazon MTurk with hundreds of listeners was conducted. It shows that our model can effectively change emotions and retain the speaker identity.

The rest of the paper is organized as follows: Section 2 presents the relation to prior work.
Section 3 gives a detailed description of our model. Experiments and evaluation results are reported in Section 4. Finally, we conclude in Section 5.
2. Related Work
Previous emotion conversion methods directly modify parameterized prosody-related features that convey emotions. The use of Gaussian mixture models (GMM) for spectrum transformation was first proposed in [15]. A recent work [5] explored four types of acoustic features: F0 contour, spectral sequence, duration, and power envelope, and investigated their impact on emotional speech synthesis. The authors found that F0 and the spectral sequence are the dominant factors in emotion conversion, while power envelope and duration alone have little influence. They further claimed that all emotions can be synthesized by modifying the spectral sequence, but did not provide a method to do it. In this paper, we focus on learning the conversion models for F0 and the spectral sequence.

2.2. Nonparallel training approaches

Parallel data means utterances with the same linguistic content but varying in the aspects to be studied. Since parallel data is hard to collect, nonparallel approaches have been developed. Some borrow ideas from image-to-image translation [16] and create GAN models [11] suitable for speech, such as VC-VAW-GAN [17], SVC-GAN [18], VC-CycleGAN [19, 20], and VC-StarGAN [14]. Another trend is based on auto-regressive models like WaveNet [21]. Although these can train directly on raw audio without feature extraction, the heavy computational load and the huge amount of training data required are not affordable for most users.
Our work draws inspiration from recent studies in image style transfer. A basic idea is to find disentangled representations that can independently model image content and style. It is claimed in [9] that a convolutional neural network (CNN) is an ideal representation to factorize semantic content and artistic style. They introduced a method to separate and recombine the content and style of natural images by matching feature correlations in different convolutional layers. For us, the task is to find disentangled representations for the speech signal that can split emotion from speaker identity and linguistic content.
3. Method
The research on human emotion expression and perception has two major conclusions. First, human emotion perception is a multi-layered process: humans do not perceive emotion directly from acoustic features, but through an intermediate layer of semantic primitives [22]. The authors introduced a three-layered model and learnt the connections with a fuzzy inference system. Other researchers found that adding middle layers can improve emotion recognition accuracy [23]. Based on this finding, we suggest the use of multilayer perceptrons (MLP) to extract emotion-related information in speech signals. Second, the emotion generation process of human speech follows the opposite direction of emotion perception. This means the encoding process of the speaker is the inverse operation of the decoding process of the listener. We assume that emotional speech generation and perception share the same representation methodology, that is, the encoder and decoder are inverse operations with mirror structures.

Let x_1 ∈ X_1 and x_2 ∈ X_2 be utterances drawn from two different emotional categories. Our goal is to learn a mapping between the two distributions p(x_1) and p(x_2). Since the joint distribution p(x_1, x_2) is unknown for nonparallel data, the conversion models p(x_2 | x_1) and p(x_1 | x_2) cannot be directly estimated. To solve this problem, we make two assumptions: (i) the speech signal can be decomposed into an emotion-invariant content code and an emotion-dependent style code; (ii) the encoder E and decoder G are inverse functions.

Fig. 1 shows the generative model of speech with a partially shared latent space. A pair of corresponding speech signals (x_1, x_2) is assumed to have a shared content code c ∈ C and emotion-related style codes s_1 ∈ S_1, s_2 ∈ S_2. For any emotional speech x_i, we have a deterministic decoder x_i = G_i(c_i, s_i) and its inverse encoders c_i = E^c_i(x_i), s_i = E^s_i(x_i). To convert emotion, we extract the content code of the source speech and recombine it with the style code of the target emotion:

x'_{2←1} = G_2(c_1, s_2) = G_2(E^c_1(x_1), s_2)
x'_{1←2} = G_1(c_2, s_1) = G_1(E^c_2(x_2), s_1)    (1)

It should be noted that the style code s_i is not inferred from a single utterance, but learnt from the entire emotion domain. This is because the emotion style of one utterance is ambiguous and may not capture the general characteristics of the target emotion. It makes our assumption slightly different from the cycle consistency constraint [24], which assumes that an example converted to another domain and converted back should remain the same as the original, i.e., x''_{1←2←1} = x_1. Instead, we apply a semi-cycle consistency in the latent space by assuming that E^c_2(x'_{2←1}) = c_1 and E^s_2(x'_{2←1}) = s_2.

Figure 1: Speech autoencoder model with partially shared latent space. Speech with emotion i is decomposed into an emotion-specific space S_i and a shared content space C. Corresponding speech signals (x_1, x_2) are encoded to the same content code.

Fig. 2 shows an overview of our nonparallel emotional speech conversion system. The features are extracted by WORLD [25], converted separately, and recombined for synthesis. We modify F0 by a linear transform to match the statistics of the fundamental frequencies in the target emotion domain. The conversion is performed by log-Gaussian normalization:

f_2 = exp((log f_1 − µ_1) · (σ_2 / σ_1) + µ_2)    (2)

where µ_i, σ_i are the mean and standard deviation of log F0 obtained from the source and target emotion sets.
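To make the F0 conversion step concrete, the following is a minimal NumPy sketch of the log-Gaussian normalization in Eq. (2). The function names and the handling of unvoiced frames (F0 = 0) are our own assumptions, not the paper's reference implementation.

import numpy as np

def log_f0_stats(f0_list):
    """Mean and std of log F0 over a set of utterances (voiced frames only)."""
    log_f0 = np.concatenate([np.log(f0[f0 > 0]) for f0 in f0_list])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0_src, src_stats, tgt_stats):
    """Log-Gaussian normalized F0 transform, Eq. (2)."""
    mu_s, sigma_s = src_stats   # statistics of the source emotion domain
    mu_t, sigma_t = tgt_stats   # statistics of the target emotion domain
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0         # unvoiced frames (F0 = 0) are left untouched
    f0_conv[voiced] = np.exp(
        (np.log(f0_src[voiced]) - mu_s) * sigma_t / sigma_s + mu_t
    )
    return f0_conv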
Aperiodicity (AP) is mapped directly since it does not contain emotion-related information.

Figure 2: Overview of the nonparallel emotion conversion system.
For the spectral sequence, we use a low-dimensional representation in the mel-cepstrum domain to reduce complexity. Kameoka et al. [26] show that 50 MCEP coefficients are enough to synthesize full-band speech without quality degradation. The spectral conversion is learnt by the autoencoder model in Fig. 1. The encoders and decoders are implemented with gated CNNs [27]. In addition, a GAN module is added and trained by robust optimization [28] to produce realistic spectral frames. For each emotion domain, our model has four subnetworks E^c, E^s, G, D, in which D is the GAN discriminator that distinguishes real samples from machine-generated samples.

We jointly train the encoders, decoders, and the GAN discriminators with the multiple losses displayed in Fig. 3. To keep the encoder and decoder as inverse operations, we apply a reconstruction loss in the direction x_i → (c_i, s_i) → x'_i: the spectral sequence should not change after encoding and decoding.

L^{x_i}_recon = E_{x_i}[ ‖ x_i − x'_i ‖ ],  x'_i = G_i(E^c_i(x_i), E^s_i(x_i))    (3)

Figure 3: Training with multiple loss functions.
In our model, the latent space is only partially shared, so the cycle consistency constraint [24] is not preserved, i.e., x''_{1←2←1} ≠ x_1. We instead apply a semi-cycle loss in the coding directions c_1 → x'_{2←1} → c'_{2←1} and s_2 → x'_{2←1} → s'_{2←1}:

L^{c_1}_cycle = E_{c_1, s_2}[ ‖ c_1 − c'_{2←1} ‖ ],  c'_{2←1} = E^c_2(x'_{2←1})
L^{s_2}_cycle = E_{c_1, s_2}[ ‖ s_2 − s'_{2←1} ‖ ],  s'_{2←1} = E^s_2(x'_{2←1})    (4)

Moreover, we add a GAN module to improve the speech quality: the converted samples should be indistinguishable from real samples in the target emotion domain. The GAN loss is computed between x'_{i←j} and x_i, (i ≠ j):

L^i_GAN = E_{c_j, s_i}[ log(1 − D_i(x'_{i←j})) ] + E_{x_i}[ log D_i(x_i) ]    (5)

The full loss is the weighted sum of L_recon, L_cycle, and L_GAN:

min_{E^c_1, E^s_1, E^c_2, E^s_2, G_1, G_2} max_{D_1, D_2} L(E^c_1, E^s_1, E^c_2, E^s_2, G_1, G_2, D_1, D_2)
  = λ_s (L^{s_1}_cycle + L^{s_2}_cycle) + λ_c (L^{c_1}_cycle + L^{c_2}_cycle)
  + λ_x (L^{x_1}_recon + L^{x_2}_recon) + λ_g (L^1_GAN + L^2_GAN)    (6)

where λ_s, λ_c, λ_x, λ_g control the weights of the components.
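As an illustration, the PyTorch-style sketch below assembles the generator/encoder side of Eqs. (3)-(6) for one training step. The module interfaces (E1c, G2, D1, ...), the L1 distances, and the use of per-utterance style codes are simplifying assumptions: the paper learns a domain-level style code, does not state the norm explicitly, and trains the discriminators separately to maximize Eq. (5).

import torch
import torch.nn.functional as F

def generator_losses(x1, x2, E1c, E1s, E2c, E2s, G1, G2, D1, D2,
                     lambda_s=1.0, lambda_c=1.0, lambda_x=10.0, lambda_g=1.0):
    """Weighted objective of Eq. (6), generator/encoder side only.

    E*c/E*s are content/style encoders, G* decoders, and D* discriminators
    with sigmoid outputs (see Table 1).
    """
    # Encode both domains into content and style codes
    c1, s1 = E1c(x1), E1s(x1)
    c2, s2 = E2c(x2), E2s(x2)

    # Within-domain reconstruction, Eq. (3)
    l_recon = F.l1_loss(G1(c1, s1), x1) + F.l1_loss(G2(c2, s2), x2)

    # Cross-domain conversion, Eq. (1): source content + target style
    x2_from_1 = G2(c1, s2)
    x1_from_2 = G1(c2, s1)

    # Semi-cycle consistency in the latent space, Eq. (4)
    l_cycle_c = F.l1_loss(E2c(x2_from_1), c1) + F.l1_loss(E1c(x1_from_2), c2)
    l_cycle_s = F.l1_loss(E2s(x2_from_1), s2) + F.l1_loss(E1s(x1_from_2), s1)

    # Generator part of the adversarial loss, Eq. (5): fool the discriminators
    eps = 1e-8
    l_gan = torch.log(1.0 - D2(x2_from_1) + eps).mean() \
          + torch.log(1.0 - D1(x1_from_2) + eps).mean()

    # Weighted sum, Eq. (6); the discriminators are updated in a separate step
    return (lambda_x * l_recon + lambda_c * l_cycle_c
            + lambda_s * l_cycle_s + lambda_g * l_gan)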
4. Experiments
Training emotional speech conversion models often suffers from lack of data. Parallel datasets such as Emo-DB [29] and RAVDESS [30] have limited sentence diversity and are difficult to build. Our end-to-end model is trained on raw audio signals of natural speech, and does not rely on paired data or any manual operations. The training set can be collected from daily conversations in everyday life.
We evaluated the proposed approach on the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) [12]. It is organized in five sessions and contains 12 hours of audio-visual data. Each session records natural dialogues between a pair of speakers in scripted and improvised scenarios, in which the emotions are naturally elicited. In this paper, we only consider four emotional categories: 1) angry, 2) happy, 3) neutral, 4) sad. Since the model is not designed to change the speaker identity, experiments are conducted for each speaker independently. We only use the utterances with a clear majority vote regarding the ground-truth labels. There are 2754 utterances shared amongst the four emotional labels: ang (747), hap (675), neu (788), sad (544). Training and testing sets are non-overlapping utterances randomly selected from the same speaker (80% for training, 20% for testing). For example, in session 1 there are 420 training samples and 108 testing samples for the female speaker.

Training samples with a fixed length of 128 frames are randomly selected from the raw audio sequences. Energy-based voice activity detection (VAD) is used to remove silent frames. We use the WORLD [25] vocoder to extract fundamental frequencies (F0), spectral sequences (SP), and aperiodicities (AP) from raw audio waveforms sampled at 16 kHz. The frame length is 5 ms. After coding, we take the first 24 Mel-cepstral coefficients (MCEPs) as feature vectors. The mean and variance of the entire training set are calculated for feature normalization. Testing samples can have arbitrary temporal length and can be converted in real time.
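This feature extraction pipeline can be reproduced with the open-source WORLD bindings. The sketch below uses the pyworld and pysptk packages; the audio loader, the FFT length, and the all-pass constant alpha = 0.42 for 16 kHz are our assumptions rather than settings stated in the paper.

import numpy as np
import pyworld
import pysptk
import soundfile as sf

FRAME_PERIOD_MS = 5.0   # frame length used in the paper
MCEP_DIM = 24           # number of Mel-cepstral coefficients kept
ALPHA = 0.42            # all-pass constant for 16 kHz (assumption)
FFT_LEN = 1024          # WORLD's default FFT size at 16 kHz (assumption)

def extract_features(wav_path):
    """WORLD analysis of one utterance: F0, 24 MCEPs, and aperiodicity."""
    x, fs = sf.read(wav_path)                       # expects 16 kHz mono audio
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pyworld.harvest(x, fs, frame_period=FRAME_PERIOD_MS)
    sp = pyworld.cheaptrick(x, f0, t, fs)           # smoothed spectral envelope
    ap = pyworld.d4c(x, f0, t, fs)                  # band aperiodicity
    mcep = pysptk.sp2mc(sp, order=MCEP_DIM - 1, alpha=ALPHA)
    return f0, mcep, ap

def synthesize(f0, mcep, ap, fs=16000):
    """Re-synthesize a waveform from (possibly converted) features."""
    sp = pysptk.mc2sp(mcep, alpha=ALPHA, fftlen=FFT_LEN)
    return pyworld.synthesize(f0, sp, ap, fs, frame_period=FRAME_PERIOD_MS)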
The network architecture is illustrated in Fig. 4, with details listed in Table 1. The autoencoders take 24-dimensional MCEPs as input and learn disentangled representations of content and style. In the content encoder, instance normalization (IN) [31] removes the original feature mean and variance that represent emotional style information. In the style encoder, the emotional characteristics are encoded by an MLP that outputs channel-wise mean and variance µ(s), σ(s). These are then fed into the decoder to reconstruct the MCEP features. The desired emotion is added through an adaptive instance normalization (AdaIN) [32] layer before the activation. This mechanism is similar to the conversion model of F0 in Eq. (2):

AdaIN(c, s) = σ(s) · ( (c − µ(c)) / σ(c) ) + µ(s)    (7)

The encoders and decoders are implemented with 1D CNNs to capture the temporal dependencies, while the GAN discriminators are implemented with 2D CNNs to capture the spectro-temporal patterns. Higher-resolution data is generated by the pixel shuffler layer in the upsample blocks. All networks use gated linear units (GLU) [27] to keep track of sequential information.
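A minimal PyTorch sketch of the AdaIN operation in Eq. (7) is shown below; the class name and tensor layout (batch, channels, frames) are our own conventions.

import torch
import torch.nn as nn

class AdaIN1d(nn.Module):
    """Adaptive instance normalization over the time axis, as in Eq. (7)."""

    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, c, style_mu, style_sigma):
        # c: (batch, channels, frames); style_mu / style_sigma are the
        # channel-wise statistics produced by the style encoder's MLP,
        # shaped (batch, channels, 1).
        mu = c.mean(dim=2, keepdim=True)
        sigma = c.std(dim=2, keepdim=True)
        return style_sigma * (c - mu) / (sigma + self.eps) + style_mu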
Training details: We use the Adam optimizer. The learning rate is initialized separately for the discriminators D and for E^c, E^s, G, and linear decay is applied after an initial number of iterations. The loss weights are chosen as λ_s = λ_c = λ_g = 1 and λ_x = 10. E^c, E^s, G are trained for two iterations per iteration of D during the initial phase; after that, they are trained equally.

Figure 4: The network structure of the content encoder, style encoder, decoder, and GAN discriminator.
Table 1: Network architecture. C-F-K-S-X indicates a convolution layer with F filters, kernel size K, stride S, and shuffle X. IN is instance normalization; all modules use GLU activation.
Content Encoder
  Conv1d, IN, GLU: C-128-15-1
  Downsample1d blocks
Style Encoder
  Conv1d, GLU: C-128-15-1
  Downsample1d blocks (without IN)
Decoder
  Adaptive Resblock1d blocks
Discriminator
  Conv2d: C-128-(3,3)-(1,2)
  Downsample2d: C-256-(3,3)-(2,2)
  Downsample2d: C-512-(3,3)-(2,2)
  Downsample2d: C-1024-(6,3)-(1,2)
  Dense layer, sigmoid output (real/fake)
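To illustrate the gated-CNN building block used throughout Table 1, here is a PyTorch sketch of a 1D downsampling block with a GLU gate; the kernel size and stride are placeholder values, since the downsample specifications are not fully recoverable from the table.

import torch.nn as nn

class GLUDownsample1d(nn.Module):
    """Gated-CNN downsampling block: Conv1d -> (optional) IN -> GLU.

    The convolution emits 2 * out_ch channels; nn.GLU splits them into a
    linear path and a sigmoid gate [27].
    """
    def __init__(self, in_ch, out_ch, kernel_size=5, stride=2, use_in=True):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.InstanceNorm1d(2 * out_ch, affine=True) if use_in else nn.Identity()
        self.glu = nn.GLU(dim=1)

    def forward(self, x):
        return self.glu(self.norm(self.conv(x)))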
We evaluate the generated speech on three metrics: voice quality, speaker similarity, and emotion conversion ability.
Subjective evaluation
We perform perception tests on Amazon Mechanical Turk. Each utterance was listened to by multiple random human workers, and each worker could answer at most 5 HITs in a single experiment. To evaluate voice quality and speaker similarity, the listeners were asked to give a 5-scale opinion score (5 for the best, 1 for the worst). The mean opinion score (MOS) is shown in Fig. 5. To annotate the emotion state, each listener was asked to choose a label from the source and target emotions. For example, in the trial "ang2neu", utterances with label "ang" in IEMOCAP were converted to "neu", and the generated speech was labelled by the majority vote of the human annotators. We compute the percentage change from the source emotion to the target emotion; a higher value indicates a stronger emotion conversion ability. We choose four emotion pairs with significant differences [29]. The baseline models are a simple linear F0 conversion system [33] and a neural network model, VC-StarGAN [26]. Results are displayed in Fig. 6. Details and some converted speech samples are provided online.
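For clarity, the conversion-rate metric described above can be computed as follows; this is our reading of the procedure (majority vote per utterance, then the fraction labelled with the target emotion), and the function name and data layout are illustrative.

from collections import Counter

def conversion_rate(vote_lists, target_label):
    """Fraction of converted utterances whose majority-vote label equals the
    target emotion, e.g. conversion_rate(votes, "neu") for the "ang2neu" trial.

    `vote_lists` holds the listener labels for each converted utterance,
    e.g. [["neu", "neu", "ang"], ["ang", "ang", "neu"], ...].
    """
    hits = 0
    for votes in vote_lists:
        majority_label, _ = Counter(votes).most_common(1)[0]
        hits += int(majority_label == target_label)
    return hits / len(vote_lists)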
Results and discussion
Note that not all utterances can be successfully converted, because some emotions are delivered by linguistic information, an immutable part in our setting. Our model is slightly better than VC-StarGAN in terms of emotion conversion ability (average 48% vs 44%) and speaker similarity (average 3.55 vs 3.05). One reason is that VC-StarGAN is designed for voice conversion among different speakers, while our model learns disentangled representations that can decompose the emotional characteristics and the speaker identity. Moreover, VC-StarGAN has poor voice quality in the directions sad2ang (1.71) and sad2hap (1.81). In [26], all emotions are trained together, which is unfair to the sad domain since it has a lower signal-to-noise ratio, and the noise may be amplified when converting to more energetic emotions.

Figure 5: MOS for voice quality and speaker similarity. Left: voice quality. Right: speaker similarity; "2ang" means the target emotion is Angry, compared with originally Angry speech.
Figure 6: Comparison of the emotion conversion ability of our model and the baseline systems: (1) F0 conversion, (2) VC-StarGAN [26]. "ang2neu" denotes conversion from Angry to Neutral.
5. Conclusion
We proposed a nonparallel emotional speech conversion approach based on style transfer autoencoders. As our model does not require any paired data, transcripts, or time alignment, it is easy to apply in real-world situations. To the best of our knowledge, this is the first work on nonparallel emotion conversion using style transfer. Future work includes phonetic duration conversion and designing a general model for unseen speakers.
6. Acknowledgements
This research was supported by Signify Research and the U.S. Air Force under grant FA9550-17-1-0259.

7. References

[1] S. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65-82, 2017.
[2] G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Accent conversion using phonetic posteriorgrams," in ICASSP. IEEE, 2018, pp. 5314-5318.
[3] M. Wang, M. Wen, K. Hirose, and N. Minematsu, "Emotional voice conversion for mandarin using tone nucleus model - small corpus and high efficiency," in Speech Prosody 2012, 2012.
[4] Z. Wang and Y. Yu, "Multi-level prosody and spectrum conversion for emotional speech synthesis," in Signal Processing (ICSP). IEEE, 2014, pp. 588-593.
[5] Y. Xue, Y. Hamada, and M. Akagi, "Voice conversion for emotional speech: Rule-based synthesis with degree of emotion controllable in dimensional space," Speech Communication, vol. 102, pp. 54-67, 2018.
[6] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187-207, 1999.
[7] H. Fujisaki and K. Hirose, "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," Journal of the Acoustical Society of Japan (E), vol. 5, no. 4, pp. 233-242, 1984.
[8] Y. Xue and M. Akagi, "A study on applying target prediction model to parameterize power envelope of emotional speech," in RISP workshop NCSP'16, 2016.
[9] L. Gatys, A. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in CVPR. IEEE, 2016, pp. 2414-2423.
[10] X. Huang, M. Liu, S. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation," in The European Conference on Computer Vision (ECCV), September 2018.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in NIPS, 2014, pp. 2672-2680.
[12] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[13] M. Neumann and N. T. Vu, "Improving speech emotion recognition with unsupervised representation learning on unlabeled speech," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 7390-7394.
[14] S. Sahu, R. Gupta, and C. Y. Espy-Wilson, "On enhancing speech emotion recognition using generative adversarial networks," in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, 2018, pp. 3693-3697.
[15] H. Kawanami, Y. Iwami, T. Toda, H. Saruwatari, and K. Shikano, "GMM-based voice conversion applied to emotional speech synthesis," in Eurospeech, 2003.
[16] M. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc., 2017, pp. 700-708.
[17] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in Proc. Interspeech 2017, 2017, pp. 3364-3368.
[18] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in Proc. Interspeech 2017, 2017, pp. 1283-1287.
[19] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, "High-quality nonparallel voice conversion based on cycle-consistent adversarial network," arXiv preprint arXiv:1804.00425, 2018.
[20] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," CoRR, vol. abs/1904.04631, 2019. [Online]. Available: http://arxiv.org/abs/1904.04631
[21] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in SSW, 2016, p. 125.
[22] C. Huang and M. Akagi, "A three-layered model for expressive speech perception," Speech Communication, vol. 50, no. 10, pp. 810-828, 2008.
[23] X. Li and M. Akagi, "Multilingual speech emotion recognition system based on a three-layer model," in INTERSPEECH, 2016.
[24] J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in IEEE ICCV, Oct 2017.
[25] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
[26] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 266-273.
[27] Y. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in ICML, 2017, pp. 933-941.
[28] J. Gao, Y. Xu, J. Barreiro-Gomez, M. Ndong, M. Smyrnakis, and H. Tembine, "Distributionally robust optimization," in Optimization Algorithms, J. Valdman, Ed. Rijeka: IntechOpen, 2018, ch. 1. [Online]. Available: https://doi.org/10.5772/intechopen.76686
[29] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005, 2005, pp. 1517-1520.
[30] S. R. Livingstone and F. A. Russo, "The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PLOS ONE, vol. 13, no. 5, pp. 1-35, 2018.
[31] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," CoRR, vol. abs/1607.08022, 2016.
[32] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in ICCV, Oct 2017.
[33] J. Tao, Y. Kang, and A. Li, "Prosody conversion from neutral speech to emotional speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1145-1154, 2006.