Non-parallel Emotion Conversion using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator
Ravi Shankar, Jacob Sager, Archana Venkataraman
Department of Electrical and Computer Engineering, Johns Hopkins University, USA [email protected], [email protected], [email protected]
Abstract
We introduce a novel method for emotion conversion in speech that does not require parallel training data. Our approach loosely relies on a cycle-GAN schema to minimize the reconstruction error from converting back and forth between emotion pairs. However, unlike the conventional cycle-GAN, our discriminator classifies whether a pair of input real and generated samples corresponds to the desired emotion conversion (e.g., A → B) or to its inverse (B → A). We will show that this setup, which we refer to as a variational cycle-GAN (VCGAN), is equivalent to minimizing the empirical KL divergence between the source features and their cyclic counterpart. In addition, our generator combines a trainable deep network with a fixed generative block to implement a smooth and invertible transformation on the input features, in our case, the fundamental frequency (F0) contour. This hybrid architecture regularizes our adversarial training procedure. We use crowd sourcing to evaluate both the emotional saliency and the quality of synthesized speech. Finally, we show that our model generalizes to new speakers by modifying speech produced by Wavenet.
Index Terms: Adversarial Networks, Unsupervised Learning, Emotion Conversion, Deformable Registration
1. Introduction
From automated customer support to hand-held devices, speech synthesis plays an important role in modern-day technology. While speech synthesis has undergone revolutionary advancements over the past few years, generating emotional cues remains an open challenge in the field. Emotional speech synthesis has the potential to facilitate more natural and meaningful human-computer interactions, and it provides a foundation for studying human intent, perception, and behavior [1].

The success of deep neural networks has brought about a swift change in how speech synthesis is approached. Deep neural networks can generate natural sounding speech given enough training examples [2, 3, 4]. However, these models have little control over the speaking style, including emotional inflection. One reason is the lack of training data to learn networks specific to each emotional class. Unsupervised models such as [5, 6] provide a promising middle ground by separating the speaking style from the content. However, it is difficult to tune the parameters of these models to synthesize speech in a predetermined emotion. Furthermore, the synthesis rate of these state-of-the-art models is slow due to their autoregressive scheme [7]. These limitations motivate the use of emotion conversion as an alternative to end-to-end synthesis. Broadly, the goal of emotion conversion is to modify the perceived affect of a speech utterance without changing its linguistic content or speaker identity. It allows a user greater control over the speaking style while being easy to train on limited data resources.

Emotional cues in speech are conveyed through vocal inflections known as prosody. Key attributes of prosody include the fundamental frequency (F0) contour, the relative energy of the signal, and the spectrum [8]. Many supervised and unsupervised algorithms have been proposed for emotion conversion. For example, the work of [9] proposed a Gaussian mixture model (GMM) to jointly model the source and target prosodic features. During inference, the target features are estimated from the source via a maximum likelihood optimization. A recent approach by [10] uses a Bidirectional LSTM (Bi-LSTM) to predict the spectrum and F0 contour. To overcome the data limitation, the authors pre-train their model on a voice conversion dataset and then fine-tune it for emotion conversion. The prosodic manipulation proposed by [11, 12] uses a highway neural network to predict the F0 and intensity for each frame of the input utterance. While these models have made significant contributions to the field, they require parallel emotional speech data for training, which limits their generalizability.

An unsupervised technique to disentangle style and content from speech has been proposed by [13]. This algorithm uses architecture-based priors to separate style and content from the spectrum while modifying the F0 using a linear Gaussian model. The authors of [14] offer a simpler cycle-GAN model for non-parallel emotion conversion, which independently modifies the spectrum and F0 contour. The latter is parameterized via a wavelet transform, which expands the input feature dimensionality. These approaches, however, are trained and evaluated on single speakers, with no validation on multi-speaker conversion.

In this paper, we propose a novel method for non-parallel emotion conversion that blends the cycle-GAN architecture with implicit regularization from a generative curve registration method.
Our novel loss function for the F0 conversion leads to an adversarial training scheme in which the discriminator classifies whether a pair of real and generated F0 contours represents a valid conversion. Another contribution of our model is that the generator combines a trainable deep neural network with a generative component to implement a smooth and invertible warping of the source F0 contour. The entire model is trained jointly by back-propagating through the generative block to optimize the parameters. We evaluate our model on the multi-speaker VESUS dataset [15] and use crowd sourcing to verify both the emotional saliency and speech quality of the converted utterances. We also demonstrate the generalizability of our model by converting speech produced by Google Wavenet [2].
2. Method
The foundation of our model is a cycle-GAN [16, 17], which optimizes the cycle consistency of converting back and forth between emotions. However, we adapt the traditional cycle-GAN framework to align the distribution of transformed source features to the distribution of target features. Accordingly, we refer to our model as a variational cycle-GAN (VCGAN).

Figure 1: Block diagram and neural network architectures of VCGAN. The top row shows the neural network used as the stochastic component of the generators. The bottom left figure shows the deterministic component of the generators as represented by an RNN. The bottom right figure illustrates the architecture of the neural network used as the discriminator for classifying the joint densities.
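The layer-level parameters in Fig. 1 (kernel sizes, channel counts, strides) did not survive extraction. As a rough illustration of the repeated Conv → InstanceNorm → GLU pattern that the figure describes, below is a minimal PyTorch sketch of one gated downsampling block; every size in it is a placeholder assumption rather than the paper's value:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUDownsampleBlock(nn.Module):
    """One Conv -> InstanceNorm -> GLU stage of the kind shown in Fig. 1.
    Channel counts, kernel size, and stride are illustrative only."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 5, stride: int = 2):
        super().__init__()
        # Emit 2*out_ch channels so the GLU can split them into value and gate halves.
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel, stride=stride, padding=kernel // 2)
        self.norm = nn.InstanceNorm1d(2 * out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GLU along the channel axis: first half * sigmoid(second half).
        return F.glu(self.norm(self.conv(x)), dim=1)
```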
2.1. Variational Cycle-GAN Loss

We train the generators in our VCGAN using three different loss terms. The first is a cycle consistency loss, which ensures invertibility of the generator transforms. Formally, let A and B denote the source and target emotion classes, respectively. In the context of F0 conversion, the cycle consistency loss ensures that a given F0 contour, denoted by p_A, is close to itself after undergoing the sequence of transformations A → B → A. Mathematically, this loss is expressed as L_C = E[‖p_A − p_A^c‖].

Ultimately, the cycle-consistency loss L_C is a sample-specific loss and only provides a weak coupling between the two generators after each cyclic transformation. This problem is exaggerated by the discriminator, which is agnostic to the presence of a second generator in the cycle-GAN. The inference process can be improved by recognizing that the input distribution of one generator is the learnable output distribution of the other. To leverage this information, we add a KL divergence penalty between the input distribution of a generator and the target distribution of its complementary generator. This allows us to exploit the cyclic property at a global level. Specifically, let p_A and p_B denote the source and target F0 contours of utterances in emotions A and B, respectively. Let G_γ : p_A → p_B and G_θ : p_B → p_A denote the two generators that transform the F0 contours between emotional classes. The corresponding learned data distributions are given by P_γ(p_B) and P_θ(p_A). The KL divergence loss for generator G_γ can be expressed as:

$$\mathcal{L}_{KL} = \mathrm{KL}\left( P(\mathbf{p}_A) \,\|\, P_\theta(\mathbf{p}_A) \right) \tag{1}$$

Optimizing this loss provides an additional coupling between the forward and backward transformations, one that entangles the two generators beyond the cyclic consistency loss. Next, we show that this loss eliminates the need for the discriminator to classify samples from the marginal distributions, i.e., real vs. generated. By the law of total probability, we can write P_θ(p_A) as:

$$P_\theta(\mathbf{p}_A) = \int P_\theta(\mathbf{p}_A \mid \mathbf{p}_B)\, P(\mathbf{p}_B)\, d\mathbf{p}_B \tag{2}$$

The integral form in Eq. (2), however, is intractable. To circumvent this, we use the variational trick and derive an upper bound on the KL penalty. Combining Eq. (1) and Eq. (2), we get:

$$
\begin{aligned}
\mathcal{L}_{KL} &= \int P(\mathbf{p}_A) \log \frac{P(\mathbf{p}_A)}{\int P_\theta(\mathbf{p}_A \mid \mathbf{p}_B) P(\mathbf{p}_B)\, d\mathbf{p}_B}\, d\mathbf{p}_A \\
&= -\int P(\mathbf{p}_A) \log \int \frac{P_\theta(\mathbf{p}_A \mid \mathbf{p}_B)\, P(\mathbf{p}_B)}{P_\gamma(\mathbf{p}_B \mid \mathbf{p}_A)}\, P_\gamma(\mathbf{p}_B \mid \mathbf{p}_A)\, d\mathbf{p}_B\, d\mathbf{p}_A - H(\mathbf{p}_A) \\
&\leq -\int P(\mathbf{p}_A) \int P_\gamma(\mathbf{p}_B \mid \mathbf{p}_A) \log \frac{P_\theta(\mathbf{p}_A \mid \mathbf{p}_B)\, P(\mathbf{p}_B)}{P_\gamma(\mathbf{p}_B \mid \mathbf{p}_A)}\, d\mathbf{p}_B\, d\mathbf{p}_A - H(\mathbf{p}_A) \\
&= \int P(\mathbf{p}_A) \int P_\gamma(\mathbf{p}_B \mid \mathbf{p}_A) \log \frac{P_\gamma(\mathbf{p}_B \mid \mathbf{p}_A)\, P(\mathbf{p}_A)}{P_\theta(\mathbf{p}_A \mid \mathbf{p}_B)\, P(\mathbf{p}_B)}\, d\mathbf{p}_B\, d\mathbf{p}_A
\end{aligned} \tag{3}
$$

where we have used Jensen's inequality between the second and third steps. The ratio of probabilities in Eq. (3) compares the joint densities of (p_A, p_B). We estimate this ratio by a discriminator network denoted by D_γ. This discriminator acts as a global connector between the generators G_γ and G_θ by classifying the joint densities. Notice that the KL term in Eq. (1) appears only as a function of the parameters θ, but the variational trick allows us to introduce the parameters of G_γ into the picture. When training G_γ, optimizing the upper bound on the KL divergence is equivalent to minimizing the adversarial loss:

$$\mathcal{L}_{KL} \leq \mathbb{E}_{\mathbf{p}_A \sim P(\mathbf{p}_A)}\left[ \mathbb{E}_{\mathbf{p}_B \sim P_\gamma}\left[ \log\left( D_\gamma(\mathbf{p}_A, \mathbf{p}_B) \right) \right] \right] \tag{4}$$

So far, we have derived the generator loss based on the F0 contour.
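To make the generator-side objective concrete, here is a minimal PyTorch sketch of the cycle-consistency term L_C and the adversarial bound in Eq. (4). The choice of an L1 norm and the epsilon inside the log are our assumptions; D_gamma stands for any pair discriminator that maps (p_A, p_B) to a probability:

```python
import torch

def cycle_loss(p_A: torch.Tensor, p_A_cyclic: torch.Tensor) -> torch.Tensor:
    # L_C = E[||p_A - p_A^c||] after the A -> B -> A round trip.
    # An L1 norm is assumed here; the norm order was lost in extraction.
    return (p_A - p_A_cyclic).abs().mean()

def generator_adversarial_loss(D_gamma, p_A: torch.Tensor, p_B_fake: torch.Tensor) -> torch.Tensor:
    # Eq. (4): G_gamma minimizes E[log D_gamma(p_A, p_B)] over pairs whose
    # p_B it generated. The epsilon guards against log(0).
    eps = 1e-8
    return torch.log(D_gamma(p_A, p_B_fake) + eps).mean()
```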
In practice, we condition the generators on both the spectrum and the F0 contour, i.e., G_γ : S_A × p_A → p_B. Here, S_A denotes the source emotion spectrum. Conditioning on the spectrum allows VCGAN to learn the joint relationship between prosodic features. We can show that Eq. (3) is also an upper bound to:

$$\mathbb{E}_{S_A}\left[ \mathrm{KL}\left( P(\mathbf{p}_A \mid S_A) \,\|\, P_\theta(\mathbf{p}_A \mid S_A) \right) \right] \tag{5}$$

Thus, we minimize the distance between conditional densities over F0 contours as opposed to the marginal densities in Eq. (1). The expectation in Eq. (5) averages over the spectral variations.

2.2. Generative Hybrid Architecture

Adversarial training is susceptible to mode collapse, imbalance between the generator and discriminator losses, and the choice of neural network architecture. Keeping this in mind, we model the generated target F0 contour as a smooth and invertible warping of the source F0. Such warpings are also known as diffeomorphisms [18, 19] and can be parameterized by low dimensional embeddings called the momenta [20]. Therefore, our target F0 estimation is a two-step process: first, we estimate the momenta, and second, we modify the source F0 contour via a deterministic warping using the momenta. As a result, our generators can be divided into two blocks: a stochastic component with trainable parameters and a deterministic component with static parameters. Specifically, let m denote the latent momenta. The target F0 is generated by following Alg. 1. The dimensions of the momenta are the same as the F0 contour. The kernel smoothness parameter σ is empirically fixed at 50 to span the F0's range. The warping function can be represented as a recurrent neural network (RNN) because of its iterative nature (Fig. 1). The advantage of this hybrid architecture is to stabilize the target F0 generation. In the absence of any such control mechanism, the F0 contours swing wildly, which causes mode collapse.

Algorithm 1: Warping to generate the target F0 contour given the momenta and source F0 contour

    function GenerateF0(m, p_A)
        Input: momenta m and source F0 p_A
        Output: target F0 p_B
        Set s = 0, [p_B]^0 = p_A, and [m]^0 = m
        while s < S do
            d_{i,j} ← [p_B]^s_i − [p_B]^s_j
            K_{i,j} ← exp(−(d_{i,j})^2 / σ^2)
            [p_B]^{s+1}_i ← [p_B]^s_i + Σ_l K_{i,l} · [m]^s_l
            [m]^{s+1}_i ← [m]^s_i + 2 · Σ_j −(K_{i,j} / σ^2) · d_{i,j} · [m]^s_i · [m]^s_j
            s ← s + 1
        end
        return [p_B]^S

We constrain the generators to sample smoothly varying momenta by adding L_m = E[‖∇m‖] to the loss. We approximate the gradient of the momenta by its first-order difference. The final objective for generator G_γ is given by:

$$
\begin{aligned}
\mathcal{L}_{G_\gamma} = \; & \lambda_c\, \mathbb{E}\left[ \|\mathbf{p}_A - \mathbf{p}_A^c\| \right] + \lambda_m\, \mathbb{E}\left[ \|\nabla \mathbf{m}\| \right] \\
& + (1 - \lambda_c - \lambda_m)\, \mathbb{E}_{(S_A, \mathbf{p}_A)}\left[ \mathbb{E}_{\mathbf{p}_B \sim P_\gamma}\left[ \log\left( D_\gamma(\mathbf{p}_A, \mathbf{p}_B) \right) \right] \right]
\end{aligned} \tag{6}
$$

To update the parameters of the stochastic part of the generator network, the gradient back-propagates through the deterministic block, which is implemented as matrix-vector operations.
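Below is a NumPy sketch of the deterministic warping block in Alg. 1. The unrolling depth num_steps is an assumption (the loop bound S was lost in extraction), as is the squared-sigma kernel normalization reconstructed above; sigma = 50 follows the paper:

```python
import numpy as np

def generate_f0(momenta: np.ndarray, p_A: np.ndarray,
                sigma: float = 50.0, num_steps: int = 10) -> np.ndarray:
    """Warp the source F0 contour p_A (shape [T]) with the predicted momenta
    (shape [T]) via the iterated Gaussian-kernel updates of Alg. 1."""
    p_B = p_A.astype(float).copy()
    m = momenta.astype(float).copy()
    for _ in range(num_steps):
        d = p_B[:, None] - p_B[None, :]             # pairwise differences d_ij
        K = np.exp(-(d ** 2) / sigma ** 2)          # Gaussian kernel K_ij
        p_B = p_B + K @ m                           # point update: p_i += sum_l K_il m_l
        # momentum update: m_i += 2 * sum_j -(K_ij / sigma^2) d_ij m_i m_j
        m = m + 2.0 * np.sum(-(K / sigma ** 2) * d * (m[:, None] * m[None, :]), axis=1)
    return p_B
```

Because every step is a differentiable matrix-vector operation, gradients can flow through this block back into the stochastic network that predicts the momenta.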
Similar to [21], we model the ratio term in Eq. (3) by a discriminator, denoted by D_γ, that distinguishes between the joint distributions of p_A and p_B learned by G_γ and G_θ, respectively. During training of the discriminator D_γ, we minimize:

$$
\begin{aligned}
\mathcal{L}_{D_\gamma} = & -\mathbb{E}_{(S_A, \mathbf{p}_A)}\left[ \mathbb{E}_{\mathbf{p}_B \sim P_\gamma}\left[ \log\left( D_\gamma(\mathbf{p}_A, \mathbf{p}_B) \right) \right] \right] \\
& -\mathbb{E}_{(S_B, \mathbf{p}_B)}\left[ \mathbb{E}_{\mathbf{p}_A \sim P_\theta}\left[ \log\left( 1 - D_\gamma(\mathbf{p}_A, \mathbf{p}_B) \right) \right] \right]
\end{aligned} \tag{7}
$$

Similar discriminators have been proposed in [22, 23] to train autoencoders in an adversarial setting. We use this discriminator in VCGAN to establish a macro connection between the two generators. In fact, the optimal discriminators train the corresponding generators to minimize the Jensen-Shannon divergence between P_γ(p_A, p_B) and P_θ(p_A, p_B) [23].

We use 23-dimensional MFCC features as the spectrum representation over a context of 128 frames, extracted using a 5 ms long window. The dimensionality of the F0 contour is 128×1, while that of the spectrum is 128×23. The momenta variable has the same dimensionality as the F0, i.e., 128×1. The hyperparameters λ_c and λ_m were set to equal values; additional materials are available at https://engineering.jhu.edu/nsa/links/.
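For completeness, a sketch of the discriminator update in Eq. (7), under the same assumptions as the generator sketch above (D_gamma returns a probability, epsilon added for numerical safety):

```python
import torch

def discriminator_loss(D_gamma,
                       p_A_real: torch.Tensor, p_B_from_G_gamma: torch.Tensor,
                       p_A_from_G_theta: torch.Tensor, p_B_real: torch.Tensor) -> torch.Tensor:
    # Eq. (7): push D_gamma toward 1 on pairs whose p_B came from G_gamma,
    # and toward 0 on pairs whose p_A came from G_theta.
    eps = 1e-8
    loss_gamma = -torch.log(D_gamma(p_A_real, p_B_from_G_gamma) + eps).mean()
    loss_theta = -torch.log(1.0 - D_gamma(p_A_from_G_theta, p_B_real) + eps).mean()
    return loss_gamma + loss_theta
```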
3. Experiments and Results
We evaluate our VCGAN model against three baselines via crowd-sourcing on Amazon Mechanical Turk (AMT). Here, we play both the neutral speech and the converted speech. The listener is asked to classify the emotion in the converted speech and rate its quality on a scale from 1 to 5. We randomize the samples to weed out any non-diligent workers and identify bots.
We use the VESUS dataset [15] to carry out the experiments in this paper. VESUS contains a set of 250 utterances spoken by 10 actors in multiple emotions. We train one model for each pair of emotions, i.e., neutral to angry, neutral to happy, and neutral to sad. The dataset also comes with human emotional ratings by 10 AMT workers. These ratings denote the ratio of AMT workers who correctly identify the intended emotion in a recorded utterance. For robustness, we only use utterances that are correctly rated as emotional by at least 50% of the total workers (see the sketch after the list below). The number of utterances for each emotion pair is:

• Neutral to Angry conversion: 1534 for training, 72 for validation, and 61 for testing.
• Neutral to Happy conversion: 790 for training, 43 for validation, and 43 for testing.
• Neutral to Sad conversion: 1449 for training, 75 for validation, and 63 for testing.
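The ≥50% saliency filter above amounts to a one-line threshold. A hypothetical sketch, assuming each utterance record carries the fraction of AMT workers who identified its intended emotion under a field named rating:

```python
def filter_salient(utterances):
    """Keep VESUS utterances whose intended emotion was identified by at
    least half of the 10 AMT raters. The 'rating' field name is
    hypothetical; it holds the fraction of raters who answered correctly."""
    return [u for u in utterances if u["rating"] >= 0.5]
```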
The first baseline is the GMM-based joint modeling approach of [9]. This algorithm learns a mixture model in the joint space of source and target F0 and spectral features. During inference, a global variance constraint generates non-smooth target features using maximum likelihood. One caveat is that the GMM fails to generate intelligible speech when trained across multiple speakers. As a result, our GMM results are based on training single-speaker models and averaging the results across them. All other methods are trained on the full multi-speaker data.

The second baseline is the Bi-LSTM approach of [10]. This method parameterizes the F0 and the energy contours using a wavelet transform. Following the authors' strategy, we pre-train the model on a voice conversion dataset [26]. It is then fine-tuned for emotion conversion on the VESUS dataset.
Figure 2: Confidence of emotion conversion (top) and quality of reconstructed speech (bottom) on the VESUS dataset. Markers ∗ above the bars denote statistical significance under a two-sample t-test.
Figure 3: Confidence of emotion conversion (top) and quality of reconstructed speech (bottom) on Wavenet speech. Markers ∗ above the bars denote statistical significance under a two-sample t-test.

The third baseline is the unsupervised cycle-GAN proposed by [14]. It modifies the spectrum and F0 contour using two separate cycle-GANs. As described in [14], a wavelet transform is applied to the F0 contour to expand its dimensionality.
Fig. 2 shows the results of our multi-speaker emotion conversion on the VESUS dataset. Our proposed VCGAN outperforms the baselines on two emotion pairs, namely, neutral to angry and neutral to sad. The cycle-GAN comes a close second, ahead of the GMM and Bi-LSTM models. This shows that generative models contain the needed flexibility for this task. Note that the GMM's emotion saliency is close to or better than that of the Bi-LSTM, largely because we train a separate model for each speaker. The poor saliency ratings for the Bi-LSTM likely reflect the difficulty of training recurrent architectures on small datasets. VCGAN performs slightly worse than the cycle-GAN for neutral to happy conversion due to the smaller number of samples for training. For the other two emotion pairs, our hybrid generative approach outperforms the baselines by learning the complex relationship between the spectrum and the F0 contour.

VCGAN does extremely well in retaining the quality of speech after conversion, which is evident from the mean opinion scores (MOS) shown in Fig. 2. This is mainly because the prediction of the F0 contour is conditioned on the spectrum, which allows the generator to exploit the harmonicity present in the spectrum. The Bi-LSTM method has the best MOS among the baseline algorithms because, empirically, it does not change the utterance but merely copies the source features to the output.

Figure 4: Comparison of F0 estimation by the proposed model and the Cycle-GAN on VESUS parallel utterances. Markers ∗ above the bars denote statistical significance under a two-sample t-test.

To simulate an unseen speaker, we generate 100 neutral utterances using Wavenet [2]. We then apply the models learned on the VESUS dataset without any fine-tuning. We omit the GMM, since it can only be trained on single speakers, and we do not have access to emotional Wavenet utterances. Fig. 3 illustrates the results of this experiment. As seen, the Bi-LSTM does just as poorly on an unseen speaker as on the VESUS dataset. Empirically, we observe that the Bi-LSTM output resembles a distorted identity mapping. While the cycle-GAN largely retains its performance, it achieves a lower emotional saliency than our model in all cases. This is because the smooth warping between the source and target F0 automatically adjusts to the frequency range of a new speaker.

Both the cycle-GAN and VCGAN exhibit a decrease in MOS for the Wavenet utterances. Here, the minimal conversion allows the Bi-LSTM to produce more natural sounding speech for neutral to angry conversion. Nonetheless, our method comes in a close second. Taken together, we can conclude that there is a trade-off between emotion saliency and speech quality; VCGAN balances it better than the baselines.
The parallel utterances in VESUS allow us to objectively measure the difference between the converted and real F0 contours. We compare our approach to the cycle-GAN model, which is also a non-parallel technique. As seen in Fig. 4, our approach has a lower mean absolute error for all three emotion pairs. This indicates that intonations might have a unique trend for each emotion, which VCGAN can exploit better than the cycle-GAN.
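The objective comparison in Fig. 4 reduces to a mean absolute error between converted and ground-truth F0 contours over the parallel utterances. A short sketch; restricting the error to voiced (nonzero) frames is our assumption, since F0 is undefined in unvoiced regions:

```python
import numpy as np

def f0_mae(f0_converted: np.ndarray, f0_target: np.ndarray) -> float:
    """Mean absolute error between converted and real F0 contours,
    computed only where both contours are voiced (nonzero)."""
    voiced = (f0_converted > 0) & (f0_target > 0)
    return float(np.abs(f0_converted[voiced] - f0_target[voiced]).mean())
```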
4. Conclusions
We proposed a novel approach to train a pair of GANs in a cyclic schema by comparing their induced joint densities. The GAN generators were composed of a trainable and a static component. The trainable component generated a latent embedding called the momenta, which was then used by the fixed component to warp the source F0 contour. Our model is both objectively and subjectively superior to the existing state-of-the-art methods. It achieves a good balance between emotion saliency and reconstruction quality. The novel loss function used for training the generators helps unfold the complex relationship between the spectrum and F0. Further, the deformation-based modeling of the target F0 contour makes it robust to new unseen speakers.

5. References

[1] D. Schacter, D. T. Gilbert, and D. M. Wegner, Psychology (2nd Edition). Worth Publications, 2011.
[2] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.
[3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions," CoRR, vol. abs/1712.05884, 2017.
[4] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in Proc. ICML 2018, J. G. Dy and A. Krause, Eds., vol. 80. PMLR, 2018, pp. 2415–2424.
[5] Y. Wang, R. J. Skerry-Ryan, Y. Xiao, D. Stanton, J. Shor, E. Battenberg, R. Clark, and R. A. Saurous, "Uncovering latent style factors for expressive speech synthesis," CoRR, vol. abs/1711.00520, 2017.
[6] E. Battenberg, S. Mariooryad, D. Stanton, R. Skerry-Ryan, M. Shannon, D. Kao, and T. Bagby, "Effective use of variational embedding capacity in expressive end-to-end speech synthesis," CoRR, vol. abs/1906.03402, 2019.
[7] Y. Ning, S. He, Z. Wu, C. Xing, and L.-J. Zhang, "A review of deep learning based speech synthesis," Applied Sciences, vol. 9, p. 4050, 2019.
[8] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Communication, vol. 40, no. 1, pp. 227–256, 2003.
[9] R. Aihara, R. Takashima, T. Takiguchi, and Y. Ariki, "GMM-based emotional voice conversion using spectrum and prosody features," American Journal of Signal Processing, vol. 2, pp. 134–138, 2012.
[10] H. Ming, D.-Y. Huang, L. Xie, J. Wu, M. Dong, and H. Li, "Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion," in Proc. Interspeech 2016, 2016, pp. 2453–2457.
[11] R. Shankar, J. Sager, and A. Venkataraman, "A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective," in Proc. Interspeech 2019, 2019, pp. 2848–2852.
[12] R. Shankar, H.-W. Hsieh, N. Charon, and A. Venkataraman, "Automated Emotion Morphing in Speech Based on Diffeomorphic Curve Registration and Highway Networks," in Proc. Interspeech 2019, 2019, pp. 4499–4503.
[13] J. Gao, D. Chakraborty, H. Tembine, and O. Olaleye, "Nonparallel Emotional Speech Conversion," in Proc. Interspeech 2019, 2019, pp. 2858–2862.
[14] K. Zhou, B. Sisman, and H. Li, "Transforming spectrum and prosody for emotional voice conversion with non-parallel training data," CoRR, vol. abs/2002.00198, 2020.
[15] J. Sager, R. Shankar, J. Reinhold, and A. Venkataraman, "VESUS: A Crowd-Annotated Database to Study Emotion Production and Perception in Spoken English," in Proc. Interspeech 2019, 2019, pp. 316–320.
[16] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial networks," CoRR, vol. abs/1406.2661, 2014.
[17] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV 2017. IEEE Computer Society, 2017, pp. 2242–2251.
[18] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, pp. 139–157, 2005.
[19] S. C. Joshi and M. I. Miller, "Landmark matching via large deformation diffeomorphisms," IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1357–1370, 2000.
[20] H.-W. Hsieh and N. Charon, "Diffeomorphic registration of discrete geometric distributions," CoRR, vol. abs/1801.09778, 2018.
[21] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Proc. NIPS 2016, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, Eds., 2016, pp. 2172–2180.
[22] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. A. Sutton, "VEEGAN: Reducing mode collapse in GANs using implicit variational learning," in Proc. NIPS 2017, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., 2017, pp. 3308–3318.
[23] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. C. Courville, "Adversarially learned inference," CoRR, vol. abs/1606.00704, 2016.
[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2015.
[25] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," CoRR, vol. abs/1711.11293, 2017.
[26] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.