Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network
Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman
Department of Electrical and Computer Engineering, Johns Hopkins University
Department of Applied Mathematics and Statistics, Johns Hopkins University
[email protected], {hsieh,charon}@cis.jhu.edu, [email protected]

Abstract
We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture. The encoder constructs a latent embedding of the fundamental frequency (F0) contour and the spectrum, which we regularize using the Large Deformation Diffeomorphic Metric Mapping (LDDMM) registration framework. The decoder uses this embedding to predict the modified F0 contour for a target emotional class. Finally, the predictor uses the original spectrum and the modified F0 contour to generate a corresponding target spectrum. Our joint objective function simultaneously optimizes the parameters of the three model blocks. We show that our method outperforms existing state-of-the-art approaches on both the saliency of emotion conversion and the quality of the resynthesized speech. In addition, the LDDMM regularization allows our model to convert phrases that were not present in training, thus providing evidence of out-of-sample generalization.
Index Terms: Emotion Conversion, Latent Variable Regularization, Crowd Sourcing, Quality Score
1. Introduction
Automated speech synthesis has radically transformed our interaction with machines. It is used in assistive technologies, such as screen readers for the visually impaired, and in hands-free devices, such as Amazon's Echo. Emotional speech synthesis is the next milestone in this domain [1, 2]. For example, emotional machines can be deployed in call centers, where customer frustration is a regular occurrence, and they can provide a better foundation for virtual companions for the elderly or impaired.

The quality of machine-generated speech has improved phenomenally in the last decade, largely due to the representational power of deep neural networks [3, 4, 5], which are trained on hundreds of hours of transcribed human speech. However, controlling the expressiveness of synthetic speech remains an open challenge. Recent works in emotional speech synthesis include [6], which generates singing voice conditioned on the input rhythm, pitch and linguistic features. A disentangled model for style and content is proposed by [7, 8] to infer the latent representations responsible for expressiveness. While these models represent seminal contributions to emotional speech synthesis, their latent representations are learned in an unsupervised manner, which makes it difficult for the user to control the output emotion. Another problem is the slow rate of speech generation due to the auto-regressive nature of these models [9]. These challenges motivate the study of emotion conversion as an alternative to end-to-end synthesis approaches. Notably, emotion conversion methods provide controllability over the generated affect, they require much less data to train, and their processing speed is high enough for real-time applications.

Several interesting approaches for emotion conversion have been proposed in the recent past. For example, the work of [10] uses a Gaussian mixture model with a global variance constraint (GMM-GV) to modify the fundamental frequency (F0) contour and the spectrum. A bidirectional long short-term memory (Bi-LSTM) based architecture has been proposed by [11] to estimate the F0 contour and the spectral features of the target emotional utterance. Another approach [12] converts the pitch contour and energy contour of the source utterance using a highway neural network that maximizes the error log likelihood in an expectation-maximization scheme. The same authors further proposed a curve registration based method [13] to modify only the F0 contour. Finally, a cycle-consistent generative adversarial network (cycle-GAN) proposed by [14] learns to sample the pitch contour and the spectrum from the target emotional class in an unsupervised manner. While these methods have been successful in single-speaker settings, many of them fail on multi-speaker datasets due to the larger overlap of F0 and spectral features between emotional classes.

In this paper we propose a novel approach to model the relationship between the F0 contour and the spectral features, deriving it from basic knowledge of these two representations. Furthermore, unlike other existing methods, our chained estimation also minimizes the mismatch between F0 and the corresponding spectral harmonics. Our second contribution is to implicitly model the target pitch contour as a smooth and invertible warping of the source F0 contour. This is done by learning a latent embedding based on the Large Deformation Diffeomorphic Metric Mapping (LDDMM) [15, 16] framework. In essence, the embedding serves as an intermediary between the source and target emotions.
We demonstrate that imposing this constraint significantly improves the prediction of the pitch contour.

Our architecture consists of three separate convolutional neural networks for predicting the embedding, the pitch contour, and the spectrum, respectively. These networks are trained in an end-to-end fashion from a unified objective function. We compare our model against three state-of-the-art baseline methods using the multi-speaker VESUS dataset [17]. We further demonstrate that our model performs well on sentences that are not part of the training set, establishing its generalization capability. Finally, in addition to emotion conversion, we show that the proposed model generates better quality speech than baselines from both the supervised and unsupervised domains.
2. Method
Our novel method uses a chained encoder-decoder-predictor network architecture to modify both the spectrum and the F0 contour of an utterance. The three components of the architecture are jointly optimized through a unified loss function.

Figure 1: Graphical model of our emotion conversion strategy. m_AB is the intermediary between the emotion classes.

Fig. 1 describes the relationship between the random variables in our model. We use the WORLD vocoder [18, 19] for the analysis and synthesis of speech. Given a source-target pair of emotional utterances, denoted by U_A and U_B respectively, the source utterance is decomposed into its components: the spectrum (S_A) and the F0 contour (p_A).
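The paper does not reproduce its feature-extraction code; as a rough illustration of this decomposition, the sketch below uses the open-source pyworld bindings for the WORLD vocoder with a 5 ms frame period. The file names and parameter choices are placeholders, not the authors' pipeline.

```python
# Sketch of WORLD-based analysis/resynthesis with the pyworld bindings;
# file names and settings are placeholders, not the authors' code.
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("source_utterance.wav")          # hypothetical input file
x = np.ascontiguousarray(x, dtype=np.float64)    # WORLD expects float64

# Decompose the utterance into the F0 contour (p_A) and spectrum (S_A).
f0, t = pw.harvest(x, fs, frame_period=5.0)      # F0 contour at a 5 ms hop
sp = pw.cheaptrick(x, f0, t, fs)                 # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)                        # aperiodicity (left unchanged)

# After the network predicts the target F0 contour and spectrum, WORLD
# resynthesizes the converted utterance from the modified features.
y = pw.synthesize(f0, sp, ap, fs, frame_period=5.0)
sf.write("converted_utterance.wav", y, fs)
```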
These components allow us to estimate an intermediate parameter, known as the momenta (m_AB). From here, the target F0 contour (p_B) is modeled as a function of the source F0 contour (p_A) and the momenta (m_AB). Next, we estimate the target spectrum (S_B) given the target F0 contour (p_B) and the source spectrum (S_A). Finally, the estimated variables are used to synthesize the target emotion utterance. The joint distribution shown in Fig. 1 factorizes as:

P(p_A, S_A, m_{AB}, p_B, S_B) = P(p_A) \times P(S_A \mid p_A) \times P(m_{AB} \mid p_A, S_A) \times P(p_B \mid p_A, m_{AB}) \times P(S_B \mid S_A, p_B)    (1)

We use an explicit prior on the latent variable to improve the prediction of F0 and spectrum. Specifically, we model the target F0 contour as a smooth and invertible deformation of the source F0 contour. The idea of smooth deformations has been used extensively for images [20], but here we use it for 2-D curves. Mathematically, let p_A^t and p_B^t denote a pair of source and target F0 contours, respectively. The variable t corresponds to the location of the analysis window as it moves across a given speech utterance. The objective of this deformation process is to estimate a series of small vertical displacements v_t(x; s) [15] over frequency and time. The variable s \in [0, 1] controls the evolution of these small displacements in the discrete setting. The registration problem can thus be formulated as:

\min_{v \in V} \int_0^1 \| v_t(\cdot\,; s) \|_V^2 \, ds + \lambda \sum_{t=1}^{T} \| \phi_{v_t}(p_A^t; 1) - p_B^t \|^2    (2)

Here, \| \cdot \|_V denotes the Hilbert norm, which is implicitly defined in our case by a Gaussian kernel. The variable \phi_{v_t} denotes the net displacement field, i.e., \phi_{v_t} = \int_0^1 v_t(\cdot\,; s) \, ds. Further, it has been theoretically shown in [21, 22] that the objective in Eq. (2) can be reformulated in terms of variables m_t, known as the initial momenta, according to:

\Gamma(m) = \frac{1}{2} \sum_{i,j=1}^{T} \gamma_{ij} \, m_i m_j + \lambda \sum_{t=1}^{T} \| \phi_{v_t}(p_A^t; 1) - p_B^t \|^2    (3)

The variable \gamma_{ij} is an exponential smoothing kernel evaluated on pairs of time points of the source contour p_A^t.

During training, we solve Eq. (3) for every pair of source and target F0 contours to generate the ground-truth momenta. This variable summarizes the transformation between emotion pairs. Since the momenta and the source F0 contour uniquely specify the transformation, we use the momenta as an intermediary between any given pair of utterances. In comparison, [13] predicts a momentum for every frame of the pitch contour and then warps it over several iterations specified by the variable s. This is a sub-optimal strategy, as there is no temporal coherence constraint in predicting the momenta. Note that we do not have access to the ground-truth momenta during testing and run the network in an open-loop fashion without intermediate regularization.

Current methods in emotion conversion modify the F0 and spectrum without imposing any explicit relationship between the features. As a result, significant residual harmonics remain in the spectrum, which degrades the quality of the resynthesized speech.
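A full LDDMM solver integrates the deformation over the auxiliary variable s; as a simplified illustration of the kernel-based energy in Eq. (3), the sketch below uses a single Euler step of the flow. The Gaussian kernel width, trade-off λ, and optimizer settings are assumptions for illustration, not values from the paper.

```python
# Simplified, single-step illustration of the momenta-based registration
# energy in Eq. (3); a faithful LDDMM solver integrates the flow over s.
import numpy as np

def gaussian_kernel(times, sigma):
    """Kernel matrix gamma_ij evaluated on pairs of analysis-frame times."""
    d = times[:, None] - times[None, :]
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))

def fit_momenta(p_A, p_B, times, sigma=0.05, lam=10.0, lr=1e-3, iters=500):
    """Gradient descent on 0.5 * m^T K m + lam * ||(p_A + K m) - p_B||^2.

    The returned vector plays the role of the ground-truth momenta m_AB
    that the encoder is later trained to predict.
    """
    K = gaussian_kernel(times, sigma)
    m = np.zeros_like(p_A)
    for _ in range(iters):
        resid = (p_A + K @ m) - p_B            # one-step approximation of the warp
        grad = K @ m + 2.0 * lam * (K @ resid)  # gradient of the energy above
        m -= lr * grad
    return m
```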
Our approach overcomes this limitation via the conditional relationships modeled in Fig. 1. Here, the conditional spectrum estimate is given by:

\hat{S}_B = \arg\max_{S_B} P(S_B \mid S_A, p_A)    (4)

Using the rules of probability, we can rewrite Eq. (4) as:

\hat{S}_B = \arg\max_{S_B} \int_{p_B} P(S_B, p_B \mid S_A, p_A) \, dp_B
= \arg\max_{S_B} \int_{p_B} P(S_B \mid S_A, p_B) \, P(p_B \mid S_A, p_A) \, dp_B
= \arg\max_{S_B} \int_{p_B} P(S_B \mid S_A, p_B) \int_{m_{AB}} P(p_B \mid m_{AB}, p_A) \, P(m_{AB} \mid S_A, p_A) \, dm_{AB} \, dp_B
= \arg\max_{S_B} \int_{m_{AB}} P(m_{AB} \mid S_A, p_A) \int_{p_B} P(p_B \mid m_{AB}, p_A) \, P(S_B \mid S_A, p_B) \, dp_B \, dm_{AB},

where we have used Eq. (1) to derive the above expression. The first term, P(m_AB | S_A, p_A), is the probability density of the intermediate latent representation, i.e., the momenta, conditioned on both the source F0 contour and the source spectrum. The second term, P(p_B | m_AB, p_A), is the density over the target F0 contour given the momenta and the source F0 contour. Finally, P(S_B | S_A, p_B) is the density of the target spectrum conditioned on the target pitch contour and the source spectrum. Note that the expression requires multiple integrations and is, hence, intractable. However, we can make point estimates for each density function using deep convolutional neural networks [23] (CNNs), which allows us to write:

\hat{m}_{AB} = \arg\max_{m_{AB}} P(m_{AB} \mid S_A, p_A; \theta_e)
\hat{p}_B = \arg\max_{p_B} P(p_B \mid \hat{m}_{AB}, p_A; \theta_d)
\hat{S}_B = \arg\max_{S_B} P(S_B \mid S_A, \hat{p}_B; \theta_p)    (5)

The CNN approximating P(m_AB | S_A, p_A; θ_e) is called the encoder because it distills information about the input data. The CNN modeling P(p_B | m_AB, p_A; θ_d) is called the decoder because it estimates the output pitch from the latent embedding and the source pitch contour. The encoder-decoder portion is a basic sequence-to-sequence model for pitch contours. Finally, the CNN modeling P(S_B | S_A, p_B; θ_p) is called the predictor, as it generates the spectrum for the converted speech.

The architecture of these CNNs is shown in Fig. 2. We adapt the architecture from [24] by reducing the number of residual layers in each block. The entire sequence of three neural networks is trained together from a unified objective. The loss function for optimizing the parameters is given by:

\mathcal{L} = -\log \big( P(m_{AB}, p_B, S_B \mid S_A, p_A; \theta_e, \theta_d, \theta_p) \big) = \lambda_e \| \hat{m}_{AB} - \bar{m}_{AB} \|_1 + \lambda_d \| \hat{p}_B - \bar{p}_B \|_1 + \lambda_p \| \hat{S}_B - \bar{S}_B \|_1    (6)
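A minimal sketch of the chained point estimates in Eq. (5) follows; the three modules are simple placeholder CNNs rather than the exact networks of Fig. 2, and the channel dimensions assume the 23-dimensional MFCC spectrum and one-dimensional F0 features described later.

```python
# Minimal sketch of the chained point estimates in Eq. (5); the modules are
# placeholders, not the exact encoder/decoder/predictor networks of Fig. 2.
import torch
import torch.nn as nn

class PlaceholderCNN(nn.Module):
    """Stand-in for one of the three CNNs (the real blocks use gated
    convolutions, instance norm, residual links, and pixel shuffling)."""
    def __init__(self, in_ch, out_ch, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, out_ch, kernel_size=5, padding=2),
        )
    def forward(self, x):
        return self.net(x)

encoder   = PlaceholderCNN(in_ch=24, out_ch=1)    # (S_A, p_A) -> m_AB
decoder   = PlaceholderCNN(in_ch=2,  out_ch=1)    # (m_AB, p_A) -> p_B
predictor = PlaceholderCNN(in_ch=24, out_ch=23)   # (S_A, p_B) -> S_B

def convert(p_A, S_A):
    """Open-loop chain: p_A is (B, 1, T) and S_A is (B, 23, T)."""
    m_AB = encoder(torch.cat([S_A, p_A], dim=1))
    p_B  = decoder(torch.cat([m_AB, p_A], dim=1))
    S_B  = predictor(torch.cat([S_A, p_B], dim=1))
    return m_AB, p_B, S_B
```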
Figure 2:
Block model representation of the encoder-decoder-predictor. Encoder and decoder use the same architecture, whereas the predictor has an extra residual block. GLU in the model stands for the gated linear unit. We use instance normalization due to the small mini-batch size, and pixel shuffling for up-sampling. The size and number of kernels are indicated below each convolution block.
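The sketch below illustrates the gated-convolution, instance-normalization, residual, and pixel-shuffle pattern summarized in the caption; kernel sizes and channel counts are placeholders, since the figure's exact values are not reproduced here.

```python
# Building blocks matching the pattern described in the Fig. 2 caption;
# channel counts and kernel sizes here are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv1d(nn.Module):
    """Convolve -> InstanceNorm -> GLU, the repeating unit in Fig. 2."""
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, k, padding=k // 2)
        self.norm = nn.InstanceNorm1d(2 * out_ch)   # suited to tiny mini-batches
    def forward(self, x):
        return F.glu(self.norm(self.conv(x)), dim=1)

class ResidualBlock(nn.Module):
    """Two gated convolutions with a skip connection ("Sum" nodes in Fig. 2)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(GatedConv1d(ch, ch), GatedConv1d(ch, ch))
    def forward(self, x):
        return x + self.body(x)

def pixel_shuffle_1d(x, r):
    """1-D pixel shuffling for up-sampling: trade channels for time steps."""
    b, c, t = x.shape
    return x.view(b, c // r, r, t).permute(0, 1, 3, 2).reshape(b, c // r, t * r)
```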
During training, we minimize the negative log likelihood of the momenta and the target features with respect to the parameters θ = (θ_e, θ_d, θ_p). We model the conditional distribution of each variable with a Laplace density function: the corresponding ground truths (m̄_AB, p̄_B, S̄_B) are used as the means, while the variances are assumed to be constant. This is equivalent to minimizing the mean absolute error of each target variable with an appropriate scaling, defined by λ_e, λ_d and λ_p, which are the hyperparameters of our model. One benefit of coupling the neural networks is that the encoder and the decoder become aware of the downstream task of spectrum prediction. We train the neural network [25] using the Adam optimizer [26] with a learning rate of 1e-5 and a mini-batch of size one. 23-dimensional MFCC features are used as the spectrum representation, extracted with an analysis window of length 5 ms. During training, the context size is fixed at 640 ms, which corresponds to 128 analysis frames for the F0 contour and the spectrum. The dimensions of the momenta are the same as those of the F0 contour. The hyperparameters λ_e, λ_d and λ_p are set to 0.01, 1e-4 and 1e-4, respectively. We do not normalize the input and output features during training, in order to preserve their scale. Code can be downloaded from: https://engineering.jhu.edu/nsa/links/.
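Continuing the placeholder modules sketched earlier, one training step under the stated settings (weighted L1 losses with λ_e = 0.01, λ_d = λ_p = 1e-4, Adam with a learning rate of 1e-5, mini-batch of one) might look as follows; this is a sketch under those assumptions, not the released implementation.

```python
# One training step under the reported settings; encoder, decoder, predictor,
# and convert() refer to the earlier placeholder sketch, not the real model.
import itertools
import torch
import torch.nn.functional as F

lam_e, lam_d, lam_p = 0.01, 1e-4, 1e-4
params = itertools.chain(encoder.parameters(),
                         decoder.parameters(),
                         predictor.parameters())
optimizer = torch.optim.Adam(params, lr=1e-5)

def training_step(p_A, S_A, m_AB_true, p_B_true, S_B_true):
    """Weighted L1 (mean absolute error) objective of Eq. (6)."""
    m_AB, p_B, S_B = convert(p_A, S_A)            # chained forward pass
    loss = (lam_e * F.l1_loss(m_AB, m_AB_true)    # encoder (momenta) term
            + lam_d * F.l1_loss(p_B, p_B_true)    # decoder (F0) term
            + lam_p * F.l1_loss(S_B, S_B_true))   # predictor (spectrum) term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```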
3. Experiments and Results
We carry out an ablation study for the momenta m_AB and a qualitative evaluation of emotional salience and quality. We evaluate our algorithm on the VESUS dataset [17] collected at Johns Hopkins University. VESUS contains 250 parallel utterances spoken by 10 actors (gender balanced) in neutral, sad, angry and happy emotional classes. Each spoken utterance has a crowd-sourced emotional saliency rating provided by 10 workers on Amazon Mechanical Turk (AMT). These ratings represent the ratio of workers who correctly identify the intended emotion in a recorded utterance. For robustness, we restrict our experiments to utterances that were correctly and consistently rated as emotional by at least 5 of the 10 AMT workers. As a result, the total number of utterances used is as follows:
• Neutral to Angry conversion: 1534 utterances for training, 72 for validation, and 61 for testing.
• Neutral to Happy conversion: 790 utterances for training, 43 for validation, and 43 for testing.
• Neutral to Sad conversion: 1449 utterances for training, 75 for validation, and 63 for testing.

Our subjective evaluation includes both an emotion perception test and a quality assessment test. These experiments are carried out on Amazon Mechanical Turk (AMT); each pair of speech utterances is rated by 5 workers. The perception test asks the raters to identify the emotion in the converted speech sample, and the quality assessment test asks them to rate the quality of the speech sample on a scale of 1 to 5. We include both the neutral and converted utterances to account for speaker bias. Further, the samples were randomized to mitigate the effects of non-diligent raters and to identify bots.
We compare our encoder-decoder-predictor model to three state-of-the-art baseline methods. The first approach learns a Gaussian mixture model over concatenated source and target features [10]. During inference, a maximum likelihood estimate of the target features is made given the source features. A global variance constraint ensures that the estimate is not over-smoothed, which is a common problem in joint modeling techniques.

The second baseline is a Bi-LSTM supervised learning approach [11]. Since Bi-LSTMs generally require considerable data to train, we adopt the strategy in [11] of training the model on a voice conversion task [27] and then fine-tuning it for emotion conversion. This method encodes the prosody features via a Wavelet transform to represent both short-term and long-term trajectory information of the F0 and energy contours.

The third baseline is a recently proposed unsupervised method for emotion conversion [14]. This algorithm uses cycle-GANs to inject emotion into neutral utterances. One set of cycle-GANs transforms the spectrum, while the other set transforms the prosody features. Once again, the prosodic features are parameterized using a Wavelet basis, similar to the Bi-LSTM.
As a sanity check, we carry out an ablation study to understand the effect of latent variable regularization via the LDDMM momenta. Fig. 3 shows the resulting mean absolute error in pitch prediction for each emotion pair. As seen, the F0 prediction is statistically significantly better in two of the emotion pairs. Neutral to happy conversion is an exception to this general trend, but we conjecture that this is due to the smaller training dataset (∼800 samples, compared to over 1400 for the other emotion pairs).

Figure 3:
Effect of latent variable regularization on the prediction of fundamental frequency (F0) for each emotion pair. Marker ∗ indicates statistical significance for paired t-test scores.

Figure 4:
Confidence of emotion conversion (top) and the quality of reconstruction (bottom) for VESUS test samples. Asterisks denote p < 0.01.

The regularized model also exhibits tighter error bars than the un-regularized model, indicating that it is more robust.
Fig. 4 illustrates the crowd-sourcing results on the VESUS test dataset. Our proposed method has the highest emotional saliency rating in comparison to the baselines. The GMM did not produce intelligible speech when trained in a multi-speaker setting, as the F0 and spectral features do not exhibit distinct clusters when aggregated across speakers. Hence, the GMM results in Fig. 4 correspond to single-speaker training/testing. We note that this GMM evaluation is unfairly optimistic, and yet its performance is worse than our method and the cycle-GAN. The Bi-LSTM model, which simultaneously predicts the wavelet coefficients for F0 and energy along with the spectrum, has very poor conversion results for angry and happy. It is likely that the Bi-LSTM focuses on a subset of the features to minimize the overall loss. The cycle-GAN, on the other hand, does produce reasonable results even though it is unsupervised. This is likely due to the implicit regularization produced by the cyclic consistency and identity losses [28]. Lastly, our proposed model has the best conversion score for all three emotion pairs and the tightest error bars in comparison to the baselines. Thus, our approach of combining the local and global tasks in a chained model works extremely well, allowing the individual pieces to train efficiently without losing sight of the end goal.

The bottom plot in Fig. 4 shows the subjective quality of speech reconstruction after emotion conversion, measured using the mean opinion score (MOS).
Figure 5:
Confidence of emotion conversion (top) and the quality of reconstruction (bottom) on unseen samples. Asterisks denote p < 0.01.

The chained neural network is uniformly better than the baseline algorithms on the VESUS dataset. This means that the proposed approach not only converts the emotion with a high degree of confidence, but also keeps the quality of speech intact after conversion.
We further conduct an out-of-vocabulary emotion conversion experiment. Here, we set aside 7 randomly selected phrases per speaker from each emotion category. These phrases are excluded from the training set to simulate unseen utterances during testing. Fig. 5 shows the results of this experiment. The GMM results are again based on single-speaker evaluation. Once again, the proposed model has the best conversion performance with narrow error bounds. The Bi-LSTM does worse on unseen utterances, demonstrating a lack of generalization capability. On the other hand, the cycle-GAN degrades only slightly, and its saliency stays above 0.5 for all three emotion pairs. This is mainly due to the non-parallel nature of the cycle-GAN model, which makes no assumption about the speakers or the utterances. Our approach achieves this robustness by not normalizing the input features using cohort statistics. Taken together, conditioning the spectrum estimation on the pitch can capture a complex relationship between the two, which our model exploits efficiently.

The MOS ratings in Fig. 5 show that the Bi-LSTM has the best quality of reconstruction among the three. Empirically, it barely modifies the speech at all, thereby making it sound more natural by default. There is a tie for second place between the cycle-GAN and the proposed model. Our proposed approach has much smaller error bars than the cycle-GAN due to training with un-normalized features and the momenta regularization.
4. Conclusions
We have proposed a novel method for emotion conversion that modifies pitch and spectrum using a chained neural network. Our approach uses a latent variable to regularize the F0 estimation, which in turn affects the spectrum prediction. We showed that placing a diffeomorphic prior on the F0 contour and conditioning the spectrum on it leads to better generalization on unseen utterances. The experiments were carried out on the VESUS dataset, and the results on converted test samples were statistically significant. Finally, our proposed algorithm did not degrade the quality of speech during conversion, thereby demonstrating its all-round performance.

5. References

[1] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Communication, vol. 40, no. 1, pp. 227–256, 2003.
[2] D. Schacter, D. T. Gilbert, and D. M. Wegner, Psychology (2nd Edition). Worth Publications, 2011.
[3] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions," CoRR, vol. abs/1712.05884, 2017.
[5] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," ICASSP, pp. 5891–5895, 2019.
[6] R. Valle, J. Li, R. Prenger, and B. Catanzaro, "Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens," CoRR, vol. abs/1910.11997, 2019.
[7] Y. Wang, R. J. Skerry-Ryan, Y. Xiao, D. Stanton, J. Shor, E. Battenberg, R. Clark, and R. A. Saurous, "Uncovering latent style factors for expressive speech synthesis," CoRR, vol. abs/1711.00520, 2017.
[8] E. Battenberg, S. Mariooryad, D. Stanton, R. Skerry-Ryan, M. Shannon, D. Kao, and T. Bagby, "Effective use of variational embedding capacity in expressive end-to-end speech synthesis," CoRR, vol. abs/1906.03402, 2019.
[9] Y. Ning, S. He, Z. Wu, C. Xing, and L.-J. Zhang, "A review of deep learning based speech synthesis," Applied Sciences, vol. 9, p. 4050, 2019.
[10] R. Aihara, R. Takashima, T. Takiguchi, and Y. Ariki, "GMM-based emotional voice conversion using spectrum and prosody features," American Journal of Signal Processing, vol. 2, pp. 134–138, 2012.
[11] H. Ming, D.-Y. Huang, L. Xie, J. Wu, M. Dong, and H. Li, "Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion," in Proc. Interspeech 2016, 2016, pp. 2453–2457.
[12] R. Shankar, J. Sager, and A. Venkataraman, "A multi-speaker emotion morphing model using highway networks and maximum likelihood objective," in Proc. Interspeech 2019, 2019, pp. 2848–2852.
[13] R. Shankar, H.-W. Hsieh, N. Charon, and A. Venkataraman, "Automated emotion morphing in speech based on diffeomorphic curve registration and highway networks," in Proc. Interspeech 2019, 2019, pp. 4499–4503.
[14] K. Zhou, B. Sisman, and H. Li, "Transforming spectrum and prosody for emotional voice conversion with non-parallel training data," CoRR, vol. abs/2002.00198, 2020.
[15] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, pp. 139–157, 2005.
[16] S. C. Joshi and M. I. Miller, "Landmark matching via large deformation diffeomorphisms," IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1357–1370, 2000.
[17] J. Sager, R. Shankar, J. Reinhold, and A. Venkataraman, "VESUS: A crowd-annotated database to study emotion production and perception in spoken English," in Proc. Interspeech 2019, 2019, pp. 316–320.
[18] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, pp. 187–207, 1999.
[19] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99.D, pp. 1877–1884, 2016.
[20] A. Sotiras, C. Davatzikos, and N. Paragios, "Deformable medical image registration: A survey," IEEE Transactions on Medical Imaging, vol. 32, no. 7, pp. 1153–1190, 2013.
[21] L. Younes, Shapes and Diffeomorphisms. Springer-Verlag Berlin Heidelberg, 2010.
[22] H.-W. Hsieh and N. Charon, "Diffeomorphic registration of discrete geometric distributions," CoRR, vol. abs/1801.09778, 2018.
[23] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proc. ICLR 2016, 2016.
[24] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," CoRR, vol. abs/1711.11293, 2017.
[26] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2015.
[27] J. Kominek and A. W. Black, "The CMU Arctic speech databases," SSW5-2004, 2004.
[28] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017.