Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Katsuki Inoue a, Sunao Hara a, Masanobu Abe a, Nobukatsu Hojo b, Yusuke Ijima b

a Graduate School of Interdisciplinary Science and Engineering in Health Systems, Okayama University, Japan
b NTT Corporation, Japan
Abstract
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, "extrapolating emotional expressions" means borrowing emotional expressions from other speakers, so that collecting emotional speech uttered by the target speaker is unnecessary. Although a DNN has the potential to construct DNN-based TTS with emotional expressions, and some DNN-based TTS systems have demonstrated satisfactory performance in expressing the diversity of human speech, it is necessary and troublesome to collect emotional speech uttered by target speakers. To solve this issue, we propose architectures that separately train the speaker feature and the emotional feature and that synthesize speech with any combination of speaker and emotion qualities. The architectures are the parallel model (PM), serial model (SM), auxiliary input model (AIM), and hybrid models (PM&AIM and SM&AIM). These models are trained on emotional speech uttered by few speakers and neutral speech uttered by many speakers. Objective evaluations demonstrate that the performances in the open-emotion test are insufficient compared with those in the closed-emotion test, because each speaker has their own manner of expressing emotion. However, subjective evaluation results indicate that the proposed models can convey emotional information to some extent. Notably, the PM can correctly convey sad and joyful emotions at a rate of >

Keywords: Emotional speech synthesis, Extrapolation, DNN-based TTS, Text-to-speech, Acoustic model, Phoneme duration model
1. Introduction
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). Text-to-speech (TTS) is a technology that generates speech from text. A variety of TTS methods have been proposed to generate natural, intelligible, and human-like speech. Recently, DNN-based TTS has been intensively investigated, and the results demonstrated that DNN-based TTS can outperform hidden Markov model (HMM)-based TTS in the quality and naturalness of synthesized speech [1, 2, 3, 4]. First, a feed-forward neural network (FFNN) was proposed as a replacement for decision-tree approaches in HMM-based TTS [1]. Subsequently, long short-term memory (LSTM)-based recurrent neural networks (RNNs) were adopted and provided better naturalness and prosody because of their capability to model the long-term dependencies of speech [5].

In addition to quality and naturalness, DNN-based TTS has advantages in its capability to control voice aspects. For example, to control speaker identity, several multi-speaker models have been proposed [6, 7, 8]. To control speaker changes, a method using auxiliary vectors of the voice's gender, age, and identity was proposed [9]. Additionally, a multi-language and multi-speaker model was built by sharing data across languages and speakers [10].

In terms of voice aspects, emotional expression is one of the more important features. Notably, DNN-based TTS has additionally been proposed to synthesize speech with emotions. For example, in the simplest approach, the use of an emotional one-hot vector was proposed in [11]. To control emotional strength, a method that used an auxiliary vector based on listener perception was proposed in [12]. Using a speaker adaptation method, Yang et al. [13] proposed a method that generated emotional speech from a small amount of emotional speech training data.
In all the aforementioned methods, the target speaker's emotional speech is necessary for training. However, in general, it is difficult for individuals to utter speech with a specific emotion and to continue speaking with that emotion for a few hours. In conclusion, recording the target speaker's emotional speech is a bottleneck in constructing DNN-based TTS that can synthesize emotional speech with a particular speaker's voice quality.

To overcome the problem, one possible approach is extrapolation. Emotional expression models are trained using speech uttered by a particular person, and the models are applied to another person to generate emotional speech with that person's voice quality. In other words, a collection of emotional speech uttered by a target speaker is not required, and emotional expression is generated using models trained on the emotional speech of another person. In summary, extrapolation means borrowing emotional models from another individual. Based on this approach, several methods have been proposed, for example, methods in HMM-based TTS that can generate emotional speech under extrapolation conditions. Kanagawa et al. suggested generating speaker-independent transformation matrices using pairs of neutral and target-style speech, and applying these matrices to a neutral-style model of a new speaker [14]. Similarly, Trueba et al. [15] proposed to extrapolate the expressiveness of proven speaking-style models from speakers who utter speech in a neutral speaking style. The proposal included using a constrained structural maximum a posteriori linear regression (CSMAPLR) algorithm [16]. Ohtani et al. proposed an emotion additive model to extrapolate emotional expression for a neutral voice [17].

Preprint submitted to Speech Communication, February 23, 2021.
All the aforementioned methods suggest that the extrapolation of emotional expressions is possible by separately modeling the emotional expressions and the speaker identities.

Based on the extrapolation approach, we propose a novel DNN-based TTS that can synthesize emotional speech. The biggest advantage of the proposed algorithm is that we can synthesize several types of emotional speech with the voice qualities of multiple speakers, even if the target speaker's emotional speech is not included in the training data. A key idea is to explicitly control the speaker factor and the emotional factor, motivated by the success of multi-speaker models [6, 7, 8, 9, 10] and multi-emotional models [11, 12, 13]. Once the factors are trained, by independently controlling them, we can synthesize speech with any combination of a speaker and an emotion. As training data, we have emotional speech, including a neutral speaking style, uttered by few speakers, and only neutral speech uttered by many speakers. The speaker factor must be trained using the neutral speech uttered by each speaker, and the emotional factor must be trained using the speech uttered by few speakers. To achieve this purpose, we examine five types of DNN architectures: the parallel model (PM), serial model (SM), auxiliary input model (AIM), and the hybrid models (PM&AIM and SM&AIM). The PM deals with the emotional and speaker factors in parallel in the output layer. The SM deals with the two factors in serial order in the last hidden layer and the output layer. The AIM deals with the two factors by using auxiliary one-hot vectors. Differing from these simple models, the hybrid models are composed of two pairs: PM and AIM, or SM and AIM. In [18], we reported the extrapolation of emotional expressions in acoustic feature modeling, and evaluated the performance of synthesized speech uttered by only female speakers.
Additionally, in this paper, we investigate the extrapolation of emotional expressions in phoneme duration modeling, and evaluate the performance of synthesized speech uttered by both males and females.

This paper is organized as follows. In Section 2, we provide an overview of DNN-based TTS and introduce expansions to control multiple voice aspects. In Section 3, we describe the proposed DNN architectures. In Section 4, we explain the objective and subjective evaluations. In Section 5, we present our conclusions and suggestions for further research.
2. DNN-based TTS
DNN-based TTS is a method of speech synthesis that uses a DNN to map linguistic features to acoustic features. A DNN-based TTS system comprises text analysis, a phoneme duration model, an acoustic model, and waveform synthesis. The simplest DNN that generates an output vector $y$ from an input vector $x$ is expressed by the following recursive formula:

$$h^{(\ell)} = f^{(\ell)}\left(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\right), \quad 1 \le \ell \le L, \quad h^{(0)} = x, \quad h^{(L)} = y \tag{1}$$

where $h^{(\ell-1)} \in \mathbb{R}^{d_{\ell-1} \times 1}$ is the $d_{\ell-1}$-dimensional output vector of the $(\ell-1)$-th layer and $h^{(\ell)} \in \mathbb{R}^{d_\ell \times 1}$ is the $d_\ell$-dimensional output vector of the $\ell$-th layer. Additionally, $W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ and $b^{(\ell)} \in \mathbb{R}^{d_\ell \times 1}$ are the weight matrix and bias vector from the $(\ell-1)$-th to the $\ell$-th layer, $f^{(\ell)}(\cdot)$ is the activation function of the $\ell$-th layer, and the $L$-th layer is the output layer.

To control voice aspects, for example, speaker identity, speaking style, and emotional expression, the DNN architecture was expanded in two ways: at the input and at the output. Expansion of the input is a common way to control voice aspects and is known as feature embedding. Expansion of the output is a newly proposed way in our research [18], inspired by multitask learning DNNs [3].

The input of the $\ell$-th layer in Eq. (1), $h^{(\ell-1)}$, can be expanded as follows:

$$h^{(\ell)} = f^{(\ell)}\left(W_a^{(\ell)} h_a^{(\ell-1)} + b^{(\ell)}\right) \tag{2}$$

$$h_a^{(\ell-1)} = \begin{bmatrix} h^{(\ell-1)} \\ v_a^{(\ell-1)} \end{bmatrix} \tag{3}$$

$$W_a^{(\ell)} = \begin{bmatrix} W^{(\ell)} & W_a \end{bmatrix} \tag{4}$$

The input vector $h_a^{(\ell-1)} \in \mathbb{R}^{(d_{\ell-1}+d_a) \times 1}$ comprises the input to the $\ell$-th layer $h^{(\ell-1)}$ and the auxiliary vector $v_a^{(\ell-1)} \in \mathbb{R}^{d_a \times 1}$. $W_a^{(\ell)} \in \mathbb{R}^{d_\ell \times (d_{\ell-1}+d_a)}$ is the weight matrix of the $\ell$-th layer.
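As an illustrative sketch only (not the authors' implementation; the layer sizes and the auxiliary vector here are arbitrary assumptions), the recursion of Eq. (1) combined with the input expansion of Eqs. (2)-(4) can be written in NumPy:

```python
import numpy as np

def forward(x, weights, biases, v_a=None):
    """Feed-forward pass of Eq. (1); if an auxiliary vector v_a is
    given, it is concatenated to the input of the first layer as in
    Eqs. (2)-(4) (feature embedding)."""
    h = x
    for ell, (W, b) in enumerate(zip(weights, biases)):
        if ell == 0 and v_a is not None:
            h = np.concatenate([h, v_a])      # Eq. (3): h_a = [h; v_a]
        z = W @ h + b
        # sigmoid in the hidden layers, linear activation at the output layer
        h = z if ell == len(weights) - 1 else 1.0 / (1.0 + np.exp(-z))
    return h

# Hypothetical sizes: 5-dim input, 2-dim auxiliary vector, 8 hidden units, 3 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 5 + 2)), rng.standard_normal((3, 8))]
biases = [np.zeros(8), np.zeros(3)]
y = forward(rng.standard_normal(5), weights, biases, v_a=np.array([1.0, 0.0]))
```

Note that the first weight matrix already has the widened shape of Eq. (4), so the concatenation in Eq. (3) is all that is needed at run time.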
The effects caused by the auxiliary vector are spread throughout the entire model, which results in controlling the factors as a black box. As the auxiliary vector, An et al. [11] applied a one-hot vector that indicated emotions to all layers ($1 \le \ell \le L$) for multi-emotional modeling. Additionally, Wu et al. [7] applied an i-vector [19] to the first hidden layer ($\ell = 1$) for multi-speaker modeling, and Hojo et al. [8] applied a one-hot vector to all layers ($1 \le \ell \le L$) for multi-speaker modeling.

The output of the $\ell$-th layer in Eq. (1), $h^{(\ell)}$, can be expanded as follows:

$$h^{(\ell)} = \begin{bmatrix} h^{(\ell)}_{(1)} & h^{(\ell)}_{(2)} & \cdots & h^{(\ell)}_{(I)} \end{bmatrix} v_a^{(\ell)} \tag{5}$$

$$h^{(\ell)}_{(i)} = f^{(\ell)}_{(i)}\left(W^{(\ell)}_{(i)} h^{(\ell-1)} + b^{(\ell)}_{(i)}\right) \tag{6}$$

The output $h^{(\ell)}_{(i)} \in \mathbb{R}^{d_\ell \times 1}$ that corresponds to the $i$-th factor of the auxiliary vector $v_{a(i)}^{(\ell)} \in \mathbb{R}$ is calculated from the shared input $h^{(\ell-1)}$. The output of the $\ell$-th layer $h^{(\ell)}$ is a weighted sum of $h^{(\ell)}_{(i)}$
[Figure 1 diagrams: (a) the training step and (b) the generating step, each comprising text analysis, one-hot encoding, a DNN duration model and a DNN acoustic model trained with MSE loss, STRAIGHT analysis/synthesis, and phoneme-basis to frame-basis feature transformation; (c) the speech configuration tables: in the training speech, speakers A and C have NEU/JOY/SAD data and speakers B and D have NEU only, whereas the synthesized speech covers NEU/JOY/SAD for all speakers A-D.]
that uses the $i$-th factor of the auxiliary vector $v_{a(i)}^{(\ell)}$ as the weight. When the auxiliary vector $v_a^{(\ell)}$ is a one-hot vector, this formulation is related to DNNs with multitask learning [20], a technique wherein a primary learning task is solved jointly with additional related tasks. In a multitask learning DNN [3], the model has a shared hidden layer $h^{(\ell-1)}$ that can be considered a task-independent transformation. Additionally, the model has multiple output layers corresponding to each task. In the multi-speaker DNN [6] and in emotional speech synthesis by speaker adaptation [13], the model of each speaker has its own output layer; that is, the first task is speaker A, the second task is speaker B, and so on.

Figure 1: The proposed method of emotional speech synthesis in DNN-based TTS. In the table of training data, 〇 indicates data used for training, and — indicates data not used for training. In the table of synthesized speech, 〇 indicates data that the system can synthesize; hatched boxes indicate the extrapolation conditions, and normal boxes indicate the interpolation conditions.
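The output expansion of Eqs. (5) and (6) can be sketched as follows (a minimal NumPy illustration with hypothetical sizes, not the authors' code; with a one-hot auxiliary vector, only the selected branch contributes):

```python
import numpy as np

def expanded_output(h_prev, branches, v_a):
    """Eqs. (5)-(6): each factor i has its own branch (W_i, b_i); the
    layer output is the v_a-weighted sum of the branch outputs
    (linear activation assumed, as in an output layer)."""
    outs = [W @ h_prev + b for W, b in branches]        # Eq. (6)
    return sum(v * o for v, o in zip(v_a, outs))        # Eq. (5)

# Hypothetical setup: 4 hidden units, 2 branches, 3-dim output, zero biases.
rng = np.random.default_rng(1)
branches = [(rng.standard_normal((3, 4)), np.zeros(3)) for _ in range(2)]
h = rng.standard_normal(4)
# One-hot v_a selects the second branch only.
y = expanded_output(h, branches, v_a=np.array([0.0, 1.0]))
```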
3. Proposed method
Figure 1 presents the proposed DNN-based TTS that generates emotional speech by combining the emotional factor and the speaker factor. In the training step presented in Fig. 1 (a), multi-speaker and multi-emotional speech data are used, where the speakers and types of emotions are unbalanced. That is, many speakers utter only neutral speech, and few speakers utter both neutral and emotional speech. To synthesize emotional speech with the voice quality of the speakers who utter only neutral speech, the DNNs must have an architecture that separately trains the emotional factor and the speaker factor by introducing the auxiliary vectors. Details of the architectures are explained in Section 3.3. In Fig. 1 (a), a phoneme duration model and an acoustic model are trained. Both models have the same DNN architecture but different model parameters. In the speech synthesis step presented in Fig. 1 (b), the DNNs generate the target phoneme duration and the target acoustic features by setting the speaker ID and emotion ID via the auxiliary vectors. As shown in Fig. 1 (c), because any combination of speaker ID and emotion ID is possible, we can synthesize emotional speech with the voice quality of the speakers who utter only neutral speech.
Two types of vectors, called the emotion ID and the speaker ID, are used as features to control the emotional expression and the speaker identity. Several methods have been proposed to control speakers or emotions, for example, the one-hot vector [8, 11], i-vector [7], d-vector [21], and x-vector [22]. Because the one-hot vector is simple and intuitive, we adopt it to control emotions and speakers.

The emotion ID $E^{(i)}$ for the $i$-th emotion is defined as $E^{(i)} = \left[ e_1^{(i)}, e_2^{(i)}, \ldots, e_M^{(i)} \right]^\top$, where each value $e_m^{(i)}$ is expressed as follows:

$$e_m^{(i)} = \mathbb{1}_{m=i} \tag{7}$$

where $M$ is the dimension of $E^{(i)}$ and equals the number of emotions in the training data, and $\mathbb{1}_{m=i}$ is 1 if $m = i$ is true, and 0 otherwise. To represent the neutral emotion, the emotion ID is a zero vector.

In the same manner, the speaker ID $S^{(j)}$ for the $j$-th speaker is defined as $S^{(j)} = \left[ s_1^{(j)}, s_2^{(j)}, \ldots, s_N^{(j)} \right]^\top$, where each value $s_n^{(j)}$ is expressed as follows:

$$s_n^{(j)} = \mathbb{1}_{n=j} \tag{8}$$

where $N$ is the dimension of $S^{(j)}$ and equals the number of speakers in the training data.

Table 1: Data configuration for training and evaluation. (In the model training, 〇 indicates that data was included, and — indicates that data was not included. In the evaluation, the hatched box indicates the synthesized speech used for the evaluation. N indicates neutral, J indicates joyful, and S indicates sad.)
Type of synthesized speech for the evaluation experiment (N/J/S per speaker group; corpus α: Female A, Female B, Male A, Male B; corpus β: 12 speakers):
(a) Open-emotion test of female A: Female A 〇 — —; Female B 〇 〇 〇; Male A 〇 〇 〇; Male B 〇 — —; 12 speakers 〇 — —
(b) Open-emotion test of female B: Female A 〇 〇 〇; Female B 〇 — —; Male A 〇 〇 〇; Male B 〇 — —; 12 speakers 〇 — —
(c) Closed-emotion test of female A/B: Female A 〇 〇 〇; Female B 〇 〇 〇; Male A 〇 〇 〇; Male B 〇 — —; 12 speakers 〇 — —
(d) Speaker-and-emotion-dependent (SED) test of female A's joyful: Female A — 〇 —; all others —
(e) SED test of female A's sad: Female A — — 〇; all others —
(f) SED test of female B's joyful: Female B — 〇 —; all others —
(g) SED test of female B's sad: Female B — — 〇; all others —

[Figure 2 diagram: linguistic features feed shared hidden layers; the output layer has a shared part, emotional parts (e_1, ..., e_M), and speaker parts (s_1, ..., s_N), combined by a weighted sum using the emotion ID and speaker ID to produce acoustic features.]
Figure 2: The parallel model (PM)
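The one-hot IDs defined in Eqs. (7) and (8) can be sketched as follows; treating index 0 as the neutral emotion (mapping to the zero vector) is our own convention for illustration:

```python
import numpy as np

def emotion_id(i, M):
    """Eq. (7): one-hot emotion ID of dimension M. By the convention
    assumed here, i = 0 denotes the neutral emotion and maps to the
    zero vector, as described in the text."""
    E = np.zeros(M)
    if i > 0:
        E[i - 1] = 1.0
    return E

def speaker_id(j, N):
    """Eq. (8): one-hot speaker ID of dimension N for speaker j (1-based)."""
    S = np.zeros(N)
    S[j - 1] = 1.0
    return S
```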
We propose five types of DNNs that can separately controlthe speaker factor and emotional factor.
As shown in Fig. 2, the PM has an output layer comprising emotion-dependent parts (Emotion 1, Emotion 2, ..., Emotion M), speaker-dependent parts (Speaker 1, Speaker 2, ..., Speaker N), and a shared part. The PM uses $\left[ E^{(i)\top}\; S^{(j)\top} \right]^\top$ as the auxiliary vector $v_a^{(\ell)}$ in the output layer ($\ell = L$) of Eq. (5). The outputs of the emotion-dependent part and the speaker-dependent part are summed linearly, because the linear activation function is used at the output layer. The PM is newly proposed and is motivated by the multi-speaker DNN [6] and the emotion additive model [17], where the hidden layers are regarded as a linguistic feature transformation shared by all speakers [23]. Because the acoustic feature is represented as the sum of the emotion-dependent part and the speaker-dependent part, the emotional factor and the speaker factor are separately controlled.

In an SM, the speaker factor and the emotional factor are sequentially modeled in different layers. There are two types of architectures: the SM_se, which models the speaker factor in the former layer and then models the emotional factor in the later layer, and the SM_es, which models the two factors in the reverse order. The SM_se uses $\left[ S^{(j)\top} \right]^\top$ as the auxiliary vector $v_a^{(\ell)}$ in the last hidden layer ($\ell = L - 1$) and $\left[ E^{(i)\top} \right]^\top$ as the auxiliary vector $v_a^{(\ell)}$ in the output layer ($\ell = L$) of Eq. (5). Opposite to the SM_se, the SM_es uses $\left[ E^{(i)\top} \right]^\top$ in the last hidden layer and $\left[ S^{(j)\top} \right]^\top$ in the output layer. As with the PM, in the output layer, the output of the speaker-dependent or emotion-dependent part is summed linearly. However, in the hidden layer, the output of the emotion-dependent or speaker-dependent part is combined nonlinearly because the sigmoid activation function is used.
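The PM's output composition can be sketched as follows (our own NumPy illustration with hypothetical sizes; biases are omitted for brevity, and treating the shared part as always active is our assumption):

```python
import numpy as np

def pm_output(h, W_shared, W_emo, W_spk, E, S):
    """Parallel-model sketch: the linear output layer sums a shared
    part and the emotion- and speaker-dependent parts selected by the
    one-hot IDs E and S."""
    emo = sum(e * (W @ h) for e, W in zip(E, W_emo))
    spk = sum(s * (W @ h) for s, W in zip(S, W_spk))
    return W_shared @ h + emo + spk

rng = np.random.default_rng(2)
h = rng.standard_normal(4)                                # last hidden layer output
W_shared = rng.standard_normal((3, 4))
W_emo = [rng.standard_normal((3, 4)) for _ in range(2)]   # M = 2 emotions
W_spk = [rng.standard_normal((3, 4)) for _ in range(3)]   # N = 3 speakers
# Neutral emotion (zero emotion ID) combined with speaker 2:
y = pm_output(h, W_shared, W_emo, W_spk, E=np.zeros(2), S=np.array([0.0, 1.0, 0.0]))
```

Because the parts are summed linearly, swapping the one-hot IDs recombines a trained speaker part with any trained emotion part, which is the mechanism the extrapolation relies on.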
An AIM implicitly models the speaker factor and the emotional factor by forwarding the value of the auxiliary input vector. The AIM uses $\left[ E^{(i)\top}\; S^{(j)\top} \right]^\top$ as the auxiliary vector $v_a^{(\ell)}$ of the input layer ($\ell = 1$).

As hybrid models, two types of models are used, namely the combination of PM and AIM (PM&AIM) and the combination of SM and AIM (SM&AIM). The PM&AIM uses the auxiliary vector in the input layer ($\ell = 1$) and the output layer ($\ell = L$). The SM&AIM uses the auxiliary vector in the input layer ($\ell = 1$), the last hidden layer ($\ell = L - 1$), and the output layer ($\ell = L$).
4. Evaluation experiments
To evaluate the extrapolation performance of the proposed architectures, open-emotion tests and closed-emotion tests are conducted both objectively and subjectively. Here, open-emotion and closed-emotion mean that, in the training step, the emotional speech of a target speaker is not included and is included, respectively.

Table 2: The DNN architectures for the phoneme duration model and the acoustic model. (The hatched box indicates the expanded part of the simplest DNN.) (a) Phoneme duration model; (b) Acoustic model.
[Table 2 body (layout not fully recoverable from the source): layer sizes per model. For the SED, the phoneme duration model is 298-32-32-1 and the acoustic model is 305-256-256-256-154. The other models expand this baseline: the AIM uses wider layers (64-unit hidden layers for the duration model; 512-unit hidden layers for the acoustic model) with the IDs appended to the input, while the PM, SM, and hybrid variants expand the last hidden and/or output layers into factor-dependent parts, e.g., outputs of 1*(2+1) and 154*(2+1) for the SM_se, and 1*(16+1) and 154*(16+1) for the SM_es.]
In the experiments, two Japanese speech corpora, corpus α and corpus β, were used. In corpus α, the same 500 sentences were uttered in several ways: two female speakers and a male speaker uttered the sentences with neutral, joyful, and sad emotions, and another male speaker uttered them with only the neutral emotion. The duration of each dataset was approximately 35 minutes, and the duration of the entire corpus α is approximately 350 minutes. In corpus β, the same 130 sentences, which differed from those of corpus α, were uttered with only the neutral emotion by six female and six male speakers. The duration of each dataset was approximately 40 minutes, and the duration of the entire corpus β is approximately 480 minutes. The speech signals were sampled at 22.05 kHz and quantized at 16 bits. The phoneme duration was manually annotated and labeled in the same format as the hidden Markov model toolkit (HTK) [24]. The speech data were divided into a training set, validation set, and test set at a ratio of 90%:5%:5%; that is, corpus α was divided into 450:25:25 sentences, and corpus β was divided into 120:5:5 sentences.

Table 1 presents a summary of the training and test data. A circle in the table indicates that 95% of the data (the training set) is used for training, and a dash indicates that the speech is not included in the training. Hatching indicates data used for the evaluation; a circle in a hatched box indicates that 5% of the data (the test set) is used for evaluation. For the open-emotion test, the DNN was trained using data (a) and (b) in Table 1, and the synthesized speech is compared with the real emotional speech uttered by female speakers A and B. For the closed-emotion test, the DNN is trained using data (c) in Table 1, and the synthesized speech is again compared with the real emotional speech uttered by female speakers A and B. As references, speaker-and-emotion-dependent models (SEDs) were trained using the training sets of (d), (e), (f), and (g), and evaluated using their test sets.
The proposed models and the SEDs are trained using the database described in Section 4.1. STRAIGHT [25] analysis is used to extract the spectral envelope, aperiodicity, F0, and the voiced/unvoiced flag with a 5-ms frame shift. Next, 40-dimensional Mel-cepstral coefficients, 10 band-aperiodicities, and F0 on a log scale are calculated. Notably, 80% of the silent frames are removed from the training data to avoid increasing the proportion of silence in the training data and to reduce the computational cost.

For the PM, SM, and SED, the input feature vectors are 289-dimensional binary features of categorical linguistic contexts (e.g., quinphone, the interrogative sentence flag) and 9-dimensional numerical linguistic contexts (e.g., the number of morae in the current word, the relative position toward the accent nucleus in the current mora). For the AIM, PM&AIM, and SM&AIM, the speaker and emotion IDs are added to the input feature vectors as auxiliary features. Because the speaker ID has 16 dimensions and the emotion ID has 2 dimensions, the dimension of the input vector becomes 316.

The output feature is the integer scalar value that indicates the number of frames (i.e., the phoneme duration). The output features of the training data are normalized to zero mean and unit variance.

As the DNN model architectures, the FFNNs presented in Table 2 (a) are used. A sigmoid function is used in the hidden layers, followed by a linear activation at the output layer. For the training process, the weights of all DNNs (PM, SM, AIM, PM&AIM, SM&AIM, and SED) are randomly initialized. The weights are trained using backpropagation with minibatch-based MomentumSGD to minimize the mean squared error between the output features of the training data and the predicted values. The initial learning rate of MomentumSGD is 0.16 (PM, SM, PM&AIM, SM&AIM, and SED) or 0.08 (AIM), and the momentum is 0.9. The training data for each minibatch are randomly selected, and the minibatch size is 64 (PM, SM, AIM, PM&AIM, and SM&AIM) or 16 (SED). The training schedule randomly selects the data in the same manner as a conventional DNN. The hyper-parameters of each model were selected by a grid search, preferring higher performance with a smaller number of parameters.
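The momentum update used by MomentumSGD can be sketched as follows (the classic momentum formulation with the stated learning rate 0.16 and momentum 0.9; the exact variant used in the experiments is our assumption):

```python
import numpy as np

def momentum_sgd_step(w, grad, velocity, lr=0.16, momentum=0.9):
    """One classic momentum-SGD update:
    v <- mu * v + grad;  w <- w - lr * v."""
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Toy example: a single update on a 2-dim parameter vector.
w, v = np.array([1.0, -2.0]), np.zeros(2)
g = np.array([0.5, 0.5])
w, v = momentum_sgd_step(w, g, v)   # v becomes 0.5*[1,1]; w moves by -0.08
```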
Figure 3: Objective evaluation results of the RMSE of phoneme duration (joyful and sad speech; open-emotion test, closed-emotion test, and SED for the PM, SM_se, SM_es, AIM, PM&AIM, SM_se&AIM, and SM_es&AIM).

For the PM, SM, and SED, 7-dimensional time features (e.g., the total frame number of the current mora, the current state in a 5-state) are added to the feature vector used in the phoneme duration model. Thus, the dimension of the input feature vector is 305. The dimension of the input feature vectors for the AIM, PM&AIM, and SM&AIM is 323, because both the speaker and emotion IDs are added as auxiliary features in the same manner as in the phoneme duration model.

The output feature vector contains log F0, 40 Mel-cepstral coefficients, 10 band-aperiodicities, their delta and delta-delta counterparts, and a voiced/unvoiced flag, which results in 154 dimensions. The voiced/unvoiced flag is a binary feature that indicates the voicing of the current frame. The output features of the training data are normalized to zero mean and unit variance. In these experiments, phoneme durations extracted from natural speech are used.

As the DNN model architectures, the FFNNs presented in Table 2 (b) were used. The activation function, loss function, optimizer, and training schedule are the same as in Section 4.2.1. The initial learning rate of MomentumSGD is 1.28 (PM, SM, AIM, PM&AIM, SM&AIM, and SED), and the momentum is 0.9. The training data for each minibatch are randomly selected, and the minibatch size is 128 (PM, SM, AIM, PM&AIM, and SM&AIM, and SED). The hyper-parameters of each model are selected using the same rule as for the phoneme duration model.
To confirm the advantages of the proposed architectures, objective evaluations were performed by comparing the values estimated using the models with the values extracted from real emotional speech. Another aim of the experiments is to know the upper limit, given by interpolation, by comparing the estimated values of the open-emotion test with those of the closed-emotion test.
The phoneme duration models are evaluated by the root mean squared error (RMSE) of phoneme duration, calculated between the phoneme duration extracted from real emotional speech and the phoneme duration generated by the phoneme duration model.
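The duration metric can be sketched as follows (our own illustration, assuming durations are counted in frames and converted with the paper's 5-ms frame shift):

```python
import numpy as np

def duration_rmse_ms(pred_frames, ref_frames, frame_shift_ms=5.0):
    """RMSE of phoneme duration in milliseconds, given predicted and
    reference durations expressed as frame counts."""
    pred = np.asarray(pred_frames, dtype=float) * frame_shift_ms
    ref = np.asarray(ref_frames, dtype=float) * frame_shift_ms
    return float(np.sqrt(np.mean((pred - ref) ** 2)))
```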
[Figure 4 plot: F0 [Hz] versus time [s] for the target, PM (closed), and SED, with their respective means. The accompanying table gives the correlation coefficient and RMSE of log F0: PM (closed) 0.81 / 287.96; SED 0.63 / 465.65.]
Figure 4: The comparison of F0 of female B's joyful speech. The upper right table shows the values that are calculated using the shown F0 patterns.
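The two log-F0 measures can be sketched as follows (our own illustration; restricting the computation to voiced frames and converting the RMSE to cents via 1200·log2 are assumptions consistent with Figure 6's cent axis):

```python
import numpy as np

def logf0_metrics(f0_gen, f0_ref):
    """Correlation coefficient of log F0 and RMSE of log F0 in cents,
    computed over frames where both F0 tracks (in Hz) are positive
    (i.e., both voiced)."""
    f0_gen, f0_ref = np.asarray(f0_gen, float), np.asarray(f0_ref, float)
    voiced = (f0_gen > 0) & (f0_ref > 0)
    lg, lr = np.log(f0_gen[voiced]), np.log(f0_ref[voiced])
    corr = float(np.corrcoef(lg, lr)[0, 1])
    # 1 octave = 1200 cents, so the log-Hz error scales by 1200 / ln 2.
    rmse_cent = float(np.sqrt(np.mean((1200.0 * (lg - lr) / np.log(2.0)) ** 2)))
    return corr, rmse_cent
```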
Figure 3 presents the experimental results. Comparing the open-emotion test and the closed-emotion test, the difference is less than 5 ms except for the joyful speech of the SM_se and SM_se&AIM. The differences are small and might not be perceivable by listening. Based on these results, for phoneme duration modeling, the proposed models work well, and we do not need to collect emotional speech uttered by each speaker.

The acoustic models are evaluated by the correlation coefficient of log F0, the RMSE of log F0, and the Mel-cepstral distortion (MCD). Each objective measure is calculated between the parameters extracted from the real emotional speech and the parameters generated by the acoustic model. Figure 4 shows the F0 contours and the average F0 of joyful speech generated by the SED and by the PM in the closed-emotion test, as well as those extracted from the target speaker's joyful speech. The upper right table indicates the correlation coefficient of log F0 and the RMSE of log F0 calculated from this single utterance. It is observed that the correlation coefficient of log F0 and the RMSE of log F0 are useful for evaluating whether the F0 contour and the average F0 are similar to those of the target speech.

Figure 5 presents the results for the correlation coefficient of log F0. In the closed-emotion test, the SED has poorer performance than the proposed models. The main reason is that the SED is trained on approximately 35 minutes of data, while the proposed models are trained on approximately 760 minutes. Moreover, corpus β contains different texts from corpus α, which increases the variation in phoneme contexts. This is another advantage of the proposed approach; i.e., we can effectively use speech data from many speakers. Interestingly, even in the open-emotion test, all the proposed models except the SM_se and SM_se&AIM have the same or better performance compared with the SED. This is because all the speakers uttered the same text with several emotions.
Because Japanese is a tonal language, F0 patterns are important for conveying meaning, so speakers cannot drastically change the shape of F0 patterns, but can change only the height or length of F0 patterns to express emotions. Therefore, the correlation coefficient of log F0 is fairly good in the open-emotion tests; the degradation of the coefficient is only 0.
Figure 5: Objective evaluation results of the correlation coefficient of log F0
Figure 6: Objective evaluation results of the RMSE of log F0

In contrast to the correlation coefficient of log F0, in the open-emotion test, all the proposed models have poorer performance than the SED in the RMSE of log F0. This mainly occurs because each speaker has their own way of controlling F0 contours to express emotion.

Figure 7 presents the results for the MCD. In the open-emotion test, all the proposed models again have poorer performance than the SED. This indicates that each speaker also changes their articulation in their own fashion to express emotion, as with the F0 contour.

In terms of the overall performance of the models, we can say that the SM_se and SM_es are not promising, because their performance showed different tendencies and is not predictable, as seen in the correlation coefficient of log F0 and in the RMSE of log F0. Like these simple SMs, the hybrid SMs, namely SM_se&AIM and SM_es&AIM, are not promising in overall performance.

According to the results of the objective evaluations, emotional expressions highly depend on each speaker. However, emotional expression can be useful even though the manner of expression differs from the speaker's own. Therefore, subjective tests are conducted to examine to what extent emotional expressions are reproduced in the open-emotion test. Firstly, to confirm the
basic performance, naturalness and speaker similarity are evaluated by the mean opinion score (MOS). Then, emotion identification tests are performed in the open-emotion test.

Figure 7: Objective evaluation results of the MCD
To evaluate the naturalness of the synthesized speech, an MOS test was carried out. For the open-emotion test, stimuli were synthesized by the PM, SMse, SMes, AIM, PM&AIM, SMse&AIM, and SMes&AIM. For the closed-emotion test, stimuli were synthesized only by the PM and AIM, because the simple and hybrid SMs performed poorly on the correlation coefficient of log F0. As reference speech, the SED and speech resynthesized by STRAIGHT (Resyns) were employed. Forty-eight sentences (twelve sentences covering two emotions from two female speakers in corpus α of Table 1) were synthesized using each model. A five-point scale (1, very unnatural, to 5, very natural) was used for the MOS.

To evaluate speaker similarity, an MOS test was performed in which the quality of the synthesized emotional speech was compared to the target neutral speech (Resyns). The models used in this experiment were the same as in the naturalness test. Twenty-four sentences (six sentences covering two emotions from two female speakers in corpus α of Table 1) were synthesized for each pair. A five-point scale (1, very dissimilar, to 5, very similar) was used for the MOS.

Fifteen Japanese listeners participated in both MOS tests. The order of presenting the stimuli was randomly selected, but the order was the same for all participants. In the speech synthesis, phoneme durations extracted from the neutral speech were used, and acoustic features were smoothed by maximum likelihood parameter generation (MLPG) [26]. The variance of these features was expanded to the global variance (GV) [27], extracted from the target neutral speech, by using variance scaling [28].

Figure 8 presents the results of the MOS test for naturalness. In the closed-emotion test, the PM and AIM show better naturalness than in the open-emotion test. However, the difference
Figure 8: MOS test of naturalness results with their 95% confidence interval.
Figure 9: MOS test of speaker similarity results with their 95% confidence interval.

between the closed- and open-emotion tests is smaller than the difference between Resyns and the closed-emotion test. In the open-emotion test, the SMse&AIM shows significantly better performance than the SMse. This result shows that the AIM mechanism helps to improve the performance of the SMse; however, this is not true for the PM and SMes.

Figure 9 presents the results of the MOS test for speaker similarity. As shown in the figure, even the resynthesized voice achieves an opinion score of only approximately 3. This indicates that, even when emotional speech is uttered by the same speaker, the speaker identity of emotional speech differs somewhat from that of neutral speech. Because of this, in terms of speaker identity, there are only small differences between the closed- and open-emotion tests.

To evaluate the performance in extrapolating emotional expressions, emotion identification tests were carried out in the open-emotion test. Fifteen Japanese listeners participated in the subjective test. For each presented synthesized utterance, they were asked to select an emotion from four choices: neutral, sad, joyful, and others. As stimuli for the open-emotion test, the PM, SMes, AIM, PM&AIM, and SMes&AIM each synthesized 120 sentences (five sentences covering three emotions from four male and four female speakers in corpus β of Table 1); the total number of sentences was thus 600. As reference stimuli for the closed-emotion test, the PM was selected: in informal listening tests, the SM was worse, while there were no significant differences between the AIM and PM, and the PM showed average performance in the objective evaluations. The PM synthesized 30 sentences (five sentences covering three emotions from two female speakers in corpus α of Table 1). The order of presenting the stimuli was randomly selected, but the order was the same for all participants. The speech synthesis procedures are the same as in Section 4.4.1.
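The GV-based variance scaling used in the synthesis procedure above (expanding MLPG-smoothed features to the target speaker's global variance [27, 28]) can be sketched per feature dimension as follows; the function name is ours, and this is only one common way to implement variance scaling:

```python
import numpy as np

def variance_scaling(feats, target_gv):
    """Rescale each feature dimension so that its utterance variance matches a
    target global variance (GV), while keeping the utterance mean fixed.
    feats: (frames x dims) generated features; target_gv: (dims,) GV values."""
    feats = np.asarray(feats, dtype=float)
    mean = feats.mean(axis=0)
    var = feats.var(axis=0)
    # Per-dimension scale factor; guard against zero variance
    scale = np.sqrt(np.asarray(target_gv, dtype=float) / np.maximum(var, 1e-12))
    return (feats - mean) * scale + mean

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.5, size=(100, 3))   # over-smoothed features (small variance)
y = variance_scaling(x, target_gv=np.ones(3))
print(np.allclose(y.var(axis=0), 1.0))    # variance now matches the target GV
```

This addresses the over-smoothing introduced by MLPG: the generated trajectories keep their shape and mean but regain the dynamic range observed in the target speaker's natural speech.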
A chi-square test was used to evaluate whether, for each correct emotion, the open-emotion test differs significantly from the closed-emotion test.

Table 3 presents the confusion matrices of the participants' choices and the correct answers. A symbol (*) indicates p < .001 in the chi-square test. For sad, no significant difference is observed between the closed-emotion test (Table 3 (f)) and the open-emotion tests (Table 3 (a), (b), (c), (d), and (e)). This mainly occurs because the differences in F0 and the cepstrum for sad are relatively small in Fig. 6 and Fig. 7, and are easily trained from other speakers' speech. Based on these results, we conclude that sadness can be expressed by all the proposed models (PM, SMes, AIM, PM&AIM, and SMes&AIM). For joyful, however, only the PM demonstrates little difference between the closed-emotion test and the open-emotion test. In Fig. 6 and Fig. 7, the differences in F0 and the cepstrum for joyful are larger than those for sad. These results indicate that the PM can model the speaker factor and the emotional factors independently, even when large differences are observed in the acoustic parameters used to express emotion. Accordingly, among the five proposed models, the PM should be selected to synthesize emotional speech based on the extrapolation approach.
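The chi-square comparison described above can be sketched on raw judgment counts; the counts below are illustrative only, not the paper's data, and the critical value is the standard table value for p = .001 at 3 degrees of freedom:

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total   # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical judgment counts for one correct emotion; rows: closed- vs.
# open-emotion condition; columns: judged NEU, JOY, SAD, OTH.
closed = [5, 3, 60, 7]
open_ = [20, 10, 35, 10]
stat = chi2_statistic([closed, open_])
dof = (2 - 1) * (4 - 1)                  # (rows - 1) * (cols - 1) = 3
CRIT_3DF_P001 = 16.266                   # chi-square critical value, p = .001, 3 dof
print(round(stat, 2), stat > CRIT_3DF_P001)
```

A statistic above the critical value would correspond to the (*) marking in Table 3, i.e. a significant difference between the closed- and open-emotion conditions for that emotion.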
5. Conclusion and future work
In this paper, to generate emotional expressions using DNN-based TTS, we proposed the following five models: PM, SM, AIM, PM&AIM, and SM&AIM. These models are based on the following extrapolation approach: emotional expression models are trained using speech uttered by a particular person, and the models are applied to another person to generate emotional speech with that person's voice quality. In other words, collecting emotional speech uttered by a target speaker is unnecessary, and emotional expression is generated using models trained on the emotional speech of another person. To evaluate the extrapolation performance of the proposed models, an open-emotion test and a closed-emotion test were conducted both objectively and subjectively. The objective evaluation results demonstrate that the performances in the open-emotion test are insufficient in comparison to those in the closed-emotion test, because each speaker has their own manner of expressing emotion. However, the subjective evaluation results indicate that the proposed models can convey emotional information to some extent; notably, the PM can correctly convey sad and joyful emotions at a rate of > .

A remaining issue is the trade-off between collecting emotional speech uttered by target speakers and the performance of the extrapolation approach. As mentioned in Section 1 (Introduction), it is difficult for individuals to utter speech with a specific emotion and to continue speaking with that emotion for an extended time. Thus, there is no guarantee that models trained on emotional speech collected from target speakers always outperform the extrapolation approaches. The performance of the DNN-based extrapolation method will also be clarified by comparing it with conventional extrapolation methods.

References

[1] Heiga Zen, Andrew Senior, and Mike Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proceedings of ICASSP, 2013, pp. 7962-7966.
[2] Yao Qian, Yuchen Fan, Wenping Hu, and Frank K. Soong, "On the training aspects of deep neural network (DNN) for parametric TTS synthesis," in Proceedings of ICASSP, 2014, pp. 3829-3833.
[3] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King, "Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis," in Proceedings of ICASSP, 2015, pp. 4460-4464.
[4] Oliver Watts, Gustav Eje Henter, Thomas Merritt, Zhizheng Wu, and Simon King, "From HMMs to DNNs: where do the improvements come from?," in Proceedings of ICASSP, 2016, pp. 5505-5509.
[5] Yuchen Fan, Yao Qian, Feng-Long Xie, and Frank K. Soong, "TTS synthesis with bidirectional LSTM based recurrent neural networks," in Proceedings of INTERSPEECH, 2014, pp. 1964-1968.
[6] Yuchen Fan, Yao Qian, Frank K. Soong, and Lei He, "Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis," in Proceedings of ICASSP, 2015, pp. 4475-4479.
[7] Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, and Simon King, "A study of speaker adaptation for DNN-based speech synthesis," in Proceedings of INTERSPEECH, 2015, pp. 879-883.
[8] Nobukatsu Hojo, Yusuke Ijima, and Hideyuki Mizuno, "DNN-based speech synthesis using speaker codes," IEICE Transactions on Information and Systems, vol. 101, no. 2, pp. 462-472, 2018.
[9] Hieu-Thi Luong, Shinji Takaki, Gustav Eje Henter, and Junichi Yamagishi, "Adapting and controlling DNN-based speech synthesis using input codes," in Proceedings of ICASSP, 2017, pp. 4905-4909.
[10] Bo Li and Heiga Zen, "Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric speech synthesis," in Proceedings of INTERSPEECH, 2016, pp. 2468-2472.
[11] Shumin An, Zhenhua Ling, and Lirong Dai, "Emotional statistical parametric speech synthesis using LSTM-RNNs," in Proceedings of APSIPA ASC, 2017, pp. 1613-1616.
[12] Jaime Lorenzo-Trueba, Gustav Eje Henter, Shinji Takaki, Junichi Yamagishi, Yosuke Morino, and Yuta Ochiai, "Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis," Speech Communication, vol. 99, pp. 135-143, 2018.
[13] Hongwu Yang, Weizhao Zhang, and Pengpeng Zhi, "A DNN-based emotional speech synthesis by speaker adaptation," in Proceedings of APSIPA ASC, 2018, pp. 633-637.
[14] Hiroki Kanagawa, Takashi Nose, and Takao Kobayashi, "Speaker-independent style conversion for HMM-based expressive speech synthesis," in Proceedings of ICASSP, 2013, pp. 7864-7868.
[15] Jaime Lorenzo-Trueba, Roberto Barra-Chicote, Oliver Watts, and Juan Manuel Montero, "Towards speaking style transplantation in speech synthesis," in , 2013, pp. 159-163.
[16] Junichi Yamagishi, Takao Kobayashi, Yuji Nakano, Katsumi Ogata, and Juri Isogai, "Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 66-83, 2009.
[17] Yamato Ohtani, Yu Nasu, Masahiro Morita, and Masami Akamine, "Emotional transplant in statistical speech synthesis based on emotion additive model," in Proceedings of INTERSPEECH, 2015, pp. 274-278.
[18] Katsuki Inoue, Sunao Hara, Masanobu Abe, Nobukatsu Hojo, and Yusuke Ijima, "An investigation to transplant emotional expressions in DNN-based TTS synthesis," in Proceedings of APSIPA ASC, 2017, pp. 1253-1258.
[19] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2010.
[20] Rich Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41-75, 1997.
[21] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proceedings of ICASSP, 2014, pp. 4052-4056.
[22] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proceedings of ICASSP, 2018, pp. 5329-5333.
[23] Junichi Yamagishi, Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi, "A training method of average voice model for HMM-based speech synthesis," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 86, no. 8, pp. 1956-1963, 2003.
[24] Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, et al., "The HTK Book," Cambridge University Engineering Department, vol. 3, pp. 75, 2006.
[25] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain De Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187-207, 1999.
[26] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proceedings of ICASSP, 2000, pp. 1315-1318.
[27] Tomoki Toda and Keiichi Tokuda, "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis," IEICE Transactions on Information and Systems, vol. 90, no. 5, pp. 816-824, 2007.
[28] Hanna Silén, Elina Helander, Jani Nurminen, and Moncef Gabbouj, "Ways to implement global variance in statistical speech synthesis," in Proceedings of INTERSPEECH, 2012, pp. 1436-1439.

Table 3: Confusion matrices for subjective emotional classification results. Values indicate classification accuracy; * indicates p < .001 in a chi-square test between the closed-emotion test by the PM (f) and the others (a, b, c, d, and e).