MAXIMIZING MUTUAL INFORMATION FOR TACOTRON

Peng Liu†, Xixin Wu‡, Shiyin Kang†, Guangzhi Lei†, Dan Su†, Dong Yu†

† Tencent AI Lab
‡ Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, China

{feanorliu, shiyinkang, guangzhilei, dansu, dyu}@tencent.com, [email protected]

ABSTRACT
End-to-end speech synthesis methods already achieve close-to-human quality. However, compared to HMM-based and NN-based frame-to-frame regression methods, they are prone to synthesis errors such as missing or repeated words and incomplete synthesis. We attribute the comparatively high utterance error rate to the local information preference of conditional autoregressive (CAR) models and to the ill-posed training objective of the model, which mostly describes the training status of the autoregressive module and rarely that of the condition module. Inspired by InfoGAN, we propose to maximize the mutual information between the text condition and the predicted acoustic features to strengthen the dependency between them in CAR speech synthesis models, which alleviates the local information preference issue and reduces the utterance error rate. The mutual information training objective can also be considered a metric of the dependency between the autoregressive module and the condition module. Experiment results show that our method can reduce the utterance error rate.

Index Terms — speech synthesis, end-to-end, mutual information, Tacotron, conditional autoregressive model
1. INTRODUCTION
Tacotron [1] and Tacotron2 [2] are conditional autoregressive (CAR) models trained with teacher forcing [3]. The condition is summarized from the input text with an attention mechanism [4]. Transformer-TTS [5] can be considered another instance of a CAR model, with effective utilization of the self-attention mechanism [6]. Such architectures can be trained in an end-to-end way, so they have a much shorter pipeline and need less expert knowledge and human labor. They are flexible enough to adapt to speaking style [7, 8] and multiple speakers [9, 10]. In addition, they are easy to combine with neural vocoders [11, 12, 13] to enhance the synthesized waveform quality. Code for this paper is available at https://github.com/bfs18/tacotron2.

Training with teacher forcing induces a mismatch between the training period and the inference period, usually known as exposure bias [14]. Even worse, it strengthens the local information preference [15] of the CAR model. We first explain the local information preference intuitively. At each time step during training, the CAR model receives a teacher forcing input and a conditional input. The teacher forcing input is the previous time step of the target; the conditional input is the text to be synthesized. If the CAR model learns to copy the teacher forcing input, or to predict the target depending entirely on the teacher forcing input without using the conditional information, it still achieves a small training root mean square error (RMSE). A model that achieves small RMSE may therefore not have learned to depend on the condition at all, and at inference time such a CAR model generates results that have nothing to do with the condition. Note that the local information preference still exists even if teacher forcing is not used: when a random variable $\mathbf{x}$ admits autoregressive dependency over a conditional random variable $\mathbf{z}$, i.e. $p(\mathbf{x} \mid \mathbf{z}) = \prod_i p(x_i \mid x_{<i}, \mathbf{z})$, the model can reach a high likelihood by predicting each $x_i$ from the history $x_{<i}$ alone, so the information carried by $\mathbf{z}$ tends to be ignored.
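To make the local information preference concrete, the following is a minimal PyTorch sketch of one teacher-forced CAR decoder step. This is our illustration, not the paper's released code, and the module and dimension names are hypothetical. Because consecutive acoustic frames are highly correlated, predicting the target from `prev_frame` alone already yields a small RMSE, so gradient descent has little incentive to make the step use `condition`:

```python
import torch
import torch.nn as nn

class ToyCARDecoderStep(nn.Module):
    """One decoder step of a conditional autoregressive (CAR) model."""

    def __init__(self, acoustic_dim=80, condition_dim=512, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTMCell(acoustic_dim + condition_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, prev_frame, condition, state):
        # Teacher forcing: prev_frame is the ground-truth previous frame.
        # Nothing forces this step to extract information from `condition`;
        # copying prev_frame is often enough to reach a small training RMSE.
        h, c = self.rnn(torch.cat([prev_frame, condition], dim=-1), state)
        return self.proj(h), (h, c)

# Minimal usage with zero tensors, batch size 1.
step = ToyCARDecoderStep()
state = (torch.zeros(1, 256), torch.zeros(1, 256))
frame, state = step(torch.zeros(1, 80), torch.zeros(1, 512), state)
```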
2. WHY CAR TTS MODEL TENDS TO IGNORE THE TEXT CONDITION
In this section, we first explain the local information preference of CAR models formally. Then we explain why Tacotron still works even though it tends to ignore the text condition.
Usually we perform maximum likelihood estimation (MLE) to train a CAR speech synthesis model, and the model communicates information from the text to the acoustic features through time-aligned latent variables. Such latent variables exist in various speech recognition and synthesis systems, such as the hidden states in HMM-based speech synthesis systems, the forward-backward search matrix in CTC recognizers, and the attention variables in Tacotron. We can formalize the CAR speech synthesis model as a variational encoder-decoder (VED) [18]. We use $\mathbf{t}$ and $\mathbf{x}$ to represent a text and its corresponding acoustic features in the training set. Since the model is a CAR model, the conditional likelihood can be written as

$$
\log p_{\theta}(\mathbf{x} \mid \mathbf{t}) = \sum_{i=1}^{N} \log p_{\theta}(x_i \mid x_{<i}, \mathbf{t}) \qquad (1)
$$

where $N$ is the number of acoustic frames. As discussed in Section 1, the autoregressive terms $p_{\theta}(x_i \mid x_{<i}, \mathbf{t})$ can reach high likelihood while extracting little information from $\mathbf{t}$, so the MLE objective alone does not guarantee that the model depends on the condition. Tacotron still works in practice because several of its designs, such as predicting multiple frames per step with a reduction factor, weaken the autoregressive decoder and thereby force it to rely more on the text condition.
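As a reading aid for Eq. 1, here is a sketch of the teacher-forced conditional log-likelihood. It is our illustration under two stated assumptions: `step_fn` has the same `(prev_frame, condition, state)` signature as the toy decoder step in Section 1, and the output model is a fixed-variance Gaussian, so each log-probability term reduces, up to an additive constant, to a negative squared error:

```python
import torch

def car_log_likelihood(step_fn, frames, condition, init_state):
    """Evaluate sum_i log p(x_i | x_<i, t) with teacher forcing.

    frames: (T, B, D) ground-truth acoustic frames; condition: (B, C).
    Returns a per-utterance value of shape (B,), up to an additive constant.
    """
    total = torch.zeros(frames.shape[1])
    state = init_state
    prev = torch.zeros_like(frames[0])        # all-zero "go" frame for x_1
    for target in frames:                     # one term of Eq. 1 per step
        pred, state = step_fn(prev, condition, state)
        total = total - ((pred - target) ** 2).sum(dim=-1)
        prev = target                         # teacher forcing: feed truth
    return total
```

Note that nothing in this objective rewards `step_fn` for reading `condition`: if `prev` is predictive enough, the likelihood is already high, which is exactly the failure mode described above.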
3. MAXIMIZING MUTUAL INFORMATION (MMI) FOR TACOTRON
Although the previously mentioned designs in Tacotron alleviate the local information preference, they weaken the autoregressive decoder and decrease the model's performance. A model using reduction factor 2 generates better perceptual results than one using reduction factor 5 [1], which indicates that the more the autoregressive model is weakened, the larger the drop in performance. Even worse, Tacotron makes mistakes, such as repeating words, omitting words and producing incomplete sentences, which seldom appear in HMM-based methods [25] or NN-based frame-to-frame regression methods [26, 27, 28]. The dependency between the predicted acoustic features and the text input is not sufficiently modeled in Tacotron. If the dependency were sufficiently modeled and the model were penalized heavily for such mistakes during training, the generated acoustic features would strictly follow the text. We therefore take the InfoGAN approach [16] and maximize the mutual information between the predicted acoustic features and the input text during training to strengthen the dependency between them.
The mutual information between the input text $\mathbf{t}$ and the predicted acoustic features $\tilde{\mathbf{x}}$ is

$$
\begin{aligned}
I(\tilde{\mathbf{x}};\mathbf{t}) &= H(\mathbf{t}) - H(\mathbf{t}\mid\tilde{\mathbf{x}})\\
&= \mathbb{E}_{\tilde{\mathbf{x}}\sim p_{\alpha}(\tilde{\mathbf{x}})}\big[\mathbb{E}_{\mathbf{t}\sim p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})}[\log p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})]\big] + H(\mathbf{t})\\
&= \mathbb{E}_{\tilde{\mathbf{x}}\sim p_{\alpha}(\tilde{\mathbf{x}})}\big[D_{\mathrm{KL}}\!\left(p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})\,\|\,q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})\right) + \mathbb{E}_{\mathbf{t}\sim p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})}[\log q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})]\big] + H(\mathbf{t}) \qquad (4)\\
&\ge \mathbb{E}_{\tilde{\mathbf{x}}\sim p_{\alpha}(\tilde{\mathbf{x}})}\big[\mathbb{E}_{\mathbf{t}\sim p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})}[\log q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})]\big] + H(\mathbf{t})\\
&= \mathbb{E}_{\mathbf{t}\sim p(\mathbf{t}),\,\tilde{\mathbf{x}}\sim p_{\alpha}(\tilde{\mathbf{x}}\mid\mathbf{t})}[\log q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})] + H(\mathbf{t}) \qquad (5)
\end{aligned}
$$

where $\alpha = \{\theta, \phi\}$ are the CAR model parameters. In Eq. 4 we introduce an auxiliary distribution $q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})$ to approximate the posterior $p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})$, which is intractable. The lower bound derivation uses the variational information maximization technique [29, 16]. $H(\mathbf{t})$ is a constant for our problem. From Eq. 5 we can see that maximizing the mutual information between the input text and the predicted acoustic features is equivalent to training an auxiliary recognizer that maximizes the probability of recognizing the input text from the predicted acoustic features, with respect to both the CAR model parameters $\alpha$ and the auxiliary recognizer parameters $\beta$. This is intuitively sound: if the predicted acoustic features are consistently recognized as the input text, the model has produced the correct result. Adding the mutual information term to the training objective in Eq. 1 penalizes the model if it ignores the dependency between the predicted acoustic features and the text. When this penalty is stronger than the KL-divergence cost of encoding information into the time-aligned latent variables, the model learns meaningful latent variables that exploit the text.

To keep the end-to-end property, we use a simple CTC recognizer as the auxiliary recognizer. For simplicity, the CTC recognizer uses the same convolution stack + bidirectional LSTM [30] layer structure as Tacotron2's text encoder, except that it has an extra CTC loss layer. The lack of a language model is usually considered a drawback of CTC recognizers [31]; however, this exactly meets our demand, since we do not want a language model to remedy the detected errors. Minimizing the CTC loss strengthens the dependency between the predicted acoustic features and the input text during training. The final loss function is

$$
L = |x_{\mathrm{mel}} - \tilde{x}_{\mathrm{mel}}| + |x_{\mathrm{linear}} - \tilde{x}_{\mathrm{linear}}| + \mathrm{CELoss}(x_{\mathrm{stop}}, \tilde{x}_{\mathrm{stop}}) + \lambda\,\mathrm{CTCLoss}(\mathbf{t}, \tilde{x}_{\mathrm{mel}}) \qquad (6)
$$

where $|\cdot|$ is the L1 norm. The first two terms on the right-hand side are the reconstruction losses for the Mel spectrum and the linear spectrum. The model also minimizes the cross-entropy loss for the stop tokens and the CTC loss between the predicted Mel spectrum and the text to be synthesized; $\lambda$ controls the relative weight of the CTC loss. The linear-spectrum loss is kept because we use the Griffin-Lim algorithm to reconstruct waveforms for monitoring training progress.
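The following is a minimal PyTorch sketch of Eq. 6. It is our illustration, not the released implementation: the tensor layouts are assumptions, `recognizer` stands in for the conv-stack + BiLSTM CTC recognizer described above, and the default weight for `lam` is a placeholder rather than the paper's setting. The key point is that the CTC term is computed on the predicted Mel spectrum, so its gradient flows back into the CAR model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def mmi_tacotron_loss(mel_pred, linear_pred, stop_logits,
                      mel, linear, stop_targets,
                      phonemes, mel_lens, text_lens,
                      recognizer, lam=1.0):
    """Eq. 6; `lam` is the weight lambda (placeholder default)."""
    # L1 reconstruction losses for the Mel and linear spectra.
    recon = F.l1_loss(mel_pred, mel) + F.l1_loss(linear_pred, linear)
    # Cross entropy for the binary stop tokens.
    stop = F.binary_cross_entropy_with_logits(stop_logits, stop_targets)
    # The auxiliary CTC recognizer reads the *predicted* Mel spectrum, so
    # minimizing this term maximizes the MI lower bound of Eq. 5 w.r.t.
    # both the recognizer and the CAR model. The sketch assumes the
    # recognizer emits frame-synchronous logits of shape (T, B, vocab).
    log_probs = recognizer(mel_pred).log_softmax(-1)
    ctc_term = ctc(log_probs, phonemes, mel_lens, text_lens)
    return recon + stop + lam * ctc_term
```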
4. EXPERIMENTS
We show that maximizing the mutual information between the predicted acoustic features and the text to be synthesized can reduce the rate of bad cases.
We use LJSpeech [22] for English and the Databaker Chinese Standard Mandarin Speech Corpus (db-CSMSC) for Mandarin Chinese in our experiments. LJSpeech contains 13,100 audio clips of a single female speaker. We process the transcriptions with Festival [32] to get the phoneme sequences. db-CSMSC contains 10,000 standard Mandarin sentences recorded by a single female native speaker in a professional recording studio. The dataset provides Chinese character and pinyin transcriptions as well as hand-crafted time intervals. In our experiments we only use the pinyin transcriptions and convert each pinyin sequence to a scheme of initials and sub-finals. Our pinyin scheme contains far fewer units than the initial-final pinyin scheme, which alleviates the out-of-vocabulary and data sparsity problems.

All waveforms are downsampled to 16 kHz in our experiments. We extract 2048-point STFT magnitudes with a Hanning window, using a 50 ms window length and a 12.5 ms window shift, and warp the magnitudes with a Mel filterbank into an 80-band Mel spectrum. A log operation is then applied to both the linear and Mel spectra. We use repeat padding for training samples of different lengths within a batch, since zero padding would distort the batch normalization statistics.

We use the Adam optimizer [33] with β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁶. The initial learning rate is 10⁻³ and it decays in proportion to 1/√step once training passes the warmup steps [34]. The gradient is clipped by maximum global norm [35].

We use Tacotron2 for our experiments and hold out a small fraction of the dataset as the validation set. For English, the test cases are randomly chosen from the 1132 CMU ARCTIC [36] sentences. For Mandarin Chinese, the test cases are chosen from text of different domains. We use a test set of 100 sentences for the listening test. The average numbers of words/characters and phonemes per utterance are 8.8 and 32.1 for English and 15.6 and 41.4 for Chinese. We use an open-sourced WaveRNN vocoder (https://github.com/fatchord/WaveRNN) to reconstruct waveforms from Mel spectra.

In Tacotron2, the attention context is concatenated with the LSTM output and projected by a linear transform to predict the Mel spectrum. This means the predicted Mel spectrum contains linear components of the text information. If we used this Mel spectrum as the input to the CTC recognizer, the text information would be too easily accessible to the recognizer. This may cause the text information to be encoded in the Mel spectrum in a pathological way and, combined with location-sensitive attention, lead to a strictly diagonal alignment map (one acoustic frame output per phoneme input). So, before the linear transform, we add an extra LSTM layer to mix the text information and the acoustic information (a sketch of this tweak follows Table 3 below). λ is held fixed in our experiments and the checkpoint for evaluation is selected at 200k training steps.

We use the default configuration with reduction factor (RF) 5 as the baseline, and we test the effectiveness of MMI and of the teacher-forcing frame-dropping trick with RF 2. The results are recorded in Table 2. We can see that both MMI and a drop frame rate (DFR) of 0.2 reduce the utterance error rate (UER).

Table 2. Utterance error rate (UER) for different configurations (RF is short for reduction factor and DFR is short for drop frame rate).

                          RF 2                  RF 5
corpus                    DFR 0.0   DFR 0.2     DFR 0.0
LJSpeech     no MMI       16%       15%         10%
LJSpeech     MMI          10%       5%          -
db-CSMSC     no MMI       17%       12%         7%
db-CSMSC     MMI          5%        4%          -

Table 3. Mean opinion score (MOS) with 95% confidence intervals for different configurations.

          DFR 0.0     DFR 0.2     MMI + DFR 0.0     MMI + DFR 0.2
MOS       3.84 ±      ±           ±                 ±
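Here is a sketch of the decoder-output tweak described above, with hypothetical module and dimension names of our own: an extra LSTM mixes the attention context with the decoder LSTM output before the linear projection, so the predicted Mel frames no longer carry the text condition as a plain linear component that the CTC recognizer could read off trivially.

```python
import torch
import torch.nn as nn

class MixThenProject(nn.Module):
    """Extra LSTM between the attention context and the Mel projection."""

    def __init__(self, context_dim=512, decoder_dim=1024, mel_dim=80):
        super().__init__()
        self.mixer = nn.LSTM(context_dim + decoder_dim, decoder_dim)
        self.proj = nn.Linear(decoder_dim, mel_dim)

    def forward(self, attention_context, decoder_output):
        # Inputs are time-major: (T, B, context_dim) and (T, B, decoder_dim).
        # Without the mixer, proj(cat(context, output)) would leave a linear
        # copy of the text information inside the predicted Mel frames.
        mixed, _ = self.mixer(torch.cat([attention_context,
                                         decoder_output], dim=-1))
        return self.proj(mixed)
```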
We observe that the gap in reconstruction error (the first two RHS terms of Eq. 6) between the training and validation sets begins to increase from 10k steps when MMI is not used. This does not happen when MMI is used, as depicted in Figure 1, indicating that the MMI training objective prevents the autoregressive module from fitting the non-linguistic detail in the acoustic features, while the original training objective is not a good indicator because it does not take into account the dependency between the acoustic features and the text.

We conduct a mean opinion score (MOS) test to check whether the extra MMI objective degrades the synthesized waveform quality. Only correctly synthesized waveforms are selected for this test. From Table 3, we can see that Tacotron2 with DFR 0.2 achieves the best perceptual result and that the model with MMI achieves similar perceptual performance.

Fig. 1. Loss curves for the training and validation sets. The x- and y-axes are training step and reconstruction error. Training curves are orange solid lines and validation curves are blue dashed lines. (a) shows a model trained without MMI and (b) shows one trained with MMI.
5. RELATED WORK

Many previous works focus on improving Tacotron's reliability. In [23], professor forcing is adopted to mitigate the exposure bias induced by training with teacher forcing. In [37], the authors use a diagonal attention penalty to enforce that the alignment between the acoustic features and the text is approximately diagonal. In [38], the authors propose to use alignment information from hand-crafted labels or from an HMM-based system to guide the attention of Tacotron. Since a large body of legacy corpora and HMM-based systems exists, this is an efficient way to improve Tacotron. However, it is not trained in an end-to-end way, and the implicit duration model of Tacotron then relies on alignment information that is not self-contained. Transformer-TTS adopts a self-attention structure to improve training and inference efficiency and to shorten the long-range dependency path between any two inputs at different time steps [5].

In the speech-to-speech translation task [24], experiment results demonstrated that a multi-task recognition loss worked, but without a proper explanation. It can be explained by Eq. 5: minimizing the multi-task recognition loss can be interpreted as maximizing the mutual information between the learned hidden representation and the corresponding text in that task. When training with the multi-task recognition loss, the learned hidden representation encodes more linguistic information rather than acoustic information only, resulting in a better fit for the speech translation task.
6. CONCLUSION
In this paper we analyze why Tacotron is prone to synthesis errors. In short, sufficiently modeling the correlation between the text and the acoustic features is important for avoiding bad cases. To this end, we propose to maximize the mutual information between the text and the predicted acoustic features with an auxiliary CTC recognizer. Experiment results show that our method reduces the rate of bad cases. Besides, our method can be trained in an end-to-end manner and keeps the short pipeline of the original method.

7. REFERENCES

[1] Y. Wang et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," CoRR, vol. abs/1703.10135, 2017.
[2] J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in ICASSP, 2018, pp. 4779-4783.
[3] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270-280, 1989.
[4] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
[5] N. Li et al., "Close to human quality TTS with transformer," CoRR, vol. abs/1809.08895, 2018.
[6] A. Vaswani et al., "Attention is all you need," in NIPS, 2017, pp. 6000-6010.
[7] R. J. Skerry-Ryan et al., "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," in ICML, 2018, pp. 4700-4709.
[8] Y. Wang et al., "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in ICML, 2018, pp. 5167-5176.
[9] A. Gibiansky et al., "Deep Voice 2: Multi-speaker neural text-to-speech," in NIPS, 2017, pp. 2966-2974.
[10] Y. Jia et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in NeurIPS, 2018, pp. 4485-4495.
[11] N. Kalchbrenner et al., "Efficient neural audio synthesis," in ICML, 2018, pp. 2415-2424.
[12] A. van den Oord et al., "WaveNet: A generative model for raw audio," in ISCA, 2016, p. 125.
[13] A. van den Oord et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in ICML, 2018, pp. 3915-3923.
[14] M. Ranzato et al., "Sequence level training with recurrent neural networks," in ICLR, 2016.
[15] X. Chen et al., "Variational lossy autoencoder," in ICLR, 2017.
[16] X. Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in NIPS, 2016, pp. 2172-2180.
[17] I. J. Goodfellow et al., "Generative adversarial networks," CoRR, vol. abs/1406.2661, 2014.
[18] C. Zhou and G. Neubig, "Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction," in ACL, 2017, pp. 310-320.
[19] S. Shankar and S. Sarawagi, "Posterior attention models for sequence to sequence learning," in ICLR, 2019.
[20] X. Wu et al., "Attention-based recurrent generator with Gaussian tolerance for statistical parametric speech synthesis," in ASMMC, 2017.
[21] S. R. Bowman et al., "Generating sentences from a continuous space," in SIGNLL, 2016, pp. 10-21.
[22] K. Ito, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[23] H. Guo et al., "A new GAN-based end-to-end TTS training algorithm," CoRR, vol. abs/1904.04775, 2019.
[24] Y. Jia et al., "Direct speech-to-speech translation with a sequence-to-sequence model," CoRR, vol. abs/1904.06037, 2019.
[25] M. Coto-Jiménez and J. G. Close, "Speech synthesis based on hidden Markov models and deep learning," Research in Computing Science, vol. 112, pp. 19-28, 2016.
[26] Y. Fan et al., "TTS synthesis with bidirectional LSTM based recurrent neural networks," in INTERSPEECH, 2014, pp. 1964-1968.
[27] S. Kang and H. M. Meng, "Statistical parametric speech synthesis using weighted multi-distribution deep belief network," in INTERSPEECH, 2014, pp. 1959-1963.
[28] H. Zen, A. W. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in ICASSP, 2013, pp. 7962-7966.
[29] D. Barber and F. V. Agakov, "The IM algorithm: A variational approach to information maximization," in NIPS, 2003, pp. 201-208.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[31] A. Graves, "Sequence transduction with recurrent neural networks," CoRR, vol. abs/1211.3711, 2012.
[32] P. Taylor, A. W. Black, and R. Caley, "The architecture of the Festival speech synthesis system," in ESCA/COCOSDA, 1998, pp. 147-152.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[34] A. Vaswani et al., "Tensor2Tensor for neural machine translation," in AMTA, 2018, pp. 193-199.
[35] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in ICML, 2013, pp. 1310-1318.
[36] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in ISCA ITRW on Speech Synthesis, 2004, pp. 223-224.
[37] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in ICASSP, 2018, pp. 4784-4788.
[38] X. Zhu et al., "Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis."