MAXIMIZING MUTUAL INFORMATION FOR TACOTRON

Peng Liu†, Xixin Wu‡, Shiyin Kang†, Guangzhi Lei†, Dan Su†, Dong Yu†

† Tencent AI Lab
‡ Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, China

{feanorliu, shiyinkang, guangzhilei, dansu, dyu}@tencent.com, [email protected]

ABSTRACT
End-to-end speech synthesis methods already achieve close-to-human quality. However, compared to HMM-based and NN-based frame-to-frame regression methods, they are prone to synthesis errors such as missing or repeated words and incomplete synthesis. We attribute the comparatively high utterance error rate to the local information preference of conditional autoregressive (CAR) models and to the ill-posed training objective of the model, which mostly describes the training status of the autoregressive module and rarely that of the condition module. Inspired by InfoGAN, we propose to maximize the mutual information between the text condition and the predicted acoustic features to strengthen the dependency between them in CAR speech synthesis models, which alleviates the local information preference issue and reduces the utterance error rate. The mutual information training objective can also be considered a metric of the dependency between the autoregressive module and the condition module. Experiment results show that our method can reduce the utterance error rate.

Index Terms — speech synthesis, end-to-end, mutual information, Tacotron, conditional autoregressive model
1. INTRODUCTION
Tacotron [1] and Tacotron2 [2] are conditional autoregressive (CAR) models trained with teacher forcing [3]. The condition is summarized from the input text with an attention mechanism [4]. Transformer-TTS [5] can be considered another instance of a CAR model, with effective utilization of the self-attention mechanism [6]. Such architectures can be trained in an end-to-end way, so they have a much shorter pipeline and need less expert knowledge and human labor. They are flexible enough to adapt to speaking style [7, 8] and multiple speakers [9, 10]. In addition, they are easy to combine with neural vocoders [11, 12, 13] to enhance the synthesized waveform quality. Code for this paper is available at https://github.com/bfs18/tacotron2.

Training with teacher forcing induces a mismatch between the training period and the inference period, usually known as exposure bias [14]. Even worse, it strengthens the local information preference [15] of the CAR model. We first explain the local information preference intuitively. At each time step during training, the CAR model receives a teacher forcing input and a conditional input. The teacher forcing input is the previous time step of the target; the conditional input is the text to be synthesized. If the CAR model learns to copy the teacher forcing input, or to predict the target depending entirely on the teacher forcing input without using the conditional information, it still achieves a small training root mean square error (RMSE). A model that achieves small RMSE may therefore not have learned to depend on the condition at all, and at inference time such a CAR model generates results that have nothing to do with the condition. Note that the local information preference still exists even if teacher forcing is not used: when a random variable $\mathbf{x}$ admits autoregressive dependency over a conditional random variable $\mathbf{z}$, i.e. $p(\mathbf{x} \mid \mathbf{z}) = \prod_i p(x_i \mid x_{<i}, \mathbf{z})$, the model can reach a high likelihood by predicting each $x_i$ from the history $x_{<i}$ alone, so the information carried by $\mathbf{z}$ tends to be ignored.
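To make the local information preference concrete, the following is a minimal PyTorch sketch of one teacher-forced CAR decoder step. This is our illustration, not the paper's released code, and the module and dimension names are hypothetical. Because consecutive acoustic frames are highly correlated, predicting the target from `prev_frame` alone already yields a small RMSE, so gradient descent has little incentive to make the step use `condition`:

```python
import torch
import torch.nn as nn

class ToyCARDecoderStep(nn.Module):
    """One decoder step of a conditional autoregressive (CAR) model."""

    def __init__(self, acoustic_dim=80, condition_dim=512, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTMCell(acoustic_dim + condition_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, prev_frame, condition, state):
        # Teacher forcing: prev_frame is the ground-truth previous frame.
        # Nothing forces this step to extract information from `condition`;
        # copying prev_frame is often enough to reach a small training RMSE.
        h, c = self.rnn(torch.cat([prev_frame, condition], dim=-1), state)
        return self.proj(h), (h, c)

# Minimal usage with zero tensors, batch size 1.
step = ToyCARDecoderStep()
state = (torch.zeros(1, 256), torch.zeros(1, 256))
frame, state = step(torch.zeros(1, 80), torch.zeros(1, 512), state)
```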
2. WHY CAR TTS MODEL TENDS TO IGNORE THE TEXT CONDITION
In this section, we first explain the local information preference of CAR models formally. Then we explain why Tacotron still works even though it tends to ignore the text condition.
Usually we perform maximum likelihood estimation (MLE) to train a CAR speech synthesis model, and the model communicates information from the text to the acoustic features through time-aligned latent variables. Such latent variables exist in various speech recognition and synthesis systems, such as the hidden states in HMM-based speech synthesis systems, the forward-backward search matrix in CTC recognizers, and the attention variables in Tacotron. We can formalize the CAR speech synthesis model as a variational encoder-decoder (VED) [18]. We use $\mathbf{t}$ and $\mathbf{x}$ to represent a text and its corresponding acoustic features in the training set. Since the model is a CAR model, the conditional likelihood can be written as

$$
\log p_{\theta}(\mathbf{x} \mid \mathbf{t}) = \sum_{i=1}^{N} \log p_{\theta}(x_i \mid x_{<i}, \mathbf{t}) \qquad (1)
$$

where $N$ is the number of acoustic frames. As discussed in Section 1, the autoregressive terms $p_{\theta}(x_i \mid x_{<i}, \mathbf{t})$ can reach high likelihood while extracting little information from $\mathbf{t}$, so the MLE objective alone does not guarantee that the model depends on the condition. Tacotron still works in practice because several of its designs, such as predicting multiple frames per step with a reduction factor, weaken the autoregressive decoder and thereby force it to rely more on the text condition.
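As a reading aid for Eq. 1, here is a sketch of the teacher-forced conditional log-likelihood. It is our illustration under two stated assumptions: `step_fn` has the same `(prev_frame, condition, state)` signature as the toy decoder step in Section 1, and the output model is a fixed-variance Gaussian, so each log-probability term reduces, up to an additive constant, to a negative squared error:

```python
import torch

def car_log_likelihood(step_fn, frames, condition, init_state):
    """Evaluate sum_i log p(x_i | x_<i, t) with teacher forcing.

    frames: (T, B, D) ground-truth acoustic frames; condition: (B, C).
    Returns a per-utterance value of shape (B,), up to an additive constant.
    """
    total = torch.zeros(frames.shape[1])
    state = init_state
    prev = torch.zeros_like(frames[0])        # all-zero "go" frame for x_1
    for target in frames:                     # one term of Eq. 1 per step
        pred, state = step_fn(prev, condition, state)
        total = total - ((pred - target) ** 2).sum(dim=-1)
        prev = target                         # teacher forcing: feed truth
    return total
```

Note that nothing in this objective rewards `step_fn` for reading `condition`: if `prev` is predictive enough, the likelihood is already high, which is exactly the failure mode described above.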
3. MAXIMIZING MUTUAL INFORMATION (MMI) FOR TACOTRON
Although the previously mentioned designs in Tacotron alleviate the local information preference, they weaken the autoregressive decoder and decrease the model's performance. A model using reduction factor 2 generates better perceptual results than one using reduction factor 5 [1], which indicates that the more the autoregressive model is weakened, the larger the drop in performance. Even worse, Tacotron makes mistakes, such as repeating words, omitting words and producing incomplete sentences, which seldom appear in HMM-based methods [25] or NN-based frame-to-frame regression methods [26, 27, 28]. The dependency between the predicted acoustic features and the text input is not sufficiently modeled in Tacotron. If the dependency were sufficiently modeled and the model were penalized heavily for such mistakes during training, the generated acoustic features would strictly follow the text. We therefore take the InfoGAN approach [16] and maximize the mutual information between the predicted acoustic features and the input text during training to strengthen the dependency between them.
The mutual information between the input text $\mathbf{t}$ and the predicted acoustic features $\tilde{\mathbf{x}}$ is

$$
\begin{aligned}
I(\tilde{\mathbf{x}};\mathbf{t}) &= H(\mathbf{t}) - H(\mathbf{t}\mid\tilde{\mathbf{x}})\\
&= \mathbb{E}_{\tilde{\mathbf{x}}\sim p_{\alpha}(\tilde{\mathbf{x}})}\big[\mathbb{E}_{\mathbf{t}\sim p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})}[\log p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})]\big] + H(\mathbf{t})\\
&= \mathbb{E}_{\tilde{\mathbf{x}}\sim p_{\alpha}(\tilde{\mathbf{x}})}\big[D_{\mathrm{KL}}\!\left(p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})\,\|\,q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})\right) + \mathbb{E}_{\mathbf{t}\sim p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})}[\log q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})]\big] + H(\mathbf{t}) \qquad (4)\\
&\ge \mathbb{E}_{\tilde{\mathbf{x}}\sim p_{\alpha}(\tilde{\mathbf{x}})}\big[\mathbb{E}_{\mathbf{t}\sim p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})}[\log q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})]\big] + H(\mathbf{t})\\
&= \mathbb{E}_{\mathbf{t}\sim p(\mathbf{t}),\,\tilde{\mathbf{x}}\sim p_{\alpha}(\tilde{\mathbf{x}}\mid\mathbf{t})}[\log q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})] + H(\mathbf{t}) \qquad (5)
\end{aligned}
$$

where $\alpha = \{\theta, \phi\}$ are the CAR model parameters. In Eq. 4 we introduce an auxiliary distribution $q_{\beta}(\mathbf{t}\mid\tilde{\mathbf{x}})$ to approximate the posterior $p_{\alpha}(\mathbf{t}\mid\tilde{\mathbf{x}})$, which is intractable. The lower bound derivation uses the variational information maximization technique [29, 16]. $H(\mathbf{t})$ is a constant for our problem. From Eq. 5 we can see that maximizing the mutual information between the input text and the predicted acoustic features is equivalent to training an auxiliary recognizer that maximizes the probability of recognizing the input text from the predicted acoustic features, with respect to both the CAR model parameters $\alpha$ and the auxiliary recognizer parameters $\beta$. This is intuitively sound: if the predicted acoustic features are consistently recognized as the input text, the model has produced the correct result. Adding the mutual information term to the training objective in Eq. 1 penalizes the model if it ignores the dependency between the predicted acoustic features and the text. When this penalty is stronger than the KL-divergence cost of encoding information into the time-aligned latent variables, the model learns meaningful latent variables that exploit the text.

To keep the end-to-end property, we use a simple CTC recognizer as the auxiliary recognizer. For simplicity, the CTC recognizer uses the same convolution stack + bidirectional LSTM [30] layer structure as Tacotron2's text encoder, except that it has an extra CTC loss layer. The lack of a language model is usually considered a drawback of CTC recognizers [31]; however, this exactly meets our demand, since we do not want a language model to remedy the detected errors. Minimizing the CTC loss strengthens the dependency between the predicted acoustic features and the input text during training. The final loss function is

$$
L = |x_{\mathrm{mel}} - \tilde{x}_{\mathrm{mel}}| + |x_{\mathrm{linear}} - \tilde{x}_{\mathrm{linear}}| + \mathrm{CELoss}(x_{\mathrm{stop}}, \tilde{x}_{\mathrm{stop}}) + \lambda\,\mathrm{CTCLoss}(\mathbf{t}, \tilde{x}_{\mathrm{mel}}) \qquad (6)
$$

where $|\cdot|$ is the L1 norm. The first two terms on the right-hand side are the reconstruction losses for the Mel spectrum and the linear spectrum. The model also minimizes the cross-entropy loss for the stop tokens and the CTC loss between the predicted Mel spectrum and the text to be synthesized; $\lambda$ controls the relative weight of the CTC loss. The linear-spectrum loss is kept because we use the Griffin-Lim algorithm to reconstruct waveforms for monitoring training progress.
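The following is a minimal PyTorch sketch of Eq. 6. It is our illustration, not the released implementation: the tensor layouts are assumptions, `recognizer` stands in for the conv-stack + BiLSTM CTC recognizer described above, and the default weight for `lam` is a placeholder rather than the paper's setting. The key point is that the CTC term is computed on the predicted Mel spectrum, so its gradient flows back into the CAR model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def mmi_tacotron_loss(mel_pred, linear_pred, stop_logits,
                      mel, linear, stop_targets,
                      phonemes, mel_lens, text_lens,
                      recognizer, lam=1.0):
    """Eq. 6; `lam` is the weight lambda (placeholder default)."""
    # L1 reconstruction losses for the Mel and linear spectra.
    recon = F.l1_loss(mel_pred, mel) + F.l1_loss(linear_pred, linear)
    # Cross entropy for the binary stop tokens.
    stop = F.binary_cross_entropy_with_logits(stop_logits, stop_targets)
    # The auxiliary CTC recognizer reads the *predicted* Mel spectrum, so
    # minimizing this term maximizes the MI lower bound of Eq. 5 w.r.t.
    # both the recognizer and the CAR model. The sketch assumes the
    # recognizer emits frame-synchronous logits of shape (T, B, vocab).
    log_probs = recognizer(mel_pred).log_softmax(-1)
    ctc_term = ctc(log_probs, phonemes, mel_lens, text_lens)
    return recon + stop + lam * ctc_term
```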
4. EXPERIMENTS
We show that maximizing the mutual information between the predicted acoustic features and the text to be synthesized can reduce the rate of bad cases.
We use LJSpeech [22] for English and the Databaker Chinese Standard Mandarin Speech Corpus (db-CSMSC) for Mandarin Chinese in our experiments. LJSpeech contains 13,100 audio clips of a single female speaker. We process the transcriptions with Festival [32] to get the phoneme sequences. db-CSMSC contains 10,000 standard Mandarin sentences recorded by a single female native speaker in a professional recording studio. The dataset provides Chinese character and pinyin transcriptions as well as hand-crafted time intervals. In our experiments we only use the pinyin transcriptions and convert each pinyin sequence to a scheme of initials and sub-finals. Our pinyin scheme contains far fewer units than the initial-final pinyin scheme, which alleviates the out-of-vocabulary and data sparsity problems.

All waveforms are downsampled to 16 kHz in our experiments. We extract 2048-point STFT magnitudes with a Hanning window, using a 50 ms window length and a 12.5 ms window shift, and warp the magnitudes with a Mel filterbank into an 80-band Mel spectrum. A log operation is then applied to both the linear and Mel spectra. We use repeat padding for training samples of different lengths within a batch, since zero padding would distort the batch normalization statistics.

We use the Adam optimizer [33] with β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁶. The initial learning rate is 10⁻³ and it decays in proportion to 1/√step once training passes the warmup steps [34]. The gradient is clipped by maximum global norm [35].

We use Tacotron2 for our experiments and hold out a small fraction of the dataset as the validation set. For English, the test cases are randomly chosen from the 1132 CMU ARCTIC [36] sentences. For Mandarin Chinese, the test cases are chosen from text of different domains. We use a test set of 100 sentences for the listening test. The average numbers of words/characters and phonemes per utterance are 8.8 and 32.1 for English and 15.6 and 41.4 for Chinese. We use an open-sourced WaveRNN vocoder (https://github.com/fatchord/WaveRNN) to reconstruct waveforms from Mel spectra.

In Tacotron2, the attention context is concatenated with the LSTM output and projected by a linear transform to predict the Mel spectrum. This means the predicted Mel spectrum contains linear components of the text information. If we used this Mel spectrum as the input to the CTC recognizer, the text information would be too easily accessible to the recognizer. This may cause the text information to be encoded in the Mel spectrum in a pathological way and, combined with location-sensitive attention, lead to a strictly diagonal alignment map (one acoustic frame output per phoneme input). So, before the linear transform, we add an extra LSTM layer to mix the text information and the acoustic information (a sketch of this tweak follows Table 3 below). λ is held fixed in our experiments and the checkpoint for evaluation is selected at 200k training steps.

We use the default configuration with reduction factor (RF) 5 as the baseline, and we test the effectiveness of MMI and of the teacher-forcing frame-dropping trick with RF 2. The results are recorded in Table 2. We can see that both MMI and a drop frame rate (DFR) of 0.2 reduce the utterance error rate (UER).

Table 2. Utterance error rate (UER) for different configurations (RF is short for reduction factor and DFR is short for drop frame rate).

                          RF 2                  RF 5
corpus                    DFR 0.0   DFR 0.2     DFR 0.0
LJSpeech     no MMI       16%       15%         10%
LJSpeech     MMI          10%       5%          -
db-CSMSC     no MMI       17%       12%         7%
db-CSMSC     MMI          5%        4%          -

Table 3. Mean opinion score (MOS) with 95% confidence intervals for different configurations.

          DFR 0.0     DFR 0.2     MMI + DFR 0.0     MMI + DFR 0.2
MOS       3.84 ±      ±           ±                 ±
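Here is a sketch of the decoder-output tweak described above, with hypothetical module and dimension names of our own: an extra LSTM mixes the attention context with the decoder LSTM output before the linear projection, so the predicted Mel frames no longer carry the text condition as a plain linear component that the CTC recognizer could read off trivially.

```python
import torch
import torch.nn as nn

class MixThenProject(nn.Module):
    """Extra LSTM between the attention context and the Mel projection."""

    def __init__(self, context_dim=512, decoder_dim=1024, mel_dim=80):
        super().__init__()
        self.mixer = nn.LSTM(context_dim + decoder_dim, decoder_dim)
        self.proj = nn.Linear(decoder_dim, mel_dim)

    def forward(self, attention_context, decoder_output):
        # Inputs are time-major: (T, B, context_dim) and (T, B, decoder_dim).
        # Without the mixer, proj(cat(context, output)) would leave a linear
        # copy of the text information inside the predicted Mel frames.
        mixed, _ = self.mixer(torch.cat([attention_context,
                                         decoder_output], dim=-1))
        return self.proj(mixed)
```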
We observe that the gap in reconstruction error (the first two RHS terms of Eq. 6) between the training and validation sets begins to increase from 10k steps when MMI is not used. This does not happen when MMI is used, as depicted in Figure 1, indicating that the MMI training objective prevents the autoregressive module from fitting the non-linguistic detail in the acoustic features, while the original training objective is not a good indicator because it does not take into account the dependency between the acoustic features and the text.

We conduct a mean opinion score (MOS) test to check whether the extra MMI objective degrades the synthesized waveform quality. Only correctly synthesized waveforms are selected for this test. From Table 3, we can see that Tacotron2 with DFR 0.2 achieves the best perceptual result and that the model with MMI achieves similar perceptual performance.

Fig. 1. Loss curves for the training and validation sets. The x- and y-axes are training step and reconstruction error. Training curves are orange solid lines and validation curves are blue dashed lines. (a) shows a model trained without MMI and (b) shows one trained with MMI.
5. RELATED WORK

Many previous works focus on improving Tacotron's reliability. In [23], professor forcing is adopted to mitigate the exposure bias induced by training with teacher forcing. In [37], the authors use a diagonal attention penalty to enforce that the alignment between the acoustic features and the text is approximately diagonal. In [38], the authors propose to use alignment information from hand-crafted labels or from an HMM-based system to guide the attention of Tacotron. Since a large body of legacy corpora and HMM-based systems exists, this is an efficient way to improve Tacotron. However, it is not trained in an end-to-end way, and the implicit duration model of Tacotron then relies on alignment information that is not self-contained. Transformer-TTS adopts a self-attention structure to improve training and inference efficiency and to shorten the long-range dependency path between any two inputs at different time steps [5].

In the speech-to-speech translation task [24], experiment results demonstrated that a multi-task recognition loss worked, but without a proper explanation. It can be explained by Eq. 5: minimizing the multi-task recognition loss can be interpreted as maximizing the mutual information between the learned hidden representation and the corresponding text in that task. When training with the multi-task recognition loss, the learned hidden representation encodes more linguistic information rather than acoustic information only, resulting in a better fit for the speech translation task.
6. CONCLUSION
In this paper we analyze why Tacotron is prone to synthesis errors. In short, sufficiently modeling the correlation between the text and the acoustic features is important for avoiding bad cases. To this end, we propose to maximize the mutual information between the text and the predicted acoustic features with an auxiliary CTC recognizer. Experiment results show that our method reduces the rate of bad cases. Besides, our method can be trained in an end-to-end manner and keeps the short pipeline of the original method.

7. REFERENCES

[1] Y. Wang et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," CoRR, vol. abs/1703.10135, 2017.
[2] J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in ICASSP, 2018, pp. 4779-4783.
[3] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270-280, 1989.
[4] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
[5] N. Li et al., "Close to human quality TTS with transformer," CoRR, vol. abs/1809.08895, 2018.
[6] A. Vaswani et al., "Attention is all you need," in NIPS, 2017, pp. 6000-6010.
[7] R. J. Skerry-Ryan et al., "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," in ICML, 2018, pp. 4700-4709.
[8] Y. Wang et al., "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in ICML, 2018, pp. 5167-5176.
[9] A. Gibiansky et al., "Deep Voice 2: Multi-speaker neural text-to-speech," in NIPS, 2017, pp. 2966-2974.
[10] Y. Jia et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in NeurIPS, 2018, pp. 4485-4495.
[11] N. Kalchbrenner et al., "Efficient neural audio synthesis," in ICML, 2018, pp. 2415-2424.
[12] A. van den Oord et al., "WaveNet: A generative model for raw audio," in ISCA, 2016, p. 125.
[13] A. van den Oord et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in ICML, 2018, pp. 3915-3923.
[14] M. Ranzato et al., "Sequence level training with recurrent neural networks," in ICLR, 2016.
[15] X. Chen et al., "Variational lossy autoencoder," in ICLR, 2017.
[16] X. Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in NIPS, 2016, pp. 2172-2180.
[17] I. J. Goodfellow et al., "Generative adversarial networks," CoRR, vol. abs/1406.2661, 2014.
[18] C. Zhou and G. Neubig, "Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction," in ACL, 2017, pp. 310-320.
[19] S. Shankar and S. Sarawagi, "Posterior attention models for sequence to sequence learning," in ICLR, 2019.
[20] X. Wu et al., "Attention-based recurrent generator with Gaussian tolerance for statistical parametric speech synthesis," in ASMMC, 2017.
[21] S. R. Bowman et al., "Generating sentences from a continuous space," in SIGNLL, 2016, pp. 10-21.
[22] K. Ito, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[23] H. Guo et al., "A new GAN-based end-to-end TTS training algorithm," CoRR, vol. abs/1904.04775, 2019.
[24] Y. Jia et al., "Direct speech-to-speech translation with a sequence-to-sequence model," CoRR, vol. abs/1904.06037, 2019.
[25] M. Coto-Jiménez and J. G. Close, "Speech synthesis based on hidden Markov models and deep learning," Research in Computing Science, vol. 112, pp. 19-28, 2016.
[26] Y. Fan et al., "TTS synthesis with bidirectional LSTM based recurrent neural networks," in INTERSPEECH, 2014, pp. 1964-1968.
[27] S. Kang and H. M. Meng, "Statistical parametric speech synthesis using weighted multi-distribution deep belief network," in INTERSPEECH, 2014, pp. 1959-1963.
[28] H. Zen, A. W. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in ICASSP, 2013, pp. 7962-7966.
[29] D. Barber and F. V. Agakov, "The IM algorithm: A variational approach to information maximization," in NIPS, 2003, pp. 201-208.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[31] A. Graves, "Sequence transduction with recurrent neural networks," CoRR, vol. abs/1211.3711, 2012.
[32] P. Taylor, A. W. Black, and R. Caley, "The architecture of the Festival speech synthesis system," in ESCA/COCOSDA, 1998, pp. 147-152.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[34] A. Vaswani et al., "Tensor2Tensor for neural machine translation," in AMTA, 2018, pp. 193-199.
[35] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in ICML, 2013, pp. 1310-1318.
[36] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in ISCA ITRW on Speech Synthesis, 2004, pp. 223-224.
[37] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in ICASSP, 2018, pp. 4784-4788.
[38] X. Zhu et al., "Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis."