Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language
Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
National Institute of Informatics, Japan; The University of Edinburgh, Edinburgh, UK; SOKENDAI (The Graduate University for Advanced Studies), Japan
[email protected], [email protected], [email protected], [email protected]
ABSTRACT
End-to-end speech synthesis is a promising approach that directly converts raw text to speech. Although it has been shown that Tacotron2 outperforms classical pipeline systems with regard to naturalness in English, its applicability to other languages is still unknown. Japanese could be one of the most difficult languages on which to achieve end-to-end speech synthesis, largely due to its character diversity and pitch accents. Therefore, state-of-the-art systems are still based on a traditional pipeline framework that requires a separate text analyzer and duration model. Towards end-to-end Japanese speech synthesis, we extend Tacotron to systems with self-attention to capture long-term dependencies related to pitch accents, and we compare their audio quality with that of classical pipeline systems under various conditions to show their pros and cons. In a large-scale listening test, we investigated the impacts of the presence of accentual-type labels, the use of forced or predicted alignments, and the acoustic features used as local condition parameters of the WaveNet vocoder. Our results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they provide important stepping stones towards end-to-end Japanese speech synthesis.
Index Terms — speech synthesis, deep learning, Tacotron
1. INTRODUCTION
Tacotron [1] opened a novel path to end-to-end speech synthesis. It enables us to directly convert input text to audio. Unlike traditional pipeline methods, which typically consist of separate text analyzer, acoustic, and duration models, Tacotron handles everything as a single model, which reduces laborious feature engineering and error propagation across cascaded models. Indeed, Tacotron2, a combination of the Tacotron system and WaveNet [2], successfully generated audio signals with very high MOS scores, comparable to human speech [3].

The above achievements of Tacotron and Tacotron2, and the similar results reported for ClariNet [4] and Transformer-based TTS [5], have been confirmed only for English, and to the best of our knowledge there have been only a few investigations of such architectures for other languages. This is partially or mainly because additional challenges must be overcome for other languages. This study focuses on the Japanese language, which is among the most challenging languages.

Footnote: This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan, and by MEXT KAKENHI Grant Numbers 16H06302, 17H04687, 18H04120, 18H04112, and 18KT0051, Japan.
Japanese writing has three types of orthographic characters: Hiragana, Katakana, and Kanji (Chinese characters). This character diversity causes a critical problem related to rare characters. Moreover, Japanese is a pitch-accented language, and accentual types (accent nucleus positions) may change the meanings of words; however, accentual types are not explicitly shown in Japanese characters. Furthermore, due to the accent sandhi phenomenon, accent nucleus positions are context dependent, so they change depending on adjacent words. Because of these problems, state-of-the-art systems for Japanese are dominantly pipeline systems that still rely on an external text analyzer including hand-written dictionaries and rules of pitch-accent types for each word, or on word-to-accentual-type predictors trained on such external resources [6]. An end-to-end approach may potentially simplify these processes in a data-driven way.

Towards the development of end-to-end Japanese TTS systems, we apply the Tacotron system to the Japanese language. We first propose enhanced systems with self-attention to better capture long-term dependencies. We then compare their audio quality with that of classical pipeline systems under various conditions. Finally, we conduct a large-scale listening test to investigate the impacts of the presence of accentual-type labels, the use of forced or predicted alignments, and the acoustic features used as local condition parameters of the WaveNet vocoder.

The remainder of this paper is structured as follows. In Section 2, we describe our Japanese Tacotron systems enhanced with self-attention. Section 3 presents the experimental conditions and the results of a large-scale listening test. Section 4 concludes with our findings and future work.
2. PROPOSED ARCHITECTURES FOR JAPANESE TTS

2.1. Tacotron using phoneme and accentual type
In this section, we describe our slightly modified baseline Tacotron [1] that can handle Japanese accentual-type labels. We refer to this system as JA-Tacotron. Figure 1-A shows its architecture. Tacotron is a sequence-to-sequence architecture [7] that consists of encoder and decoder networks. Unlike classical pipeline systems with explicit duration models, Tacotron uses an attention mechanism [8] that implicitly learns alignments between the source and target sequences. As a first investigation towards end-to-end Japanese speech synthesis, we use phoneme and accentual-type sequences as the source and a mel-spectrogram as the target. This baseline architecture is inspired by [9], which applied Tacotron to the Chinese language.

Fig. 1: Architectures of the proposed systems with accentual-type embedding. A: JA-Tacotron. B: SA-Tacotron. C: SA-Tacotron using vocoder parameters.

On the encoder side, phoneme and accentual-type sequences are embedded with separate embedding tables of different dimensions, and the embedding
vectors are bottle-necked by their corresponding pre-nets [1]. The two inputs are then concatenated and encoded by convolution banks, highway networks, and a bidirectional LSTM (CBH-LSTM) with zoneout regularization [10].

At the decoder, the encoded values are decoded with an attention-based LSTM decoder. We use forward attention [9] instead of additive attention [8] as the attention mechanism. As suggested in [9], forward attention accelerates alignment learning and provides distinct and robust alignments with less training time than the original Tacotron. The decoder LSTM is regularized with zoneout, as is the encoder, since the zoneout regularization is expected to reduce alignment errors. We set the reduction factor to two so that the decoder outputs two frames at each time step. A predicted mel-spectrogram is converted to an audio waveform with WaveNet [2]. We use a frame shift of 12.5 ms for the mel-spectrogram to train the JA-Tacotron model, as in [3].

Footnote: Our WaveNet model for JA-Tacotron is trained by fine-tuning with a ground-truth mel-spectrogram with a frame shift of 12.5 ms, starting from an existing model trained with a mel-spectrogram with a frame shift of 5 ms, in order to make the comparison with TTS systems using vocoder parameters fairer. We use a softmax distribution as the output layer of WaveNet.
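For concreteness, the following is a minimal NumPy sketch of one decoding step of the forward attention recursion described in [9]. It is an illustrative reimplementation, not the code used in this work; it assumes the raw attention energies are produced elsewhere (e.g., by an additive attention network), and all shapes and random values are placeholders.

```python
import numpy as np

def forward_attention_step(alpha_prev, energies, memory):
    """One step of forward attention (cf. [9]).

    alpha_prev: (N,) normalized forward probabilities from the previous step.
    energies:   (N,) raw attention energies for the current step.
    memory:     (N, D) encoder outputs.
    Returns the new forward probabilities and the context vector.
    """
    # Standard attention probabilities for this step (softmax of energies).
    y = np.exp(energies - energies.max())
    y /= y.sum()
    # Forward recursion: each position is reached either by staying or by
    # moving one step forward, which encourages a monotonic alignment.
    shifted = np.concatenate(([0.0], alpha_prev[:-1]))
    alpha = (alpha_prev + shifted) * y
    alpha /= alpha.sum() + 1e-8
    context = alpha @ memory  # (D,)
    return alpha, context

# Initialization: all probability mass on the first encoder state.
N, D = 6, 4
alpha = np.zeros(N); alpha[0] = 1.0
memory = np.random.randn(N, D)
for _ in range(3):
    alpha, context = forward_attention_step(alpha, np.random.randn(N), memory)
```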
2.2. Tacotron using self-attention

A pitch-accent language like Japanese uses lexical pitch accents that involve F0 changes. Japanese is a "mora-timed" pitch-accent language: there is an accent nucleus position, counted in mora units, within an accentual phrase. Pitch accents have a large impact on the perceptual naturalness of speech because incorrect pitch accents may be judged as incorrect "pronunciations" by listeners even if the phone realization is correct. Moreover, accentual phrases in Japanese vary in mora length, and since an accentual phrase can be very long, we hypothesize that long-term information plays a significantly important role in TTS for pitch-accent languages.

Therefore, we propose a modified architecture that introduces self-attention after the LSTM layers at the encoder and decoder, as illustrated in Figure 1-B. It is known that, by directly connecting distant states, self-attention relieves the high burden placed on LSTM to learn the long-term dependencies needed to sequentially propagate information over long distances [11]. This extension is inspired by a sequence-to-sequence neural machine translation architecture proposed in [12]. We refer to this architecture as SA-Tacotron.
The self-attention block consists of self-attention followed by a fully connected layer with tanh activation and a residual connection. We use multi-head dot-product attention [12] as the implementation of self-attention. This block is inserted after the LSTM layers at the encoder and decoder. At the encoder, the output of the CBH-LSTM layers is processed with the self-attention block. Since LSTM can capture the sequential relationships of the inputs, we do not use positional encoding [5]. Both the self-attended representation and the original output of the CBH-LSTM layers are final outputs of the encoder.

At the decoder, the two outputs from the encoder are attended to with a dual-source attention mechanism [13]. We choose a different attention mechanism for each source: forward attention for the output of the CBH-LSTM and additive attention for the self-attended values. This is because we want to utilize the benefits of both: forward attention accelerates alignment construction, and additive attention provides the flexibility to select long-term information from any segment. In addition, we can visualize both alignments. Unlike at the encoder, self-attention works autoregressively at the decoder. At each time step of decoding, the self-attention layer attends to all past frames of the LSTM outputs and outputs only the latest frames as a prediction output. The predicted frames are fed back as input for the next time step.

Footnote: At training time, since all target frames are available, this computation can be parallelized by applying a step mask. Since the decoder depends on LSTM, the whole computation cannot be parallelized, but this optimization decreases memory consumption because not all past LSTM outputs need to be preserved at each time step to calculate gradients on the backward pass of the backpropagation algorithm. Thanks to this optimization, we can train the extended architecture with a negligible increase in training time.
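A minimal NumPy sketch of such a self-attention block (multi-head dot-product attention [12] followed by a tanh fully connected layer with a residual connection) is shown below. It is an illustration of the technique, not the implementation used in this work; the weight shapes and the causal masking for the decoder-side variant are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_block(x, params, n_heads=2, causal=False):
    """Multi-head dot-product self-attention followed by a tanh
    fully connected layer with a residual connection.
    x: (T, D) sequence (e.g., CBH-LSTM outputs at the encoder).
    causal=True masks future frames, as needed at the decoder."""
    w_q, w_k, w_v, w_o, w_f = params
    T, D = x.shape
    d_h = D // n_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_h, (h + 1) * d_h)
        scores = (q[:, s] @ k[:, s].T) / np.sqrt(d_h)   # (T, T)
        if causal:  # forbid attending to future time steps
            scores += np.triu(np.full((T, T), -1e9), k=1)
        heads.append(softmax(scores) @ v[:, s])
    attended = np.concatenate(heads, axis=-1) @ w_o
    return x + np.tanh(attended @ w_f)                  # residual connection

# Example with random weights (shapes are illustrative only).
T, D = 10, 32
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(D, D)) for _ in range(5)]
out = self_attention_block(rng.normal(size=(T, D)), params, causal=True)
```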
2.3. Tacotron using vocoder parameters

Explicitly modeling the fundamental frequency (F0) might be a more appropriate choice for TTS systems for pitch-accent languages. To incorporate F0 into the proposed systems, we further developed a variant of SA-Tacotron that uses vocoder parameters as targets. We use mel-generalized cepstrum coefficients (MGC) and discretized log F0 as vocoder parameters, and we predict these parameters with Tacotron. We choose a 5 ms frame shift to extract MGC and F0, since such fine-grained analysis conditions are typically required for reliable vocoder-based speech analysis. Note, however, that this condition is not a natural choice for training Tacotron, which typically uses coarse-grained conditions, usually 12.5 ms frame shifts and 50 ms frame lengths, to reduce input and output mismatch. With a frame shift of 5 ms, the target vocoder parameter sequences become 2.5 times longer than under the normal 12.5 ms condition. In other words, 2.5 times more autoregressive loop iterations are required to predict a target, so this task is much more challenging. To alleviate the difficulty, we set the reduction factor to three in order to reduce the target length. This setting still results in a 5/3 times longer target length compared to the SA-Tacotron of the previous section.

Footnote: We tried larger reduction factors, but the audio quality deteriorated as the reduction factor increased.

Figure 1-C shows the modified architecture of the SA-Tacotron using MGC and log F0 as targets. To handle the two types of vocoder parameters, we introduce two pre-nets and three output layers at the decoder. The output layers include an MGC prediction layer that consists of two fully connected layers followed by tanh and linear activations, a log F0 prediction layer, which is a fully connected layer followed by softmax activation, and a stop-flag prediction layer, which is a fully connected layer followed by sigmoid activation. We represent discretized log F0 as one-hot labels at training time but feed back the predicted probability values at inference time [14]. We use an L1 loss for MGC and cross-entropy losses for the discretized log F0 and the stop flag, and we optimize the model using the weighted sum of the three losses. The cross-entropy loss of log F0 is scaled by 0.45 to adjust its magnitude to those of the other two loss terms.
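The overall training objective of the vocoder-parameter variant can be sketched as follows. This is a hedged NumPy reconstruction from the description above (L1 loss for MGC, cross entropy for discretized log F0 scaled by 0.45, and cross entropy for the stop flag); the unit weights on the other two terms and all shapes are assumptions for illustration.

```python
import numpy as np

def tacotron_vocoder_loss(mgc_pred, mgc_true, lf0_logits, lf0_onehot,
                          stop_logits, stop_true):
    """Weighted sum of the three decoder losses described above."""
    # L1 loss for mel-generalized cepstrum coefficients.
    l1_mgc = np.abs(mgc_pred - mgc_true).mean()
    # Cross entropy for discretized log F0 (softmax output layer),
    # scaled by 0.45 to match the magnitude of the other terms.
    log_z = np.log(np.exp(lf0_logits).sum(axis=-1, keepdims=True))
    ce_lf0 = -((lf0_onehot * (lf0_logits - log_z)).sum(axis=-1)).mean()
    # Binary cross entropy for the stop flag (sigmoid output layer).
    p = 1.0 / (1.0 + np.exp(-stop_logits))
    ce_stop = -(stop_true * np.log(p) +
                (1.0 - stop_true) * np.log(1.0 - p)).mean()
    return l1_mgc + 0.45 * ce_lf0 + ce_stop

# Example with toy shapes (T frames, 60-D MGC, 256 F0 bins; placeholders).
T = 4
rng = np.random.default_rng(0)
lf0_onehot = np.eye(256)[rng.integers(0, 256, size=T)]
loss = tacotron_vocoder_loss(rng.normal(size=(T, 60)), rng.normal(size=(T, 60)),
                             rng.normal(size=(T, 256)), lf0_onehot,
                             rng.normal(size=T), rng.integers(0, 2, size=T))
```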
3. EXPERIMENTS

3.1. Experimental conditions
We used a Japanese speech corpus from the ATR Ximera dataset [15]. This corpus contains 28,959 utterances from a female speaker and is around 46.9 hours in duration. The linguistic features, such as phoneme and accentual-type labels, were manually annotated, and the phoneme labels had 58 classes, including silence, pause, and short pause [16]. To train our proposed systems, we trimmed the beginning and ending silence from the utterances, after which the duration of the corpus was reduced to 33.5 hours. We used 27,999 utterances for training, 480 for validation, and 142 for testing.

For the experiment, we built several TTS systems, as listed in Table 1. The JA-Tacotron and SA-Tacotron with and without accentual-type labels were built to show whether the investigated architectures can learn lexical pitch accents in an unsupervised manner. We also built an SA-Tacotron that uses vocoder parameters instead of a mel-spectrogram as the acoustic features. In addition, we included JA-Tacotron with forced alignment instead of predicted alignment to better understand the accuracy of duration modeling. With forced alignment, alignments are calculated with teacher forcing, and the target acoustic parameters are predicted with the alignments obtained in this way. Note that, in this setting, even though the forced alignments are calculated with teacher forcing, the acoustic parameter prediction itself does not use teacher forcing.

For JA-Tacotron and SA-Tacotron, we allocated 32 dimensions for the accentual-type embedding and 224 dimensions for the phoneme embedding. For the models without accentual-type embedding, 256 dimensions were allocated to the phoneme embedding. We set the reduction factor to two for the models using a mel-spectrogram as the target and three for the models using vocoder parameters. All the predicted frames of the acoustic features were fed back as the next input. At inference time, the inference was stopped on the basis of a binary stop flag, as in [3]. The networks were optimized with the Adam optimizer [17]. We used exponential learning rate decay with an initial rate of 0.0005 for the models using a mel-spectrogram and 0.002 for the models using vocoder parameters. We implemented our proposed systems using TensorFlow.

Footnote: The source code is available at https://github.com/nii-yamagishilab/self-attention-tacotron

As baseline systems, we included two classical pipeline systems that use vocoder parameters and a mel-spectrogram [16, 18, 19]. Unlike our proposed systems, these pipeline systems use full-context labels as linguistic features and need duration prediction models. To test how the accuracy of duration prediction affects the naturalness of synthetic speech, we compared phone durations predicted by a hidden semi-Markov model (HSMM) with oracle alignments obtained by forced alignment. Finally, as a reference for how sensitive listeners are to incorrect lexical pitch accents, a baseline with slightly corrupted accentual-type labels was also included.

Footnote: This system is named MOC in [16].

Two types of WaveNet models were trained for the experiment, one taking the mel-spectrograms as input and the other using the
MGC and F0 (vocoder parameters). These two WaveNets had the same network structure as that in our previous study [19].

Fig. 2: Alignment obtained by dual-source attention in SATMAP. The top figure shows the alignment between the output of the encoder's LSTM layer and the target mel-spectrogram (forward attention). The bottom figure shows the alignment between the output of the encoder's self-attention block and the target mel-spectrogram (additive attention). Vertical white lines indicate accentual phrase boundaries obtained by forward attention.

3.2. Objective evaluation

Figure 2 shows a visualization of the attention layers of
SA-Tacotron learned on the Japanese corpus. The first figure from the top shows the alignment of the encoder LSTM source and the mel-spectrogram target for dual-source attention. We can clearly see a sharp monotonic alignment formed by the forward attention. The second figure from the top shows the alignment of the encoder self-attention source and the mel-spectrogram target. It appears to be related to accentual phrase segments and phrase breaks divided by pauses.
What is the effect of accentual-type labels?: Figure 3 shows the mel-spectrograms predicted by SA-Tacotron with and without accentual-type labels. Accentual phrase boundaries predicted by the attention mechanism are also shown in the figure. Through comparison with a natural spectrogram, we see that the spectrogram predicted by SA-Tacotron without labels has wrong accentual positions and harmonics, whereas that from SA-Tacotron with labels does not. From informal listening, we also noticed that SA-Tacotron without labels had incorrect accent nucleus positions.
Comparison of mel-spectrogram and vocoder parameters: The alignment between source phonemes and target spectrogram frames should increase monotonically. A non-monotonic alignment may result in mispronunciation: phonemes being skipped, repetition, the same phoneme continuing, or premature termination. We therefore manually counted abnormal alignment errors in the test set. We observed no alignment errors for JA-Tacotron and SA-Tacotron using mel-spectrograms as the target. However, alignment errors were found for SA-Tacotron using vocoder parameters due to the target sequences being longer than the corresponding mel-spectrograms: we found 19 alignment errors out of 142 test utterances.
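The alignment errors above were counted manually. For readers who want an automated first pass, one simple heuristic is to trace the argmax path of the attention matrix and flag large backward jumps; the sketch below is such an illustrative heuristic, not the procedure used in this work, and the tolerance parameter is an assumption.

```python
import numpy as np

def has_alignment_error(attention, max_backstep=1):
    """Flag a (decoder_steps, encoder_steps) attention matrix whose
    argmax path moves backwards by more than `max_backstep` positions,
    i.e., a non-monotonic alignment (possible skip or repetition)."""
    path = attention.argmax(axis=1)
    return bool((np.diff(path) < -max_backstep).any())

# Example: a perfectly monotonic (diagonal) alignment raises no flag.
ok = np.eye(8)
assert not has_alignment_error(ok)
```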
3.3. Subjective evaluation

We recruited 236 native Japanese speakers as listeners by crowdsourcing. The listeners evaluated 32 samples from 16 systems in a single test set, which included natural speech and analysis by synthesis (copy synthesis). Each listener could evaluate at most 10 test sets. Each sample was evaluated 20 times, and we obtained 45,440 data points in total. Figure 4 shows the five-point mean opinion scores of the proposed and baseline systems from the listening test. Statistical significance was analyzed using the two-sided Mann-Whitney U test.

Fig. 3: Natural mel-spectrogram (top), mel-spectrogram predicted by SA-Tacotron with accentual-type labels (middle), and mel-spectrogram predicted by SA-Tacotron without labels (bottom). The black arrow in the bottom figure points to wrong harmonics that result in a wrong accent. White lines show accentual phrase boundaries acquired from the attention output.
Table 1: TTS systems used for our analysis. Notations: V = vocoder parameters, M = mel-spectrogram, A = accentual-type label, N = no accentual-type label, P = predicted alignment, F = forced alignment.

System   Architecture       Acoustic feature    Accent label   Alignment
SATVAP   SA-Tacotron        MGC & F0            ✓              predicted
SATMAP   SA-Tacotron        Mel-spec. 12.5 ms   ✓              predicted
SATMNP   SA-Tacotron        Mel-spec. 12.5 ms   N/A            predicted
TACMAP   JA-Tacotron        Mel-spec. 12.5 ms   ✓              predicted
TACMAF   JA-Tacotron        Mel-spec. 12.5 ms   ✓              force-aligned
TACMNP   JA-Tacotron        Mel-spec. 12.5 ms   N/A            predicted
TACMNF   JA-Tacotron        Mel-spec. 12.5 ms   N/A            force-aligned
PIPVAF   Pipeline [16, 19]  MGC & F0            ✓              force-aligned
PIPVAP   Pipeline [16, 19]  MGC & F0            ✓              predicted
PIPVCF   Pipeline [16, 19]  MGC & F0            corrupted      force-aligned
PIPMAF   Pipeline [16, 19]  Mel-spec. 5 ms      ✓              force-aligned
PIPMAP   Pipeline [16, 19]  Mel-spec. 5 ms      ✓              predicted
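For reference, the two-sided significance test mentioned above can be reproduced with SciPy's mannwhitneyu function; the score arrays in the sketch below are random placeholders, not the paper's listening-test data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder 5-point opinion scores for two systems (not real data).
rng = np.random.default_rng(1)
scores_a = rng.integers(1, 6, size=200)   # e.g., SATMAP
scores_b = rng.integers(1, 6, size=200)   # e.g., TACMAP

# Two-sided Mann-Whitney U test, as used for the MOS comparison.
stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
```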
What is the effect of accentual-type labels?: All proposed systems without accentual-type labels got significantly lower scores than the corresponding systems with labels; for example, JA-Tacotron without labels had a score of . ± . , whereas JA-Tacotron with labels got . ± . . This means that the architectures of the proposed systems cannot learn lexical pitch accents in an unsupervised fashion and require additional inputs. The pipeline system with corrupted labels also showed a significant drop, with a score of . ± . . This shows that incorrect accents affected listeners' judgments of the naturalness of the synthetic speech.

Does self-attention help?:
SA-Tacotron had better scores than JA-Tacotron under each condition, with or without accentual-type labels. This indicates that the self-attention layers have a positive effect on naturalness. Among our proposed systems, SA-Tacotron with labels (SATMAP) got the highest score of . ± . .

Fig. 4: Box plots of MOS scores of each system regarding the naturalness of synthetic speech. Red circles represent average values. NAT indicates natural speech. Refer to Table 1 for notations.

Comparison of mel-spectrogram and vocoder parameters: SA-Tacotron using vocoder parameters got a relatively low score, . ± . , even though it used accentual-type labels and self-attention layers. This is because this system generated alignment errors due to the prediction of longer sequences, as described in the previous section. Among the baseline systems, the systems using MGC and F0 had higher scores than the systems using mel-spectrograms under both the forced and predicted alignment conditions.

Comparison of predicted and forced alignment: Interestingly, JA-Tacotron using forced alignment got lower scores than that using predicted alignment, both with and without accentual-type labels. This result is surprising because, in traditional pipelines, forced alignment is used as an oracle alignment and normally leads to better perceptual quality than the predicted case. Since Tacotron learns both spectrograms and alignments simultaneously, it seems to produce the best spectrograms when it infers both of them. Among the baseline pipeline systems, as expected, forced alignment gave higher scores than predicted alignment for the systems using both vocoder parameters and mel-spectrograms. In the case of predicted alignment, the score distribution has a long tail towards the low-score region.
Comparison of pipeline and Tacotron systems: The best proposed system still does not match the quality of the best pipeline system: SA-Tacotron with accentual-type labels and the pipeline system using a mel-spectrogram and predicted alignment got . ± . and . ± . , respectively. This differs from the results of the English experiments reported in [3]. One major difference between our proposed systems and the pipeline systems, other than architecture, is the input linguistic features: our proposed systems use phoneme and accentual-type labels only, whereas the baseline pipeline systems use various linguistic labels including word-level information such as inflected forms, conjugation types, and part-of-speech tags. In particular, an investigation on the same Japanese corpus found that the conjugation type of the next word is quite useful for F0 prediction [20].
4. CONCLUSION
In this paper, we applied Tacotron to Japanese to extend it to a pitch-accent language. We proposed phone-based Tacotrons with and without accentual-type labels, one with self-attention layers to better capture long-term information, and one using vocoder parameters including the fundamental frequency. We conducted objective and subjective evaluations. Among the proposed systems, Tacotron with the self-attention extension outperformed that without self-attention, both with and without labels. However, we found that, unlike in the experiments reported for English, the quality of traditional pipeline systems is better than that of the proposed systems for Japanese. We also found that choosing vocoder parameters is beneficial for pipeline systems, but the opposite holds for Tacotron.

One major difference between our proposed systems and the pipeline systems is the absence of word-level information in the linguistic features, so incorporating this information may improve the quality of the proposed systems and bring them up to the pipeline systems' level. Our next step towards end-to-end speech synthesis in various languages is to incorporate word-level information such as Kanji.
Acknowledgements: We are grateful to Prof. Zhen-Hua Ling from USTC for kindly answering our questions.

5. REFERENCES

[1] Yuxuan Wang, R.J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. Interspeech, 2017, pp. 4006–4010.
[2] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.
[3] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
[4] Wei Ping, Kainan Peng, and Jitong Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," CoRR, vol. abs/1807.07281, 2018.
[5] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, and Ming Zhou, "Close to human quality TTS with transformer," CoRR, vol. abs/1809.08895, 2018.
[6] Antoine Bruguier, Heiga Zen, and Arkady Arkhangorodsky, "Sequence-to-sequence neural network model with 2D attention for learning Japanese pitch accents," in Proc. Interspeech, 2018, pp. 1284–1287.
[7] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Proc. NIPS, 2014, pp. 3104–3112.
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. ICLR, 2015.
[9] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai, "Forward attention in sequence-to-sequence acoustic modeling for speech synthesis," in Proc. ICASSP, 2018, pp. 4789–4793.
[10] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal, "Zoneout: Regularizing RNNs by randomly preserving hidden activations," in Proc. ICLR, 2017.
[11] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio, "A structured self-attentive sentence embedding," in Proc. ICLR, 2017.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Proc. NIPS, 2017, pp. 6000–6010.
[13] Barret Zoph and Kevin Knight, "Multi-source neural translation," in Proc. NAACL-HLT, 2016, pp. 30–34.
[14] Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Autoregressive neural F0 model for statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1406–1419, Aug. 2018.
[15] H. Kawai, T. Toda, J. Yamagishi, T. Hirai, J. Ni, N. Nishizawa, M. Tsuzaki, and K. Tokuda, "Ximera: A concatenative speech synthesis system with large scale corpora," IEICE Transactions on Information and Systems (Japanese Edition), pp. 2688–2698, 2006.
[16] Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, and Nobuyuki Nishizawa, "Investigating accuracy of pitch-accent annotations in neural-network-based speech synthesis and denoising effects," in Proc. Interspeech, 2018, pp. 37–41.
[17] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2014.
[18] Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Junichi Yamagishi, and Tomi Kinnunen, "Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 240–247.
[19] Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and Junichi Yamagishi, "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis," in Proc. ICASSP, 2018, pp. 4804–4808.
[20] Xin Wang, Fundamental Frequency Modeling for Neural-Network-Based Statistical Parametric Speech Synthesis, Ph.D. thesis, Department of Informatics, SOKENDAI, 2018.

A. HYPER-PARAMETERS
Table 2 shows the hyper-parameters used for SA-Tacotron with accentual-type embedding.
Table 2: Hyper-parameters

Frame length, shift       50 ms, 12.5 ms
Sample rate, FFT size     48 kHz, 4096
Embeddings                Phoneme: 224-D, Accent: 32-D
Encoder pre-net           Phoneme: 224/112-D, Accent: 32/16-D
Attention RNN             256-D cells, 10-D kernels, 5-D filters
Encoder & decoder LSTM    256-D cells, 10% zoneout rate
Encoder self-attention    32-D, 2 heads, 1 hop, 5% drop rate
Decoder self-attention    256-D, 2 heads, 1 hop, 5% drop rate
B. OBJECTIVE EVALUATION OF PREDICTED F0
Furthermore, we conducted an objective evaluation of four variant systems using vocoder parameters: JA-Tacotron and SA-Tacotron, each with and without accentual-type labels. To evaluate the F0 prediction capability of JA-Tacotron and SA-Tacotron, we computed objective metrics of F0. To calculate the metrics, we aligned the frames of the predicted and ground-truth F0 by using forced alignment. JA-Tacotron using vocoder parameters without labels failed to learn alignments between the source and target, so we did not include it.

Table 3: Objective evaluation of F0 predicted by JA-Tacotron and SA-Tacotron.

System   Accent      RMSE    CORR   U/V
TACVAF   ✓           –       0.88   –
SATVAF   ✓           –       0.88   –
SATVNF   N/A         39.30   0.79   7.04%
PIPVAF   ✓           –       0.94   –
PIPVCF   corrupted   31.09   0.89   3.29%
Table 3 shows the RMSE, correlation, and U/V errors of F0. Both JA-Tacotron and SA-Tacotron with accentual-type labels had an F0 correlation value of 0.88. This indicates that self-attention had no effect on F0 prediction accuracy. The systems without labels showed a lower correlation of 0.79 compared to the systems with labels because of wrong accents, as described in Section 3.2. Although these values are still lower than that of a baseline pipeline system, which had 0.94 [16], we think they are good enough considering that a frame shift of 5 ms is not the best condition for Tacotron. In addition, forced alignment itself has a negative effect on audio quality in Tacotron, as described in Section 3.3. The baseline system with noisy accentual-type labels had a correlation of 0.89. The noisy baseline had artificial accent errors with a probability of 50%. Even though this is almost the same as the correlation values of our proposed systems with accentual-type labels, we do not think our proposed systems have accent errors with a probability of 50%. As can be seen in the listening test results in Section 3.3, our proposed systems using mel-spectrograms with labels outperform the noisy baseline, so the relatively low correlation of F0 was caused by the unsuitable acoustic feature conditions for Tacotron.
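The metrics in Table 3 can be computed along the following lines. This is a hedged sketch of the standard definitions (RMSE and correlation over frames voiced in both tracks, and U/V error as the frame-level voicing disagreement rate); the convention of encoding unvoiced frames as 0 is an assumption for illustration.

```python
import numpy as np

def f0_metrics(f0_pred, f0_true):
    """RMSE and correlation of F0 over commonly voiced frames, plus
    the U/V error rate. Unvoiced frames are encoded as 0."""
    voiced_pred = f0_pred > 0
    voiced_true = f0_true > 0
    both = voiced_pred & voiced_true
    rmse = np.sqrt(np.mean((f0_pred[both] - f0_true[both]) ** 2))
    corr = np.corrcoef(f0_pred[both], f0_true[both])[0, 1]
    uv_error = np.mean(voiced_pred != voiced_true)  # voicing disagreement
    return rmse, corr, uv_error
```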
C. VISUAL AND STATISTICAL ANALYSIS OF SELF-ATTENTION AT ENCODER AND DECODER

Because the alignments of self-attention at the encoder in SA-Tacotron are hard to interpret at the sample level, we conducted a statistical analysis of the alignment scores of self-attention on a test set. We calculated the mean alignment scores of phoneme pairs that occur frequently (more than 30 times). In head 1, we found some alignments based on similarity. For example, the top three phoneme pairs with high score values are identical phoneme pairs, and there are 11 identical phoneme pairs within the top 100. In addition, we found 13 phoneme pairs that belong to the same group (e.g., long vowels) within the top 100. Head 2 showed a strong affinity to silence and pauses; we found 88 pairs that include a pause or silence among the top 100 pairs with the highest alignment scores.

Fig. 5: Alignment visualization of two heads from the self-attention layer at the decoder in SA-Tacotron. Three horizontal bands with relatively high activation correspond to pause positions.

Fig. 5 shows the alignments of two heads from the decoder self-attention in SA-Tacotron.
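The per-pair aggregation described above can be reproduced with a simple accumulation over phoneme pairs. The sketch below is an illustrative version of such an analysis, not the authors' script; it assumes one self-attention matrix and one phoneme sequence per utterance.

```python
import numpy as np
from collections import defaultdict

def mean_pair_scores(attentions, phoneme_seqs, min_count=30):
    """Mean self-attention alignment score per (query, key) phoneme
    pair, keeping pairs that occur more than `min_count` times."""
    sums, counts = defaultdict(float), defaultdict(int)
    for att, phones in zip(attentions, phoneme_seqs):
        for i, p_i in enumerate(phones):
            for j, p_j in enumerate(phones):
                sums[(p_i, p_j)] += att[i, j]
                counts[(p_i, p_j)] += 1
    return {pair: sums[pair] / counts[pair]
            for pair in sums if counts[pair] > min_count}

# The returned pairs can then be ranked to inspect, e.g., whether
# identical-phoneme pairs or pause/silence pairs dominate the top scores.
```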