Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning
Devang S Ram Mohan, Raphael Lenain*, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao
Papercup Technologies Ltd., Novoic
*Work done while at Papercup Technologies Ltd.
Abstract
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a reinforcement learning based framework to train an agent to make this decision. We compare our performance against that of deterministic, rule-based systems. Our results demonstrate that our agent successfully balances the trade-off between the latency of audio generation and the quality of synthesised audio. More broadly, we show that neural sequence-to-sequence models can be adapted to run in an incremental manner.
Index Terms: text to speech, reinforcement learning
1. Introduction
Efforts towards incremental text to speech (TTS) have typically focused on more traditional, non-neural architectures [1, 2, 3]. However, advancements in neural TTS [4, 5, 6] have resulted in near human levels of naturalness and thus motivate an exploration of neural incremental TTS systems.

Neural TTS systems typically adopt sequence-to-sequence architectures which require the entire input sequence to be processed before generating any units of the output sequence. This offline characteristic is often useful; for example, a question mark at the end of a sentence would impact the intonation of preceding words. On the other hand, synthesising speech incrementally from text could be valuable. Such a model could be placed at the tail-end of an incremental speech recognition and machine translation pipeline to obtain a real-time speech-to-speech translation system.

The development of these streaming, end-to-end architectures has seen considerable attention for the tasks of automatic speech recognition [7, 8, 9, 10] and machine translation (MT) [11, 12, 13, 14]. Inspired by the approach of [12], our proposed framework develops an agent that decides whether to trigger the encoder with the next input character (i.e., READ in Figure 1), or trigger the decoder with the characters read thus far (i.e., SPEAK in Figure 1). In this manner, our approach enables us to start generating mel-spectrograms while having read only a part of the input sentence. The mapping of these mel-spectrogram frames to raw audio waveforms can be achieved with an existing neural vocoder [4, 15] by adjusting its inference behaviour.

The challenge then lies in deciding when to incorporate an additional character into this restricted input subsequence. We use the REINFORCE algorithm [16] to train an agent to make this decision.

Figure 1: Trajectory for an arbitrary sequence of READ (red, along y-axis) and SPEAK (blue, along x-axis) actions
2. Background

[17] proposes an approach for incremental neural TTS. The model is based on the prefix-to-prefix framework [18] and leverages a policy which maintains a fixed latency (in terms of number of words) behind the input. However, it would be challenging to construct such a rule-based approach if the desired latency was to be measured in a more granular unit, such as characters or phonemes. Furthermore, a dynamic, learnt policy would allow this approach to be used for new languages and speakers without manual calibration of these parameters.

The arena of incremental machine translation has also seen advancements. [11] proposes the READ/WRITE framework and once again uses rule-based policies to enable incremental machine translation. [12] models this discrete action selection task using a reinforcement learning (RL) system, which we adapt in our work. Alternatively, [13] turns this non-differentiable framework into a supervised learning problem by training a model on sequences of interleaved READ/WRITE decisions generated from a pre-trained model.

A major challenge in any sequence transduction task is to align the target sequence with the source at each step. [7, 8] propose methods that leverage the RNN-T model [19] to address this for the task of speech recognition. As an alternative, the approaches in [9, 10] propose architectures which utilise the fact that in speech recognition, the length of the target sequence is less than that of the source. [20, 21, 22] use encoder-decoder architectures with attention, but compute the attention alignments in an online manner.

[23] adapts the online, monotonic attention mechanism proposed by [24] for the Tacotron 2 model. However, the motivation behind this was to ensure the surjectivity of the mapping between input elements and output frames and thus, the encoder and decoder architectures remain offline. Furthermore, the atomic input unit is a phoneme, which can only be computed given the entire word. RL based approaches have also been used to generate attention weights for image captioning [25, 26, 27]. However, these attention mechanisms generate hard attention weights, which is undesirable for TTS [28].
3. Tacotron 2 Modifications
Our base model builds on the Tacotron 2 model, with certain modifications for the incremental setting. Note that while these modifications may affect the quality of synthesised speech, they are necessary restrictions for incremental synthesis.

The encoder is altered by simply removing the convolutional layers and replacing the bi-directional LSTM [29, 30] with a uni-directional one. We further discard the post-net module, leaving the attention mechanism as the only component that renders this model offline. Rather than modifying the computation of the alignment weights and potentially enforcing a hardness constraint, we maintain the soft attention weights and suitably restrict their scope as described in Section 4.

Finally, note that Tacotron 2 also has a vocoder component, which maps the mel-spectrogram to the raw audio waveform. We use a different vocoder architecture [15] and adapt its inference behaviour to work in a purely auto-regressive manner by restricting the number of mel-spectrogram frames input to its residual and up-sampling networks.

For the remainder of this paper, we use this modified Tacotron 2 architecture to generate mel-spectrograms with the understanding that any incremental vocoder can be leveraged for synthesis.
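To make the encoder change concrete, the following is a minimal PyTorch sketch; the module layout and dimensions are illustrative assumptions rather than our exact configuration. Because the recurrence runs strictly left to right, each newly read character extends the buffer of encoder outputs without recomputing the prefix.

```python
import torch
import torch.nn as nn

class IncrementalEncoder(nn.Module):
    """Sketch of the modified encoder: no convolutional stack, and a
    uni-directional LSTM, so h_i depends only on x_1, ..., x_i."""

    def __init__(self, num_chars=128, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(num_chars, emb_dim)
        # Single direction: no access to characters that follow.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids, state=None):
        # char_ids: (batch, n_new_chars); `state` carries the LSTM
        # state across reads, so prefixes are never re-encoded.
        emb = self.embedding(char_ids)
        outputs, state = self.lstm(emb, state)
        return outputs, state

# One character at a time, extending the buffer of encoder outputs.
encoder = IncrementalEncoder()
state, h_buffer = None, []
for char_id in [10, 42, 7]:                    # three successive reads
    h, state = encoder(torch.tensor([[char_id]]), state)
    h_buffer.append(h.squeeze(1))              # h_i: (1, hidden_dim)
```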
4. Incremental Text to Speech using Reinforcement Learning
Inspired by [12], we maintain an increasing buffer of input characters, which the model attends over to synthesise the next mel-spectrogram frame. We then train an agent to make the decision of whether to add the next input character into this buffer, or to synthesise a frame of audio based on the information in the buffer. To train this agent, we leverage the RL paradigm.
The RL setup consists of a decision maker, called the agent, interacting with an environment, typically over a sequence of discrete steps which we index by j. At the j-th interaction step, the agent selects an action a_j, which the environment executes, and returns a new observation o_{j+1} (which is a representation of how its internal state has changed) and a numerical reward, r_{j+1}. In addition, the environment returns a flag which indicates whether this particular episode of interactions has completed, called the terminal flag. The task for the agent, then, is to learn a mapping from the space of all possible observations to a suitable action. Such a mapping, called a policy, should attempt to maximise the cumulative numerical reward achieved over the course of an episode (typically discounted temporally by a factor γ ∈ [0, 1)) [31].

Formally, let x_1, ..., x_N denote the sequence of input character embeddings and h_1, ..., h_N denote the corresponding encoder outputs from our modified Tacotron 2 (Section 3). Our modifications enable h_i to be computed without knowledge of x_{i+1}, ..., x_N. Let the associated ground-truth mel-spectrogram y consist of T frames. At the j-th step of an episode, let R(j) ∈ {0, ..., N} denote the number of characters that have been read and S(j) ∈ {0, ..., T} represent the number of audio frames generated (aligned by teacher-forcing [4] during training). Let α_{i,S(j)} denote the alignment weight over h_i while generating the S(j)-th decoder output, ŷ_{S(j)}.

Instead of using {h_1, ..., h_N} to compute these weights (and thence the attention context), we use our restricted buffer {h_1, ..., h_{R(j)}}. This approach guarantees that, at the time of synthesising the S(j)-th frame of audio, our Tacotron 2 model only has access to the first R(j) characters.

The actions available to the agent are:
• READ: (step along the vertical axis in Figure 1) Provides the attention mechanism with an additional character over which it may attend.

• SPEAK: (step along the horizontal axis in Figure 1) Results in the generation of a mel-spectrogram frame based on the characters read thus far.

Then, a desirable learnt policy might be the agent learning to SPEAK as soon as there is enough READ context, and to resume READing only when the existing context has been fully synthesised. Observe that the offline behaviour can also be obtained as a specific policy (READ all characters and then SPEAK until all frames are synthesised).
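The interaction can be pictured as a simple loop. Below is a minimal Python sketch under our own naming: encoder_step, decoder_step and policy are hypothetical stand-ins for the modified Tacotron 2 components and the learnt policy, and the real episode additionally accumulates rewards and checks the termination conditions described later in this section.

```python
READ, SPEAK = 0, 1

def run_episode(encoder_step, decoder_step, policy, chars, max_frames):
    """Interleave READ and SPEAK actions until enough frames exist."""
    h_buffer, frames = [], []        # h_1..h_R(j) and synthesised mels
    enc_state, obs, n_read = None, None, 0
    while len(frames) < max_frames:
        if n_read == 0:
            action = READ            # must read before speaking
        elif n_read == len(chars):
            action = SPEAK           # nothing left to read
        else:
            action = policy(obs)     # the agent decides
        if action == READ:
            h, enc_state = encoder_step(chars[n_read], enc_state)
            h_buffer.append(h)
            n_read += 1
        else:
            # Attention is restricted to the characters read so far.
            frame, obs = decoder_step(h_buffer)
            frames.append(frame)
    return frames

# Toy demo with stand-in components (the real ones are the modified
# Tacotron 2 encoder/decoder and the learnt policy network):
frames = run_episode(
    encoder_step=lambda ch, st: (ch, st),      # "encoder": identity
    decoder_step=lambda buf: (sum(buf), buf),  # "decoder": dummy frame
    policy=lambda obs: SPEAK,                  # always speak if allowed
    chars=[1, 2, 3], max_frames=4)
```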
The environment uses a trained modified Tacotron 2 model to provide the agent with the requisite information and feedback. Suppose we have just received action a_{j-1}. The environment increments the appropriate counter (R(j) or S(j), based on a_{j-1}) and passes h_1, ..., h_{R(j)} to the attention module, which computes α_{1,S(j)}, ..., α_{R(j),S(j)}. The context vector is then

c_{S(j)} = Σ_{i=1}^{R(j)} α_{i,S(j)} h_i    (1)

Since we want o_j to contain enough information for the agent to decide whether to READ or SPEAK, we define o_j to be the concatenation of:

• c_{S(j)}: The attention context vector based on the R(j) characters read thus far.

• α_{·,S(j)}[k:]: A fixed-length moving window of the latest attention weights. This term was found to be crucial for learning a good policy.

• y_{S(j)} (during training) or ŷ_{S(j)} (during evaluation): The most recent mel-spectrogram frame.

Underpinning our RL framework is the understanding that the quality of the generated output may trade off against the delay incurred. Thus, we define our reward as

r_j := r_j^D + r_j^Q    (2)

where r_j^D encourages low latency while r_j^Q encourages high quality synthesis. Motivated by the treatment in [12], we define

r_j^D := r_j^CR + r_j^AP    (3)

where

• r_j^CR is a local signal to discourage consecutive READ actions:

r_j^CR := ω × (sgn(c_j − c*) + 1)    (4)

where c_j is a counter for consecutive READs, c* is an acceptable number of consecutive READs and ω < 0 is a hyper-parameter.

• r_J^AP is a global penalty incurred only at the end of an episode:

r_J^AP := β × ⌊d_T − d*⌋₊    (5)

where ⌊·⌋₊ denotes max(·, 0). Geometrically, d_T corresponds to the average proportion of area under the policy path (Figure 1). A value of 1 for d_T corresponds to READing the entire input sequence before generating any output, while 0 corresponds to the unattainable scenario of synthesising all the audio without READing any characters. d* is a target value for d_T and β < 0 is a hyper-parameter. Prior works in MT [12, 18] have a detailed description of these terms.

To compute r_j^Q, we use the mean squared error (MSE) between the ground truth and generated mel-spectrograms (aligned using teacher forcing). While the MSE is limited as a measure of perceived quality [32], its usage as a training objective for our underlying Tacotron 2 model suggests it is suitable for our setting. We obtain a quality penalty term given by

r_j^Q := λ × MSE(y_{S(j)}, ŷ_{S(j)})    (6)

where λ < 0. When a READ is executed, r_j^Q is set to 0.

At train time, there are two ways that the episode can terminate:

• R(j) = N (all the characters have been read): At this point the agent is forced to SPEAK until all T frames have been generated. It is then given a cumulative reward for these SPEAK actions.

• S(j) = T (all the aligned mel-spectrogram frames have been consumed): At this point, the agent is given an additional penalty equal to the number of unread characters and the episode is terminated.

During inference, the episode runs until our Tacotron 2 model's termination criterion (i.e., the stop token) is triggered.

The agent receives an observation o_j which is passed through a policy network consisting of a 512-dimensional GRU unit, a 2-layer dense network with ReLU non-linearity, and a softmax layer, to produce a 2-dimensional vector of action probabilities. To learn these policy parameters θ, we use the policy gradient method [16], which maximises expected cumulative discounted reward. However, as a variance reduction technique, we replace the discounted returns G_j in the update with a normalised advantage value [33]. To compute this we subtract a baseline return, b_φ(o_j) (where φ parameterises a 3-layer fully connected network), and then normalise the result [33, 34]. To learn the baseline network parameters φ, we minimise the expected squared loss between G_j and b_φ(o_j).

For both terms, the expectation is approximated by sampling a trajectory under the policy π_θ. All parameters are trained jointly on collected batches of transitions.
5. Experiments
We use the LJ Speech dataset [35], which consists of English audio from a single speaker. We partition this dataset into 12,000 train and 1,100 test/validation data points. We train our modified Tacotron 2 model for 300,000 iterations following the training routine in [4].

We set the (negative) weights of each reward component, ω, β and λ, to ensure that the scale of each contribution is comparable. The target number of consecutive characters read, c*, and the target average proportion of area under the policy path, d*, are set to fixed values; these are interpretable levers that allow the model's behaviour to be tweaked. A fixed look-back is used for the attention window.

During training, actions are sampled according to the probabilities returned by the policy to encourage exploration of the observation space. While evaluating, actions are chosen greedily. We use a discount factor γ < 1 and train on batches of collected transitions at regular episode intervals, using an Adam optimiser [36] initialised with a fixed learning rate.

To gauge the performance of our agent, we used two types of benchmark policies, inspired by [17, 12]:
Wait-Until-End (WUE): Execute READ actions until the text buffer is empty and then decode everything. Since this policy has access to the entire input sentence at the time of decoding, this gives an upper bound on the quality of the synthesised speech, at the cost of the largest possible delay.

Wait-k-Steps (WkS): Execute a READ action every k steps, and decode in between. Despite incurring a smaller delay, the restricted access to the input sentence while decoding may impact the quality of the generated speech. (Both schedules are sketched in code after the discussion below.)

Figure 2 depicts the attention alignments and policy path for a sample sentence. Figures 2a and 2b show that, for a large part of the decoding process, the WUE and W2S policies have access to more characters than required, which highlights an avoidable latency. Figure 2c suggests that the W3S policy is able to reduce these unnecessary READs. However, the resulting policy path appears to collide with the 'prominent' alignments on multiple occasions. As a result, the audio quality at these points is compromised because the decoder does not have sufficient context. This motivates the idea that an ideal policy path should hug the prominent alignments diagonal closely from above to successfully balance the quality of synthesis and latency incurred. Our learnt policy (Figure 2d) does precisely that. This suggests that the agent has in fact learnt to READ only when necessary and SPEAK only when it has something relevant to output.

English (and French) audio samples can be found at https://research.papercup.com/samples/incremental-text-to-speech
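For concreteness, the two benchmark schedules can be written down as action sequences. This is a sketch of our reading of the rules; edge-case handling in the actual implementation may differ.

```python
READ, SPEAK = 0, 1

def wait_until_end(n_chars, n_frames):
    """WUE: read the whole sentence, then synthesise every frame."""
    return [READ] * n_chars + [SPEAK] * n_frames

def wait_k_steps(n_chars, n_frames, k):
    """WkS: one READ every k steps, SPEAKs in between."""
    actions, n_read, n_spoken, step = [], 0, 0, 0
    while n_read < n_chars or n_spoken < n_frames:
        if n_read < n_chars and step % k == 0:
            actions.append(READ); n_read += 1
        elif n_spoken < n_frames:
            actions.append(SPEAK); n_spoken += 1
        else:                       # frames exhausted: flush the reads
            actions.append(READ); n_read += 1
        step += 1
    return actions

# e.g. k=3 yields READ, SPEAK, SPEAK, READ, SPEAK, SPEAK, ...
print(wait_k_steps(n_chars=2, n_frames=4, k=3))
```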
Figure 2: Policy Path with Attention Alignments (English): (a) Wait Until End (WUE) Policy; (b) Wait 2 Steps (W2S) Policy; (c) Wait 3 Steps (W3S) Policy; (d) Learnt Policy. Each plot depicts the policy path and the attention alignments (by colour). The greyed out section represents portions of the input sentence that are unavailable as those input characters have not yet been read.
Figure 3: Average WER vs Latency (d_T) on a test set comprising 40 samples from LJ Speech labelled by 5 annotators

There are two aspects of the agent's performance that we track:
Quality:
We compute the Mean Opinion Score (MOS) to measure the naturalness of our audio [37, 4]. We considered using a MUSHRA test [38]. However, since some policies may generate unintelligible samples of audio, which in turn could be scored below a noisy anchor, this approach was set aside. We are also interested in measuring the intelligibility of the synthesised speech. Automatic speech recognition systems use word error rate (WER) to measure transcription quality [39]. Following this approach, we obtain human transcriptions of the speech and compute the WER against the ground truth.
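The WER itself is the standard Levenshtein-based measure over words. A minimal, generic sketch (not the annotation pipeline used here) is:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + insertions + deletions) / len(reference),
    computed via edit distance over whitespace-tokenised words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # 0.333...
```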
Latency:
We use the proportion of area under the policy path, d_T ∈ [0, 1], described in Section 4.3. This metric lacks interpretability in terms of the actual delay incurred (e.g. the number of extra characters read). An alternate average lagging metric has been proposed in the MT setting [18]. However, the skewed ratio between the source and target lengths for TTS, coupled with a soft alignment between source and target, makes this metric challenging to adapt to TTS.
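Under our reading of this metric (adapted from the average proportion of [12]), d_T can be computed from an action sequence as follows; function and variable names are ours:

```python
READ, SPEAK = 0, 1

def average_proportion(actions, n_chars, n_frames):
    """d_T: the average, over synthesised frames, of the fraction of
    the input that had been read when each frame was produced -- i.e.
    the normalised area under the policy path in Figure 1."""
    n_read, total = 0, 0.0
    for a in actions:
        if a == READ:
            n_read += 1
        else:
            total += n_read / n_chars  # proportion read at this SPEAK
    return total / n_frames

# The offline (WUE) policy attains the maximum value, d_T = 1.0:
wue = [READ] * 5 + [SPEAK] * 8
print(average_proportion(wue, n_chars=5, n_frames=8))  # 1.0
```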
Figure 4: Average MOS vs Latency (d_T) on a test set comprising 40 samples from LJ Speech labelled by 10 evaluators

Figures 3 and 4 depict the inherent trade-off between quality and latency. The ground truth marker depicts the value of the relevant metric for the vocoded ground truth mel-spectrograms. We begin by observing that the W3S policy incurs the least delay, closely followed by our online agent, while the W2S and WUE policies incur substantial delays. In terms of intelligibility, our online agent achieves a better WER than W3S, and even outperforms W2S despite its sizeable latency advantage. In terms of naturalness, our agent similarly outperforms W3S on MOS, but in this case, W2S was, as expected, able to leverage the additional latency to produce more natural sounding speech.

These findings establish that our agent is able to learn a policy that successfully balances the quality of the synthesised output against the latency incurred. The W2S policy is either comparable (intelligibility) or marginally better (naturalness) than our online agent, but in doing so, performs a large number of premature READ actions. Our agent incurs a slightly larger delay than the W3S policy, and manages to outperform it on all quality metrics.
6. Future Work
Our results show that for neural sequence-to-sequence, attention-based TTS models, there is no algorithmic barrier to incrementally synthesising speech from text. It is also interesting to analyse the learnt policy for different languages given the varied challenges posed (e.g. elisions and liaisons in French [40]). We provide samples from an agent trained on the French SIWIS dataset [41] with the same setup as described, on our samples page.

Furthermore, we used a modified Tacotron 2 model, pre-trained on full sentences. It would be interesting to analyse whether jointly learning the Tacotron weights helps synthesise partial fragments of a sentence better.

7. Acknowledgements

We would like to thank Simon King, Mark Herbster and Mark Gales for their valuable input on this research.
8. References

[1] T. Baumann and D. Schlangen, "The InproTK 2012 release," in NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data. Association for Computational Linguistics, 2012, pp. 29–32.

[2] ——, "INPRO iSS: A component for just-in-time incremental speech synthesis," in Proceedings of the ACL 2012 System Demonstrations, 2012.

[3] Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[4] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in ICASSP. IEEE, 2018, pp. 4779–4783.

[5] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," ICLR Workshop Track, 2017.

[6] S. Vasquez and M. Lewis, "MelNet: A generative model for audio in the frequency domain," arXiv preprint arXiv:1906.01083, 2019.

[7] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., "Streaming end-to-end speech recognition for mobile devices," in ICASSP. IEEE, 2019, pp. 6381–6385.

[8] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss," arXiv preprint arXiv:2002.02562, 2020.

[9] H. Sak, M. Shannon, K. Rao, and F. Beaufays, "Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping," in INTERSPEECH, 2017, pp. 1298–1302.

[10] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, "A neural transducer," arXiv preprint arXiv:1511.04868, 2015.

[11] K. Cho and M. Esipova, "Can neural machine translation do simultaneous translation?" arXiv preprint arXiv:1606.02012, 2016. [Online]. Available: http://arxiv.org/abs/1606.02012

[12] J. Gu, G. Neubig, K. Cho, and V. O. Li, "Learning to translate in real-time with neural machine translation," arXiv preprint arXiv:1610.00388, 2016.

[13] B. Zheng, R. Zheng, M. Ma, and L. Huang, "Simpler and faster learning of adaptive policies for simultaneous translation," Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 1349–1354, 2019.

[14] N. Arivazhagan, C. Cherry, W. Macherey, C.-C. Chiu, S. Yavuz, R. Pang, W. Li, and C. Raffel, "Monotonic infinite lookback attention for simultaneous machine translation," in ACL, Jul. 2019, pp. 1313–1323.

[15] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," PMLR, vol. 80, pp. 2410–2419, 2018.

[16] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.

[17] M. Ma, B. Zheng, K. Liu, R. Zheng, H. Liu, K. Peng, K. Church, and L. Huang, "Incremental text-to-speech synthesis with prefix-to-prefix framework," arXiv preprint arXiv:1911.02750, 2019.

[18] M. Ma, L. Huang, H. Xiong, K. Liu, C. Zhang, Z. He, H. Liu, X. Li, and H. Wang, "STACL: Simultaneous translation with integrated anticipation and controllable latency," arXiv preprint arXiv:1810.08398, 2018.

[19] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.

[20] C. Chiu and C. Raffel, "Monotonic chunkwise attention," ICLR, 2018.

[21] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in ICASSP. IEEE, 2016, pp. 4945–4949.

[22] A. Tjandra, S. Sakti, and S. Nakamura, "Local monotonic attention mechanism for end-to-end speech and language processing," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

[23] INTERSPEECH, 2019, pp. 1293–1297.

[24] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, "Online and linear-time attention by enforcing monotonic alignments," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2837–2846.

[25] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.

[26] W. Zaremba and I. Sutskever, "Reinforcement learning neural Turing machines," arXiv preprint arXiv:1505.00521, 2015.

[27] J. Ling, "Coarse-to-fine attention models for document summarization," Ph.D. dissertation, 2017.

[28] E. Battenberg, R. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby, "Location-relative attention mechanisms for robust long-form speech synthesis," 2019.

[29] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[31] R. S. Sutton et al., Introduction to Reinforcement Learning, 2nd ed., 1998.

[32] D. Elbaz and M. Zibulevsky, "Perceptual audio loss function for deep learning," arXiv preprint arXiv:1708.05987, 2017.

[33] A. Mnih and K. Gregor, "Neural variational inference and learning in belief networks," ICML, 2014.

[34] D. Silver, "Lectures on reinforcement learning," 2015.

[35] K. Ito, "The LJ speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.

[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[37] R. C. Streijl, S. Winkler, and D. S. Hands, "Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives," Multimedia Systems, vol. 22, no. 2, pp. 213–227, 2016.

[38] B. Series, Method for the subjective assessment of intermediate quality level of audio systems. Geneva: International Telecommunication Union, 2015.

[39] A. Ali and S. Renals, "Word error rate estimation for speech recognition: e-WER," in