ICE-Talk: an Interface for a Controllable Expressive Talking Machine
Noé Tits, Kevin El Haddad, Thierry Dutoit
Numediart Institute, University of Mons
{noe.tits, kevin.elhaddad, thierry.dutoit}@umons.ac.be

Abstract
ICE-Talk is an open-source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover, it is implemented as a module that can be used as part of a Human-Agent interaction.

Index Terms: expressive, controllable, speech synthesis, human-computer interaction, deep learning, web interface
1. Introduction and Motivations
Speech synthesis is an important component of Human-Robot Interaction [1]. However, as of today, expressiveness in speech generated by Text-to-Speech (TTS) systems is under-explored in such interactions. The reason is the difficulty of accessing the variables controlling speech expressiveness in a deep learning-based TTS system [2].

To tackle this problem, we propose a tool allowing the control of these variables through a graphical interface, thus contributing to the democratization of the use of Deep Learning (DL)-based TTS systems in Human-Agent Interaction (HAI) applications. The goal of this tool is to be generic enough to be integrated into any HAI application.

This interface allows direct and intuitive graphical control over the synthesis parameters of a DL-based model through its latent space. It therefore enables several interesting applications and experiments, such as listening tests for the evaluation of such systems, thanks to easy prototyping of experiments. An example of such an experiment is available. This tool also makes it possible to study the impact of expressive synthesized speech in Human-Robot Interaction.
2. Related Work
As of today, some web interfaces allow the use of DL TTS models (e.g., https://github.com/keithito/tacotron). They let the user write text that is sent to the model, and return the synthesized speech as an audio object that one can listen to. The text is therefore the only control variable that can be accessed.

Recently, an interface allowing TTS conditioned on the speaker characteristics of a given audio sample was developed, based on the research of the Tacotron team [3, 4] (https://github.com/CorentinJ/Real-Time-Voice-Cloning). It allows selecting a reference audio file and synthesizing speech from text that imitates the voice of the reference. It is, however, not possible to interact with a latent space representing acoustic variability.

In this paper, we provide a web interface (https://github.com/noetits/ICE-Talk) capable of visualizing and exploring a space of voice expressiveness and of synthesizing the corresponding expressive speech.
3. Description of ICE-Talk
Figure 1 depicts the different components of the system architecture. It consists of a DL unsupervised TTS model trained on an expressive dataset (see Section 3.2).

Figure 1: System architecture.

To make the model available as a web service and to communicate text, audio, and style information between the web interface and the TTS model, the Falcon web framework (https://falcon.readthedocs.io/en/stable/) is used. Falcon bridges the gap between Python code and a web interface, allowing Deep Learning frameworks to be used through a web application.
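To illustrate how a TTS model can be exposed this way, here is a minimal Falcon sketch. The route name, the request fields, and the synthesize helper are illustrative assumptions, not the actual ICE-Talk code.

```python
# Minimal sketch of exposing a TTS model as a Falcon web service.
# The route, request fields, and `synthesize` helper are assumptions.
import falcon


def synthesize(text, style_vector):
    """Placeholder: run the TTS model and return WAV bytes."""
    raise NotImplementedError


class TTSResource:
    def on_post(self, req, resp):
        # Read the text and the latent-space style vector from the request.
        payload = req.media
        wav_bytes = synthesize(payload["text"], payload["style"])
        # Serve the generated waveform so the browser can play it
        # as an HTML5 audio object.
        resp.content_type = "audio/wav"
        resp.data = wav_bytes


app = falcon.App()
app.add_route("/synthesize", TTSResource())
```

Such an app can then be served with any WSGI server (e.g., gunicorn).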
As a use case, we use a modified version of Deep Convolutional Text-to-Speech (DCTTS) [5], a state-of-the-art Deep Learning Sequence-to-Sequence (seq2seq) model whose output is controllable through a latent space designed to represent variations in voice style, as described in [6].

A TTS seq2seq model typically consists of an encoder-decoder structure. Text is encoded as a latent representation that is then decoded with an attention-based decoder to predict a mel-spectrogram, later inverted to an audio waveform.

To obtain a voice style representation, a mel-spectrogram encoder is added in [6]. It consists of a stack of 1D convolutional layers followed by an average pooling. This operation ensures that the resulting representation is time-invariant. It can thus contain information about statistics of prosody, such as average pitch or average speaking rate, but not a pitch evolution over time.

The interface contains a 2D representation of a latent space, which is an internal representation of the data distribution learned by the network. This 2D representation is obtained via a dimensionality reduction applied to the high-dimensional latent space of the system. The interface also contains a text box for the system's input and an audio player for the system's output. The latent space represents the distribution of some controlling parameters of the output speech (the expressiveness, for instance) and is obtained after training.

Figure 2: ICE-Talk web interface. It consists of a text field, an image representing a latent space of the vocal variability contained in the dataset, and an audio player to listen to the synthesized utterance.

By writing a text and clicking on a point in the 2D space, an audio signal is generated with the parameter values corresponding to the clicked point. The web interface is implemented in HTML5 and JavaScript to use the service.

There are several possibilities for the dimensionality reduction: UMAP, PCA, or t-SNE. The mouse click is detected in pixel coordinates using JavaScript and mapped to the reduced data space. Nearest-neighbour regression is then used to find the closest 2D data point, and a lookup table gives the corresponding 8D point of the latent space. The text and the 8D vector are fed to the model, which generates the sentence and saves it into a WAV file. The WAV file is then served and played as an HTML5 audio object.
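To make the style-encoder idea concrete, here is a minimal PyTorch sketch of a stack of 1D convolutions followed by average pooling over time. The channel sizes, kernel sizes, and layer count are illustrative assumptions, not the exact architecture of [6]; only the 8-dimensional output matches the latent space dimension used here.

```python
# Minimal sketch of a style encoder of the kind described above:
# 1D convolutions over mel-spectrogram frames, then average pooling
# over time so that the output is time-invariant. Layer sizes are
# illustrative, not the architecture of [6].
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    def __init__(self, n_mels=80, latent_dim=8):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, mel):
        # mel: (batch, n_mels, time)
        h = self.convs(mel)
        # Averaging over the time axis discards temporal evolution
        # and keeps only utterance-level prosody statistics.
        return h.mean(dim=2)  # (batch, latent_dim)
```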
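The click-to-latent pipeline can likewise be sketched in a few lines. The sketch below assumes the 8D latent vectors of the training utterances are available as a NumPy array; the file name, the linear pixel-to-data mapping, and the helper clicked_point_to_latent are hypothetical, and PCA stands in for any of the three reduction methods mentioned above.

```python
# Sketch of mapping a mouse click back to an 8D latent vector.
# Assumes the training utterances' latent vectors are precomputed;
# names and the pixel mapping are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

latents_8d = np.load("latents.npy")  # (n_utterances, 8), assumed file

# Dimensionality reduction (PCA here; UMAP or t-SNE work the same way).
reducer = PCA(n_components=2)
latents_2d = reducer.fit_transform(latents_8d)

# Nearest-neighbour index over the reduced space; the "lookup table"
# is simply the row correspondence between latents_2d and latents_8d.
nn_index = NearestNeighbors(n_neighbors=1).fit(latents_2d)


def clicked_point_to_latent(x_px, y_px, width_px, height_px):
    """Map a mouse click (pixel coordinates) to an 8D latent vector."""
    # Linearly map pixels to the bounding box of the reduced data
    # (pixel y grows downward, hence the flip).
    mins, maxs = latents_2d.min(axis=0), latents_2d.max(axis=0)
    point_2d = mins + np.array([x_px / width_px,
                                1 - y_px / height_px]) * (maxs - mins)
    _, idx = nn_index.kneighbors(point_2d.reshape(1, -1))
    return latents_8d[idx[0, 0]]
```

The returned 8D vector and the input text are then fed to the TTS model, which writes the generated waveform to a WAV file served by the web service.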
4. Conclusions and Future Works
We presented ICE-Talk, an innovative interface for Controllable Expressive TTS that demonstrates a proof of concept of research results previously presented at Interspeech. It is open source and ready to be used with available pre-trained models. As future work, other available models, such as a multi-speaker TTS, could be integrated in order to generate speech from different speakers, based not on reference audio but on a latent space describing speaker characteristics.
5. Acknowledgements
Noé Tits is funded through a FRIA grant (Fonds pour la Formation à la Recherche dans l'Industrie et l'Agriculture, Belgium).
6. References

[1] N. Tits, K. El Haddad, and T. Dutoit, "The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach," in Human-Computer Interaction. IntechOpen, 2019. [Online]. Available: http://dx.doi.org/10.5772/intechopen.89849

[2] N. Tits, "A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech - a Deep Learning Approach," 2019, pp. 1-5.

[3] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno et al., "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis," arXiv preprint arXiv:1806.04558, 2018.

[4] C. Jemine et al., "Master thesis: Automatic multispeaker voice cloning," 2019.

[5] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in ICASSP. IEEE, 2018, pp. 4784-4788.

[6] N. Tits, F. Wang, K. El Haddad, V. Pagel, and T. Dutoit, "Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis," in Proc. Interspeech, 2019.