ICE-Talk: an Interface for a Controllable Expressive Talking Machine
Noé Tits, Kevin El Haddad, Thierry Dutoit
Numediart Institute, University of Mons
{noe.tits, kevin.elhaddad, thierry.dutoit}@umons.ac.be

Abstract
ICE-Talk is an open-source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover, it is implemented as a module that can be used as part of a Human-Agent interaction.

Index Terms: expressive, controllable, speech synthesis, human-computer interaction, deep learning, web interface
1. Introduction and Motivations
Speech synthesis is an important component of Human-Robot Interaction [1]. However, as of today, expressiveness in speech generated by Text-to-Speech (TTS) systems is under-explored in such interactions. The reason is the difficulty of accessing the variables controlling speech expressiveness in a deep learning-based TTS system [2].

To tackle this problem, we propose a tool allowing the control of these variables through a graphical interface, thus contributing to the democratization of the use of Deep Learning (DL)-based TTS systems in Human-Agent Interaction (HAI) applications. The goal of this tool is to be generic enough to be integrated into any HAI application.

This interface allows direct and intuitive graphical control over the synthesis parameters of a DL-based model through its latent space. It therefore enables several interesting applications and experiments, such as listening tests for the evaluation of such systems, thanks to easy prototyping of experiments. An example of such an experiment is available. This tool also makes it possible to study the impact of expressive synthesized speech in Human-Robot Interaction.
2. Related Work
As of today, some web interfaces allow the use of DL TTS models (e.g., https://github.com/keithito/tacotron). They let the user write text that is sent to the model, and return the synthesized speech as an audio object that one can listen to. The text is therefore the only control variable that can be accessed.

Recently, an interface allowing TTS conditioned on the speaker characteristics of a given audio sample was developed, based on the research of the Tacotron team [3, 4] (https://github.com/CorentinJ/Real-Time-Voice-Cloning). It allows selecting a reference audio file and synthesizing speech from text that imitates the voice of the reference. It is, however, not possible to interact with a latent space representing acoustic variability.

In this paper, we provide a web interface (https://github.com/noetits/ICE-Talk) capable of visualizing and exploring a space of voice expressiveness and of synthesizing the corresponding expressive speech.
3. Description of ICE-Talk
Figure 1 depicts the different components of the system architecture. It consists of a DL unsupervised TTS model trained on an expressive dataset (see Section 3.2).

Figure 1: System architecture.

To make the model available as a web service and to communicate text, audio, and style information between the web interface and the TTS model, the Falcon web framework (https://falcon.readthedocs.io/en/stable/) is used. Falcon bridges the gap between Python code and a web interface, allowing Deep Learning frameworks to be used through a web application.
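To illustrate how a TTS model can be exposed this way, here is a minimal Falcon sketch. The route name, the request fields, and the synthesize helper are illustrative assumptions, not the actual ICE-Talk code.

```python
# Minimal sketch of exposing a TTS model as a Falcon web service.
# The route, request fields, and `synthesize` helper are assumptions.
import falcon


def synthesize(text, style_vector):
    """Placeholder: run the TTS model and return WAV bytes."""
    raise NotImplementedError


class TTSResource:
    def on_post(self, req, resp):
        # Read the text and the latent-space style vector from the request.
        payload = req.media
        wav_bytes = synthesize(payload["text"], payload["style"])
        # Serve the generated waveform so the browser can play it
        # as an HTML5 audio object.
        resp.content_type = "audio/wav"
        resp.data = wav_bytes


app = falcon.App()
app.add_route("/synthesize", TTSResource())
```

Such an app can then be served with any WSGI server (e.g., gunicorn).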
As a use case, we use a modified version of Deep Convolutional Text-to-Speech (DCTTS) [5], a state-of-the-art Deep Learning Sequence-to-Sequence (seq2seq) model whose output is controllable through a latent space designed to represent variations in voice style, as described in [6].

A TTS seq2seq model typically consists of an encoder-decoder structure. Text is encoded as a latent representation that is then decoded with an attention-based decoder to predict a mel-spectrogram, later inverted to an audio waveform.

To obtain a voice style representation, a mel-spectrogram encoder is added in [6]. It consists of a stack of 1D convolutional layers followed by an average pooling. This operation ensures that the resulting representation is time-invariant. It can thus contain information about statistics of prosody, such as average pitch or average speaking rate, but not a pitch evolution over time.

The interface contains a 2D representation of a latent space, which is an internal representation of the data distribution learned by the network. This 2D representation is obtained via a dimensionality reduction applied to the high-dimensional latent space of the system. The interface also contains a text box for the system's input and an audio player for the system's output. The latent space represents the distribution of some controlling parameters of the output speech (the expressiveness, for instance) and is obtained after training.

Figure 2: ICE-Talk web interface. It consists of a text field, an image representing a latent space of the vocal variability contained in the dataset, and an audio player to listen to the synthesized utterance.

By writing a text and clicking on a point in the 2D space, an audio signal is generated with the parameter values corresponding to the clicked point. The web interface is implemented in HTML5 and JavaScript to use the service.

There are several possibilities for the dimensionality reduction: UMAP, PCA, or t-SNE. The mouse click is detected in pixel coordinates using JavaScript and mapped to the reduced data space. Nearest-neighbour regression is then used to find the closest 2D data point, and a lookup table gives the corresponding 8D point of the latent space. The text and the 8D vector are fed to the model, which generates the sentence and saves it into a WAV file. The WAV file is then served and played as an HTML5 audio object.
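To make the style-encoder idea concrete, here is a minimal PyTorch sketch of a stack of 1D convolutions followed by average pooling over time. The channel sizes, kernel sizes, and layer count are illustrative assumptions, not the exact architecture of [6]; only the 8-dimensional output matches the latent space dimension used here.

```python
# Minimal sketch of a style encoder of the kind described above:
# 1D convolutions over mel-spectrogram frames, then average pooling
# over time so that the output is time-invariant. Layer sizes are
# illustrative, not the architecture of [6].
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    def __init__(self, n_mels=80, latent_dim=8):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, mel):
        # mel: (batch, n_mels, time)
        h = self.convs(mel)
        # Averaging over the time axis discards temporal evolution
        # and keeps only utterance-level prosody statistics.
        return h.mean(dim=2)  # (batch, latent_dim)
```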
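The click-to-latent pipeline can likewise be sketched in a few lines. The sketch below assumes the 8D latent vectors of the training utterances are available as a NumPy array; the file name, the linear pixel-to-data mapping, and the helper clicked_point_to_latent are hypothetical, and PCA stands in for any of the three reduction methods mentioned above.

```python
# Sketch of mapping a mouse click back to an 8D latent vector.
# Assumes the training utterances' latent vectors are precomputed;
# names and the pixel mapping are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

latents_8d = np.load("latents.npy")  # (n_utterances, 8), assumed file

# Dimensionality reduction (PCA here; UMAP or t-SNE work the same way).
reducer = PCA(n_components=2)
latents_2d = reducer.fit_transform(latents_8d)

# Nearest-neighbour index over the reduced space; the "lookup table"
# is simply the row correspondence between latents_2d and latents_8d.
nn_index = NearestNeighbors(n_neighbors=1).fit(latents_2d)


def clicked_point_to_latent(x_px, y_px, width_px, height_px):
    """Map a mouse click (pixel coordinates) to an 8D latent vector."""
    # Linearly map pixels to the bounding box of the reduced data
    # (pixel y grows downward, hence the flip).
    mins, maxs = latents_2d.min(axis=0), latents_2d.max(axis=0)
    point_2d = mins + np.array([x_px / width_px,
                                1 - y_px / height_px]) * (maxs - mins)
    _, idx = nn_index.kneighbors(point_2d.reshape(1, -1))
    return latents_8d[idx[0, 0]]
```

The returned 8D vector and the input text are then fed to the TTS model, which writes the generated waveform to a WAV file served by the web service.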
4. Conclusions and Future Works
We presented ICE-Talk, an innovative interface for Controllable Expressive TTS that demonstrates a proof of concept of research results previously presented at Interspeech. It is open source and ready to be used with available pre-trained models. As future work, other available models, such as a multi-speaker TTS, could be integrated in order to generate speech from different speakers, based not on reference audio but on a latent space describing speaker characteristics.
5. Acknowledgements
Noé Tits is funded through a FRIA grant (Fonds pour la Formation à la Recherche dans l'Industrie et l'Agriculture, Belgium).
6. References

[1] N. Tits, K. El Haddad, and T. Dutoit, "The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach," in Human-Computer Interaction. IntechOpen, 2019. [Online]. Available: http://dx.doi.org/10.5772/intechopen.89849

[2] N. Tits, "A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech - a Deep Learning Approach," 2019, pp. 1-5.

[3] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno et al., "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis," arXiv preprint arXiv:1806.04558, 2018.

[4] C. Jemine et al., "Master thesis: Automatic multispeaker voice cloning," 2019.

[5] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in ICASSP. IEEE, 2018, pp. 4784-4788.

[6] N. Tits, F. Wang, K. El Haddad, V. Pagel, and T. Dutoit, "Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis," in Proc. Interspeech, 2019.