Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning
Noé Tits, Kevin El Haddad, Thierry Dutoit
Numediart Institute, University of Mons
{noe.tits, kevin.elhaddad, thierry.dutoit}@umons.ac.be

Abstract
Despite the growing interest in expressive speech synthesis, the synthesis of nonverbal expressions is an under-explored area. In this paper we propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system. We leverage transfer learning by training a deep learning model to learn to generate both speech and laughs from annotations. We evaluate our model with a listening test, comparing its performance to an HMM-based laughter synthesis system, and show that it reaches higher perceived naturalness. Our solution is a first step towards a TTS system able to synthesize speech with control over the amusement level through laughter integration.
Index Terms: laugh synthesis, speech synthesis, TTS, deep learning, transfer learning
1. Introduction and Motivations
Given the progress in speech technologies and Human-Agent Interactions (HAI), several applications of voice assistants and virtual agents have been developed. These applications are evolving towards breaking the barrier between robot-sounding synthetic voices and human-like conversations. One of the under-explored domains is the synthesis of nonverbal conversational expressions, particularly laughter. Laughter is an important component of speech and daily interactions. It has been shown to be a very frequent cross-cultural expression in conversations, to communicate emotions and to have conversational and social functions [1, 2, 3].

Laughter can be expressed in many different ways and is particular to each individual. It is therefore rather difficult to collect naturalistic genuine laughter in an acoustically clean environment. This is the main reason why the resources available for synthesis purposes are rather limited compared to speech.

In this paper we present a deep learning-based laughter synthesis system. This work is part of a larger project aiming at synthesizing laughter alongside speech in HAI systems. An initial system was trained from scratch using the same deep learning-based model described here and only laughter data. The results obtained were not satisfying, probably due to the limited amount of data available for training. This motivated the use of speech data as well. Indeed, although different, speech and laughs share common sound characteristics, since laughs are sequences of fricatives, vowel sounds and breathing. We therefore leverage the knowledge learned by TTS systems trained on speech in order to improve laughter synthesis.

The paper is organized as follows: related work is summarized in Section 2; Section 3 presents the datasets involved in this work; Section 4 describes the proposed system for audio laughter synthesis; the procedure of the perceptive evaluation is described in Section 5; the results of the evaluation are presented and discussed in Section 6; finally, we conclude and detail our plans for future work in Section 7. Most of the resources used here are accessible at https://github.com/numediart/LaughterSynthesis.
2. Related Work
The techniques studied for laughter synthesis have generally followed those used for speech synthesis. In this work we do the same, for two main reasons: first, the signals share many common characteristics; second, using TTS systems allows us to incorporate our laughter synthesis system into a fully functioning TTS system and thus achieve our ultimate goal of TTS with control over amusement levels.

Speech synthesis methods can be grouped into three main categories: synthesis by concatenation, parametric synthesis and statistical parametric synthesis [4]. Among the few studies on laughter synthesis, the first attempts included techniques like synthesis by diphone concatenation [5], parametric synthesis and a mass-spring approach [6]. Hidden Markov Model (HMM)-based approaches were then introduced to laughter synthesis due to their wide use in speech synthesis at the time [7]. In our previous work [8], we used an HMM-based approach to synthesize laughter alongside amused speech. Synthetic laughs were obtained by training HMM systems on laughter data. The amused speech was obtained by training HMMs on another speaker's neutral speech data and adapting them to a smaller dataset containing speech from the speaker from whom the laughs were recorded (due to the limited amount of data available for that speaker).

However, deep learning-based laughter synthesis has been little explored so far. A synthesis approach based on WaveNet was recently proposed by [9]. WaveNet [10] is an autoregressive CNN synthesizing audio sample by sample from features, typically linguistic features for TTS or acoustic features for vocoding. In their application to laughter synthesis, the authors conditioned the WaveNet model on information about the inhalation/exhalation sequence and on duration parameters and power contours predicted by an HMM model. This approach therefore still relies on previous HMM approaches for part of the information.

Given the breakthrough of sequence-to-sequence (seq2seq) approaches in speech synthesis, we propose an adapted approach for audio laughter synthesis. Along with the synthesis quality, the method proposed in this paper also offers control over the specific sound sequences to be generated rather than only syllable-level control. This allows the flexibility of choosing the specific sequence of voiced (vowel) and unvoiced sounds to be generated. Given the aforementioned goal of obtaining a fully functioning TTS system generating laughter alongside speech, another advantage is the ability not only to synthesize naturalistic human-like laughs, but also to do so in a speech context. This will allow a later integration into a fully functioning speech and laugh synthesis system, as planned.

Figure 1: Block diagram of the proposed method for model adaptation. The statistical parametric model is based on DCTTS and described in Section 4.1. The text input shown in the figure is symbolic; its detailed format is described in Section 3.

Table 1: Quantity of laughs per vowel context.

Vowel                 [a]   [e]   [i]   Total
Number of samples      54    33    25     112
Duration (sec)        101    63    38     202
3. Dataset
In order to apply the transfer learning approach described above, the data used are formed of subsets of a proprietary dataset recorded by Acapela and of the AmuS dataset [11].

Acapela's dataset was recorded to build a narration framework to construct book recordings from transcriptions. It contains phonetically rich sentences uttered by a male actor in US English. The actor was asked to utter a set of the sentences in 8 style classes. For the purpose of this work, only the audio recordings of the neutral style were kept, along with the corresponding transcriptions, for a total of 150.50 minutes (3299 utterances) of speech data.

The AmuS dataset contains recordings of amused speech components such as smiled speech, laughs and speech-laughs. For this work, we chose the speaker with the largest amount of recorded laughs: SpkB. In this dataset, the purpose of the laughs was for them to be inserted into speech in order to create an amused effect. This suits well the goal of generating laughs alongside speech mentioned above.

In order to record these, the subject was asked to watch stimuli of funny content while sustaining the sound of a vowel, until eventually laughter occurred, naturally interrupting the vowel. This allowed us to collect laughs with transitions from and to vowels. We thus have at our disposal laughs occurring in three vowel contexts: [a], [e], [i] (French IPA symbols). Table 1 breaks down the amount of data available.

Isolated laughs are sequences of voiced and unvoiced sounds. In AmuS, the laughs were segmented and each segment was given a label corresponding to a voiced or unvoiced category. These label sequences and the corresponding laughter audio signals were used to train our systems.
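To make the annotation scheme concrete, the following sketch shows what a training pair might look like once a laugh is reduced to a voiced/unvoiced label sequence. The label names, file path and symbol mapping are hypothetical illustrations, not the actual AmuS annotation format.

```python
# Hypothetical illustration of a laughter training pair: a sequence of
# voiced/unvoiced segment labels aligned with the corresponding audio file.
# Label names and paths are invented for illustration; the real AmuS
# annotations may use a different inventory and file layout.

laugh_example = {
    "audio": "amus/SpkB/laugh_0042.wav",           # hypothetical path
    "labels": ["unvoiced", "voiced", "unvoiced",    # exhalation / vowel-like burst / ...
               "voiced", "unvoiced", "voiced"],
}

def labels_to_symbols(labels):
    """Map segment labels to the single-character symbols fed to the seq2seq
    model's text encoder (symbol set is an assumption)."""
    mapping = {"voiced": "V", "unvoiced": "U"}
    return " ".join(mapping[label] for label in labels)

print(labels_to_symbols(laugh_example["labels"]))   # -> "U V U V U V"
```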
4. Seq2seq Audio Laughter Synthesis
Nowadays, one of the major techniques for text-to-speech synthesis is deep learning architectures based on the sequence-to-sequence (seq2seq) principle. These consist of an encoder-decoder setup with an interface between the two components, called the attention mechanism, whose role is to model the alignment between input and output sequences. Well-known seq2seq TTS systems are Tacotron [12], Char2wav [13] and DCTTS [14]. In this work we adapted the DCTTS model for audio laughter synthesis, using an open implementation available online (https://github.com/CSTR-Edinburgh/ophelia).

4.1. Seq2seq model adaptation

In this work, the input sequence is composed of speech phonemes and the laughter annotations described in Section 3. We use Festival [15] to extract phones from the transcriptions of the Acapela dataset and of the speech part of the AmuS dataset. The output sequence is a mel-spectrogram. A second part of DCTTS, trained separately, reconstructs a full-resolution magnitude spectrogram from the mel-spectrogram, which is then inverted to a waveform using the Griffin-Lim algorithm [16]. For more details about these components, see [14].

A first text-to-speech and laughter model is trained using the speech from the Acapela dataset, chosen for the quantity of data it provides, along with the laughs and smiled speech from speaker SpkB of the AmuS dataset. This system is then fine-tuned with smiled speech and laughter, both coming from the same AmuS speaker (SpkB). This was done with the perspective of integrating this system into a fully functioning TTS system with control over amusement, as mentioned previously. This adaptation technique was previously tested on emotional speech and showed promising results in [17], which motivated its use in this work.

Figure 1 shows a block diagram of the procedure proposed for model adaptation and waveform correction with MelGAN.

4.2. Waveform correction with MelGAN

To generate the waveform from acoustic features, it has been shown that neural audio synthesizers achieve better quality in terms of naturalness [10, 18]. However, it is generally a challenge to design and optimize such models efficiently enough to reach the results reported in the literature. They also often lose generalization properties compared to signal-processing-based vocoders, as they are often speaker dependent. MelGAN [19] is a recently proposed model that tackles the problems of efficiency and generalization across speakers. The model is non-autoregressive, fully convolutional and smaller than previous ones.

In this paper we use MelGAN as a waveform corrector. The laughter waveform is first synthesized by the system described in Section 4.1. This waveform contains artifacts due to the Griffin-Lim estimation. We then apply analysis and synthesis with MelGAN to obtain a corrected laughter waveform that, as we show, is perceived as more natural than when MelGAN is not used.
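As a rough illustration of the two-stage transfer-learning recipe (pretraining on the large speech corpus plus SpkB's laughs, then fine-tuning on SpkB's amused material only), the sketch below uses a tiny stand-in acoustic model in PyTorch. The model, data, checkpoint names and hyperparameters are all placeholders; the actual system uses the DCTTS implementation linked above, whose training scripts differ.

```python
import torch
import torch.nn as nn

N_SYMBOLS, N_MELS = 64, 80            # symbol inventory size / mel bins (assumed)

class TinyAcousticModel(nn.Module):
    """Stand-in for the seq2seq acoustic model: input symbols -> mel frames."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_SYMBOLS, 128)
        self.rnn = nn.GRU(128, 128, batch_first=True)
        self.proj = nn.Linear(128, N_MELS)

    def forward(self, symbols):
        hidden, _ = self.rnn(self.embed(symbols))
        return self.proj(hidden)

def train(model, batches, lr, steps):
    """Generic training loop with an L1 reconstruction loss on mel frames."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(steps):
        symbols, mels = batches[step % len(batches)]
        loss = nn.functional.l1_loss(model(symbols), mels)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Fake batches standing in for (phoneme / laugh-annotation sequence, mel target).
speech_batches = [(torch.randint(0, N_SYMBOLS, (8, 50)), torch.randn(8, 50, N_MELS))]
laugh_batches  = [(torch.randint(0, N_SYMBOLS, (4, 30)), torch.randn(4, 30, N_MELS))]

model = TinyAcousticModel()
train(model, speech_batches, lr=1e-4, steps=100)      # stage 1: speech + laughs
torch.save(model.state_dict(), "pretrained.pt")

model.load_state_dict(torch.load("pretrained.pt"))    # stage 2: fine-tune on SpkB
train(model, laugh_batches, lr=2e-5, steps=20)
torch.save(model.state_dict(), "finetuned_spkB.pt")
```

The key design choice illustrated here is that fine-tuning only resumes from the pretrained weights and continues training on the small amused-speech subset, typically with a lower learning rate.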
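The waveform generation chain (Griffin-Lim inversion of the predicted magnitude spectrogram, followed by a MelGAN analysis/synthesis pass used as a corrector) could be sketched as follows. The STFT/mel parameters, the log compression and the TorchScript checkpoint path are assumptions for illustration; the exact settings in the paper follow the DCTTS and MelGAN reference setups.

```python
import numpy as np
import librosa
import soundfile as sf
import torch

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80   # assumed analysis parameters

def griffin_lim_decode(mag_spec):
    """Invert a linear magnitude spectrogram (1 + n_fft//2, frames), as
    predicted by the second DCTTS network, with the Griffin-Lim algorithm."""
    return librosa.griffinlim(mag_spec, n_iter=60, hop_length=HOP, win_length=N_FFT)

def melgan_correct(wave, generator):
    """Analysis/synthesis pass: re-extract a mel-spectrogram from the
    Griffin-Lim waveform and re-synthesize it with a pretrained MelGAN
    generator to reduce Griffin-Lim artifacts."""
    mel = librosa.feature.melspectrogram(y=wave, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=N_MELS)
    log_mel = np.log(np.clip(mel, 1e-5, None))            # log compression (assumed)
    with torch.no_grad():
        inp = torch.from_numpy(log_mel).float().unsqueeze(0)  # (1, n_mels, frames)
        corrected = generator(inp).squeeze().numpy()
    return corrected

# Usage: `mag_spec` would come from the seq2seq system; the TorchScript
# checkpoint path below is hypothetical.
generator = torch.jit.load("melgan_multispeaker.pt")
mag_spec = np.abs(np.random.randn(1 + N_FFT // 2, 200)).astype(np.float32)  # placeholder
rough = griffin_lim_decode(mag_spec)
clean = melgan_correct(rough, generator)
sf.write("laugh_corrected.wav", clean, SR)
```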
5. Perception Test
To evaluate the obtained results we set up a perception test using a Mean Opinion Score (MOS) protocol. We gathered samples from the following methods:

• Method 1: original laughter samples from the AmuS dataset (SpkB), which serve as a top-line of naturalness for the other methods.
• Method 2: laughs synthesized with an HTS system equivalent to the one used in [8].
• Method 3: the seq2seq model described in Section 4.1 (seq2seq-GL).
• Method 4: the same seq2seq model as Method 3 followed by the proposed MelGAN waveform correction (seq2seq-MelGAN).

The MOS test focused on evaluating the naturalness of the synthesized samples of the different methods. It was implemented as a web experiment with Turkle (https://github.com/hltcoe/turkle), an open-source web server with which one can host a crowdsourcing application locally.

A total of 71 samples were presented to 24 participants in a random order (the participant characteristics are detailed in Table 2). They were asked to rate each sample in terms of naturalness on a 5-point Likert scale with the following labels: very unnatural (score 1), unnatural (2), fairly natural (3), natural (4) and very natural (5). They could listen to each sample as many times as needed and could stop at any point of the test.

The definition of naturalness for speech and laughter could be interpreted in different ways during evaluation. For example, a participant could have rated the acting quality instead of the human-likeness of the sound perceived during the listening test. Indeed, "not natural" can be perceived as "fake" or "simulated" instead of "synthetic". But in order not to deviate from previous work and evaluations that used the word "natural", we preferred using it and specifying what we mean by it. That is why we added an explanation of the meaning of "natural" in the question asked, focusing on the definition of "human-likeness" for "natural".

Table 2: Number of participants by gender and age range (in years).

            Female   Male   Sum
[20,40[          7     13    20
[50,65[          1      3     4
Sum              8     16    24
Table 3: Number of collected ratings, MOS scores and their standard deviation for each method.
We also asked some of the participants, at the end of the test, to comment on what mainly influenced their choices. This serves as our qualitative evaluation.
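For clarity, aggregating the collected ratings into the per-method MOS and standard deviation reported in the next section amounts to the following; the rating records in this sketch are invented placeholders, not actual responses from the test.

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented placeholder ratings: (method, score on the 5-point Likert scale).
ratings = [
    ("original", 5), ("original", 4),
    ("HMM", 2), ("HMM", 3),
    ("seq2seq-GL", 3), ("seq2seq-GL", 2),
    ("seq2seq-MelGAN", 4), ("seq2seq-MelGAN", 3),
]

by_method = defaultdict(list)
for method, score in ratings:
    by_method[method].append(score)

# MOS is simply the mean rating per method; the spread is its standard deviation.
for method, scores in by_method.items():
    print(f"{method}: n={len(scores)}, MOS={mean(scores):.2f}, sd={stdev(scores):.2f}")
```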
6. Results
In this section we present the results obtained from the perception test. They are divided into quantitative and qualitative analyses. The former presents the scores obtained along with their interpretation, while the latter reports on comments made by the participants on what influenced their choices.
6.1. Quantitative analysis

A total of 1696 answers were collected. Table 3 gathers the number of ratings for each method, the resulting MOS scores and their standard deviation. Figure 2 shows the distributions of the ratings as boxplots. Figure 3 shows the percentage of scores chosen for each method.

Figure 2: Boxplots of the score distributions of the different methods. The green lines correspond to the Mean Opinion Scores while the red lines show the median values of the scores.

Figure 3: Score distributions of the different methods. The scale ranges from 1: very unnatural to 5: very natural.
The original samples of the dataset reach a MOS score below 5, as listeners do not rate all original samples as perfect. The variance is quite high, close to one for all methods, which was also the case in [7].

We can clearly see from the results obtained that our seq2seq-MelGAN system outperformed both the seq2seq-GL and the HMM-based systems. Both seq2seq-GL and the HMM-based system obtained similar MOS. We interpret these results by attributing the loss in MOS of the latter two systems to the distortions found in the seq2seq-GL laughs and to the more robotic effect found in the HMM-generated laughs. We base this interpretation on previously reported results, as detailed in what follows, but also on our own observations, not formally tested yet, and on those reported by the MOS test participants, as detailed in Section 6.2. This shows that the seq2seq model is efficient at synthesizing laughter, but that the question asked in the MOS test to grade naturalness (human-likeness), combined with the distortion generated by the Griffin-Lim algorithm, degraded the grades obtained in this test.

Although MOS tests are never executed exactly the same way, it is always interesting to analyze and compare the results obtained here with the results of previous similar experiments found in the literature. This also partly backs our interpretation of the results. In [19], the authors compared the MelGAN vocoder to the Griffin-Lim algorithm for seq2seq TTS. They reported the Griffin-Lim algorithm to be responsible for a large part of the distortion, leading to a loss of 2.95 MOS points compared to original samples. MelGAN, on the other hand, offered a gain of 1.77 MOS points over the Griffin-Lim algorithm. In this work, the MelGAN waveform correction offered a gain of 0.78 MOS points. It is important to highlight that training MelGAN directly on generated spectrograms would likely improve our results; this was not done in this study but will be part of future work. In [7], the authors compared different variants of HMM-based laughter synthesis. They show that the distortion caused by the vocoder in the copy-synthesis samples amounts to 0.8 MOS points compared to original samples. Their best synthesis solution is 0.6 points below that, and therefore 1.4 points below original samples. In this paper, the HMM approach is 1.46 points below original samples, which is close to their results. In [20], the authors confirm that HMM-based laughter synthesis has significantly lower quality than copy-synthesis with several vocoders.
6.2. Qualitative analysis

A part of the participants were asked to comment on what, according to them, influenced their choices when rating the laughs during the test. In this section we summarize the obtained qualitative results. We report these comments in order to shed more light on the results obtained from the quantitative test. Here is a list of what some of the participants reported:

• The duration seemed to be an important parameter, as some participants seemed to have based their choice on it. They reported that some laughs were too long or too short to be natural. Also, some participants found short laughs to be ambiguous on a naturalness scale, as if they were focused on telling the fake laughs from the real ones instead of solely grading how natural they perceived the laugh they were listening to.

• Laughs with varying pitch and duration seemed to be perceived as more natural, as opposed to laughs with a monotonous prosody. Some laughs were described as "sounding like a repeated sequence" and were perceived as robotic (these would correspond to the HMM-generated laughs), whereas laughs containing more randomness were perceived as more natural. Similarly, laughs whose loudness was fading (decreasing until the end) were perceived as more natural than laughs with a monotonous loudness level and an abrupt ending.

Apart from these main parameters influencing grading decisions during the MOS test, we note how important the choice and phrasing of the question is for subjective evaluations such as these.
7. Conclusions and Future Work
In this paper, a new approach to audio laughter synthesis based on seq2seq learning was proposed, inspired by the evolution of the TTS field. The system is implemented by leveraging the patterns learned when mapping text to acoustic features in speech in order to learn laughter synthesis. We also use a pretrained MelGAN model as a waveform corrector, which removes audio artifacts generated by the Griffin-Lim algorithm and thus improves the scores obtained in a MOS test.

We believe several modifications could further improve the acoustic quality of the synthesis. First, end-to-end training could help with the accumulation of errors across the several blocks: the seq2seq system and the vocoder.

This work results in a strong improvement over past methods of audio laughter synthesis (including our own [8]) in terms of naturalness and is promising for the later construction of amused speech synthesis systems. The promising results obtained here allow us to work on incorporating the laughter synthesis system into a fully functioning TTS system with control over the amusement level. The fact that our laughter synthesis system was developed in a TTS context makes this integration easier.
8. Acknowledgements
Noé Tits is funded through a FRIA grant (Fonds pour la Formation à la Recherche dans l'Industrie et l'Agriculture, Belgium).

9. References

[1] K. Laskowski and S. Burger, "Analysis of the occurrence of laughter in meetings," in Eighth Annual Conference of the International Speech Communication Association, 2007.
[2] L. Devillers and L. Vidrascu, "Positive and negative emotional states behind the laughs in spontaneous spoken dialogs," in Interdisciplinary Workshop on the Phonetics of Laughter, 2007, p. 37.
[3] M. Soury and L. Devillers, "Smile and laughter in human-machine interaction: a study of engagement," in LREC, 2014, pp. 3633–3637.
[4] N. Tits, K. El Haddad, and T. Dutoit, "The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach," in Human-Computer Interaction. IntechOpen, 2019. [Online]. Available: http://dx.doi.org/10.5772/intechopen.89849
[5] E. Lasarcyk and J. Trouvain, "Imitating conversational laughter with an articulatory speech synthesizer," in Proc. Interdisciplinary Workshop on the Phonetics of Laughter, 2007.
[6] S. Sundaram and S. Narayanan, "Automatic acoustic synthesis of human-like laughter," The Journal of the Acoustical Society of America, vol. 121, no. 1, pp. 527–535, 2007.
[7] J. Urbain, H. Çakmak, and T. Dutoit, "Evaluation of HMM-based laughter synthesis," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7835–7839.
[8] K. El Haddad, S. Dupont, J. Urbain, and T. Dutoit, "Speech-laughs: An HMM-based Approach for Amused Speech Synthesis," in International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), Brisbane, Australia, 2015, pp. 4939–4943.
[9] H. Mori, T. Nagata, and Y. Arimoto, "Conversational and Social Laughter Synthesis with WaveNet," in Proc. Interspeech 2019, 2019, pp. 520–523. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2131
[10] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," in SSW, 2016.
[11] K. El Haddad, I. Torre, E. Gilmartin, H. Çakmak, S. Dupont, T. Dutoit, and N. Campbell, "Introducing AmuS: The Amused Speech Database," in Statistical Language and Speech Processing, N. Camelin, Y. Estève, and C. Martín-Vide, Eds. Cham: Springer International Publishing, 2017, pp. 229–240.
[12] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards End-to-End Speech Synthesis," in INTERSPEECH, 2017.
[13] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2wav: End-to-end speech synthesis," ICLR 2017 workshop submission, 2017.
[14] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4784–4788.
[15] A. W. Black, P. Taylor, and R. Caley, "The Festival speech synthesis system: system documentation," 1997.
[16] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[17] N. Tits, K. El Haddad, and T. Dutoit, "Exploring Transfer Learning for Low Resource Emotional TTS," in Intelligent Systems and Applications, Y. Bi, R. Bhatia, and S. Kapoor, Eds. Cham: Springer International Publishing, 2020, pp. 52–60.
[18] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in International Conference on Machine Learning (ICML), 2018, pp. 2410–2419.
[19] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems, 2019, pp. 14881–14892.
[20] B. Bollepalli, J. Urbain, T. Raitio, J. Gustafson, and H. Cakmak, "A comparative evaluation of vocoding techniques for HMM-based laughter synthesis," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.