SPEAK WITH YOUR HANDS: Using Continuous Hand Gestures to Control an Articulatory Speech Synthesizer
Pramit Saha, Debasish Ray Mohapatra, Sidney Fels
HCT Lab, Department of Electrical and Computer Engineering, University of British Columbia
[email protected], [email protected], [email protected]
Abstract
This work presents our advancements in controlling an articulatory speech synthesis engine, viz., Pink Trombone, with hand gestures. Our interface translates continuous finger movements and wrist flexion into continuous speech using vocal tract area-function based articulatory speech synthesis. We use a CyberGlove II with 18 sensors to capture the kinematic information of the wrist and the individual fingers, in order to control a virtual tongue. The coordinates and the bending values of the sensors are then used to fit a spline tongue model that smooths out the noisy values and outliers. Treating the upper palate as fixed and the spline model as the dynamically moving lower surface (tongue) of the vocal tract, we compute 1D area-function values that are fed to the Pink Trombone, generating continuous speech sounds. Therefore, by learning to manipulate one's wrist and fingers, one can learn to produce speech sounds through one's hands alone, without the need for using the vocal tract.
Keywords: articulatory speech synthesizer, Pink Trombone, CyberGlove II, hand gestures, continuous vowel synthesis, kinematics-to-articulatory mapping, silent speech interface
1. Introduction
Articulatory speech synthesis (Saha et al. 2018) encompasses the production of speech sounds using a vocal tract model and simulating the movements of speech articulators such as the tongue, lips, and velum. Articulatory vocal tract modelling targets the simulation of the process of speaking by recreating the behaviour of the human speech apparatus.

Among the different articulators, the tongue is the most dynamic as well as the most significant part of the vocal tract. Hence, in this work, we attempt to control tongue movements via wrist and finger movements, targeting vocal sound synthesis. Moving the tongue alters the anterior part of the upper airway, thereby modulating sound propagation through it. The goal is to develop a convenient, easy-to-learn, and intuitive physical interface with improved control, leveraging the multi-DOF capability of the hand.
2. Related previous works
Recently, we have carried out three projects to synthesize speech from hand movements through articulatory and acoustic pathways. These three interfaces developed the background for the current project and hence are worth mentioning here.

The first was an interface named SOUND STREAM, involving five degree-of-freedom (DOF) mechanical control of a two-dimensional, mid-sagittal, spring-based, human-tongue-like structure for articulatory speech synthesis. As a demonstration of the project, the user was able to learn to produce a range of sounds by varying the shape and position of the upper surface of the tongue-like structure in 2D space through a set of three sliders mounted on a movable platform. The magnitude and frequency of the glottal excitation were controlled physically by two additional sliders. This arrangement allowed the user to play with five sliders to vary the articulatory structures as well as the source acoustic parameters, exploring the variation of sounds.

The second version of the project was another interface for articulatory speech synthesis, named SOUND STREAM II (Saha et al. 2018), involving four-DOF mechanical control of a two-dimensional, mid-sagittal tongue through a biomechanical toolkit called ArtiSynth and a sound synthesis engine called JASS. As a demonstration of the project, the user learnt to produce a range of JASS vocal sounds by varying the shape and position of the ArtiSynth tongue in 2D space through a set of four muscle excitors modelled using force-based sensors. This variation was computed in terms of area functions in ArtiSynth and communicated to the JASS audio synthesizer, coupled with a two-mass glottal excitation model, to complete the end-to-end gesture-to-sound mapping.

The goal of the third project was to develop a formant-based vowel sound synthesizer (Liu et al. 2019) using the CyberGlove as an input device to map continuous hand gestures (wrist flexion and extension; finger abduction and adduction) to English vowels. The interface enabled the user to control wrist and finger movements (in a 1D + 1D control space) in order to synthesize a continuous vowel space (using the first and second formants) easily and intuitively. This was motivated by the implementation of another adaptive speech interface, named Glove-Talk II (Fels and Hinton 1998), which achieved a neural-network-based mapping between continuous hand gestures and the control parameters of a formant-based speech synthesizer.
3. Data collection
We use a CyberGlove II, manufactured by Immersion Inc., with 18 sensors (3 on each finger and 3 on the wrist), to capture hand movements. During the experiment, the arm of the participant is kept fixed at all times. The participant is asked to perform wrist flexion and extension in order to decrease and increase the radius 'R', as well as radial and ulnar deviation in order to change the angle 'θ', as shown in Fig. 1. This is captured by the wrist sensors. Further, the tip sensor of each of the 5 fingers is used to collect data on modulation of the shape of the tongue articulator.

[Figure 1: The proposed Pink Trombone control paradigm]

In the base position, the wrist is kept horizontal with respect to the arm, which is mapped to the mid-radius location in the vowel triangle in Fig. 1. Wrist flexion implies an increase in jaw opening and tongue flattening, whereas wrist extension implies a decrease in jaw opening and tongue elevation. The radial and ulnar wrist deviations correspond respectively to retraction and protrusion of the tongue. The wrist is mainly used to make the front and back as well as the high and low vowel sounds.

Now, in order to accommodate a few other changes in the tongue shape, we use the finger movements. In the base tongue shape, the fingers remain stationary in a gesture of gripping a ball, as shown in Fig. 1. The fingers then undergo extension accordingly, in order to change the tongue shape in a desired manner. For example, elevating the pinky finger to the correct degree can bring about alveolar fricatives like /s/ and /z/, while a particular elevation of the ring finger can produce palato-alveolar fricatives, the correct elevation of the thumb can produce velar plosives like /k/ and /g/, and so on.
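As a rough illustration of the wrist-to-control-space mapping, the sketch below converts raw wrist-sensor bend values into the polar parameters (R, θ) of Fig. 1. The sensor calibration ranges and the linear mapping are our assumptions for illustration only, not the calibrated values used in the actual system.

```python
import numpy as np

# Hypothetical calibration ranges for the two wrist sensors (raw glove units).
FLEXION_RANGE = (40.0, 220.0)    # wrist flexion/extension sensor (assumed)
DEVIATION_RANGE = (60.0, 200.0)  # radial/ulnar deviation sensor (assumed)

def normalize(value, lo, hi):
    """Clamp a raw sensor reading into [0, 1]."""
    return float(np.clip((value - lo) / (hi - lo), 0.0, 1.0))

def wrist_to_polar(flexion_raw, deviation_raw,
                   r_range=(0.2, 1.0), theta_range=(-np.pi / 4, np.pi / 4)):
    """Map raw wrist sensor values to the (R, theta) control space of Fig. 1.

    Flexion/extension shrinks or grows the radius R; radial/ulnar deviation
    sweeps the angle theta. A linear map is assumed for simplicity.
    """
    f = normalize(flexion_raw, *FLEXION_RANGE)
    d = normalize(deviation_raw, *DEVIATION_RANGE)
    R = r_range[0] + f * (r_range[1] - r_range[0])
    theta = theta_range[0] + d * (theta_range[1] - theta_range[0])
    return R, theta

# Example: a mid-flexion, slightly ulnar-deviated wrist posture.
print(wrist_to_polar(130.0, 150.0))
```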
4. Computation of area function
The coordinates of the upper palate remain pre-specified and static. The sensor recordings from the wrist and the five fingers are used to fit a tongue spline model that changes dynamically with the changing tongue shape and position. The difference between the coordinates on the spline surface and the fixed palate is used as the diameter to compute the 1D oral-tract area-function values, as described in the literature (Mathur and Story 2003). These area functions quantify the vocal tract shape.
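The sketch below illustrates this step under simplifying assumptions: the sensor readings are treated as (x, y) samples of the tongue surface, a smoothing spline absorbs noise and outliers, and each cross-section is approximated as a circle whose diameter is the palate-to-tongue distance. The actual midsagittal-distance-to-area conversion in Mathur and Story (2003) is more elaborate.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def area_function(tongue_x, tongue_y, palate_y, n_sections=44, smooth=0.01):
    """Compute a 1D area function from noisy tongue sensor points.

    tongue_x, tongue_y : sensor-derived tongue surface samples (front to back)
    palate_y           : callable giving the fixed palate height at position x
    n_sections         : number of tube sections to sample along the tract
    """
    # A smoothing spline absorbs sensor noise and outliers.
    spline = UnivariateSpline(tongue_x, tongue_y, s=smooth)
    x = np.linspace(min(tongue_x), max(tongue_x), n_sections)
    # The palate-to-tongue distance acts as the local tract diameter.
    diameter = np.maximum(palate_y(x) - spline(x), 1e-4)  # avoid closure/negatives
    # Circular cross-section assumption: A = pi * (d / 2)^2.
    return np.pi * (diameter / 2.0) ** 2

# Toy example: flat palate at y = 1.0, slightly arched and noisy tongue.
xs = np.linspace(0.0, 1.0, 18)
ys = 0.5 + 0.2 * np.sin(np.pi * xs) + 0.01 * np.random.randn(18)
areas = area_function(xs, ys, palate_y=lambda x: np.ones_like(x))
print(areas.round(3))
```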
5. Articulatory speech synthesis engine
The area functions are spatially sampled such that the vocal tract can be modelled as a series of adjoining cylindrical tube elements. Physics-based articulatory speech synthesizers model the acoustic wave propagation through these elements. We feed the area-function values to the articulatory speech synthesis engine named Pink Trombone (Thapen 2017), an online voice synthesis web application that presents an interactive mid-sagittal view of the human vocal tract and can be manipulated by users to simulate various vocal sounds. Among numerous methods, digital-waveguide-based acoustic wave solvers can compute acoustic wave propagation precisely and with improved time performance. This interface maps the 1D area functions to the acoustic space dynamically by forming a representation of such wave propagation. It also uses a glottal wave derived from the LF model to provide the acoustic energy.
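To make the waveguide idea concrete, the sketch below computes the reflection coefficients that a Kelly-Lochbaum-style digital waveguide derives from adjacent tube areas, together with one scattering junction. This is the generic textbook formulation, not Pink Trombone's actual source code.

```python
import numpy as np

def reflection_coefficients(areas):
    """Reflection coefficients at the junctions between adjacent tube sections.

    For sections with areas A[i] and A[i+1], the Kelly-Lochbaum junction
    reflects pressure waves with k = (A[i] - A[i+1]) / (A[i] + A[i+1]).
    """
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

def scatter(f_in, b_in, k):
    """One scattering junction for forward/backward travelling pressure waves."""
    f_out = (1 + k) * f_in - k * b_in   # transmitted forward + reflected backward wave
    b_out = k * f_in + (1 - k) * b_in   # reflected forward + transmitted backward wave
    return f_out, b_out

areas = [2.5, 2.0, 1.2, 0.8, 1.5, 3.0]   # example area function (cm^2)
print(reflection_coefficients(areas).round(3))
```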
6. Continuous vocal sound generation
The user can continuously manipulate wrist and finger movements to attain the desired tongue shape and position, which dynamically changes the area function. The ability of the Pink Trombone to continuously map the varying area functions allows the generation of a connected chain of speech sounds forming continuous utterances. This interface enables the user to generate all the vowels, oral stop consonants, affricates, and fricatives, as well as approximants. Learning to control the fingers properly over time will enable the user to continuously synthesize meaningful words and sentences from these components.
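Putting the pieces together, a minimal control loop might look like the following. Here, `read_glove`, `fit_tongue_spline`, and `send_to_synth` are hypothetical placeholders for the glove driver, the spline and area-function code of Section 4, and the bridge to the Pink Trombone, respectively; the update rate is likewise assumed.

```python
import time

def control_loop(read_glove, fit_tongue_spline, send_to_synth, rate_hz=60):
    """Glove-to-sound pipeline: sensors -> tongue spline -> area function -> synth.

    All three callables are placeholders for the actual glove driver, the
    area-function computation (Section 4), and the Pink Trombone bridge.
    """
    period = 1.0 / rate_hz
    while True:
        wrist, fingers = read_glove()              # 18 raw sensor values
        areas = fit_tongue_spline(wrist, fingers)  # 1D area function (Section 4)
        send_to_synth(areas)                       # updates the waveguide (Section 5)
        time.sleep(period)                         # hold a steady control rate
```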
7. Discussion and Conclusion
This work aims at developing a better physical user interface through a multi-degree-of-freedom control arrangement of the human wrist and fingers. It can easily be extended to control the vocal fold parameters (such as pitch), the lip movements, and nasalisation by using more DOF control strategies. This would incorporate more features into the interface and make it sound more natural.

However, one fundamental question that still remains is how to quantify the usability, effectiveness, and efficiency of the proposed interface; in other words, how to design a Fitts' task with varying levels of difficulty that will help us find an equivalent index of difficulty, or the information rate in bits/second, required by the proposed hand-gesture control paradigm. We have observed that making fricatives like /s/ and /z/ is a much harder task (requiring precise placement of the tongue tip at a particular distance vertically below the hard palate) than making stop consonants like /t/ and /d/ (where the tongue tip or body has to strike anywhere within a wider range of positions directly on the hard palate). Taking these considerations into account, we will continue to investigate how difficult it is for the wrist and finger trajectories to make similar sounds, and thereby gain better insights into designing proper experiments for a user study.
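For reference, such a Fitts-style analysis would build on the standard Shannon formulation of the index of difficulty. The sketch below computes it for hypothetical target distances and widths (e.g., the precision of tongue-tip placement), purely as an illustration of the intended evaluation, not as measured data.

```python
import math

def fitts_index_of_difficulty(distance, width):
    """Shannon formulation of Fitts' index of difficulty, in bits."""
    return math.log2(distance / width + 1.0)

def throughput(distance, width, movement_time_s):
    """Information rate in bits/second for one pointing condition."""
    return fitts_index_of_difficulty(distance, width) / movement_time_s

# Hypothetical numbers: a narrow fricative target vs. a wide stop target.
print(fitts_index_of_difficulty(10.0, 0.5))  # /s/-like precise placement: ~4.39 bits
print(fitts_index_of_difficulty(10.0, 4.0))  # /t/-like broad contact:     ~1.81 bits
```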
8. Acknowledgements
This work was funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
9. References
Fels, Sidney S. and Geoffrey E. Hinton (1998). "Glove-Talk II: a neural-network interface which maps gestures to parallel formant speech synthesizer controls". In: IEEE Transactions on Neural Networks.
Liu et al. (2019). "Mapping a Continuous Vowel Space to Hand Gestures".
Mathur, Siddharth and Brad H. Story (2003). "Vocal tract modeling: Implementation of continuous length variations in a half-sample delay Kelly-Lochbaum model". In: Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2003). IEEE, pp. 753–756.
Saha, Pramit, Debasish Ray Mohapatra, Praneeth SV, and Sidney Fels (2018). "Sound-Stream II: Towards Real-Time Gesture Controlled Articulatory Sound Synthesis". In: arXiv preprint arXiv:1811.08029.
Thapen, Neil (2017). Pink Trombone. https://dood.al/pinktrombone/