SPEAK WITH YOUR HANDS: Using Continuous Hand Gestures to Control an Articulatory Speech Synthesizer
Pramit Saha, Debasish Ray Mohapatra, Sidney Fels
HCT Lab, Department of Electrical and Computer Engineering, University of British Columbia
[email protected], [email protected], [email protected]
Abstract
This work presents our advancements in controlling an articulatory speech synthesis engine, viz., Pink Trombone, with hand gestures. Our interface translates continuous finger movements and wrist flexion into continuous speech using vocal tract area-function based articulatory speech synthesis. We use a CyberGlove II with 18 sensors to capture the kinematic information of the wrist and the individual fingers, in order to control a virtual tongue. The coordinates and the bending values of the sensors are then used to fit a spline tongue model that smooths out the noisy values and outliers. Treating the upper palate as fixed and the spline model as the dynamically moving lower surface (tongue) of the vocal tract, we compute 1D area-function values that are fed to the Pink Trombone, generating continuous speech sounds. Therefore, by learning to manipulate one's wrist and fingers, one can learn to produce speech sounds through one's hands alone, without the need for using the vocal tract.
Keywords: articulatory speech synthesizer, Pink Trombone, CyberGlove II, hand gestures, continuous vowel synthesis, kinematics-to-articulatory mapping, silent speech interface
1. Introduction
Articulatory speech synthesis (Saha et al. 2018) encompasses the production of speech sounds using a vocal tract model and simulating the movements of speech articulators such as the tongue, lips, and velum. Articulatory vocal tract modelling targets the simulation of the process of speaking by recreating the behaviour of the human speech apparatus.

Among the different articulators, the tongue is the most dynamic as well as the most significant part of the vocal tract. Hence, in this work, we attempt to control tongue movements via wrist and finger movements, targeting vocal sound synthesis. Moving the tongue alters the anterior part of the upper airway, thereby modulating sound propagation through it. The goal is to develop a convenient, easy-to-learn, and intuitive physical interface with improved control, leveraging the multi-DOF capability of the hand.
2. Related previous works
Recently, we have carried out three projects to synthesize speech from hand movements through articulatory and acoustic pathways. These three interfaces developed the background for the current project and hence are worth mentioning here.

The first was an interface named SOUND STREAM, involving five degree-of-freedom (DOF) mechanical control of a two-dimensional, mid-sagittal, spring-based, human-tongue-like structure for articulatory speech synthesis. As a demonstration of the project, the user was able to learn to produce a range of sounds by varying the shape and position of the upper surface of the tongue-like structure in 2D space through a set of three sliders mounted on a movable platform. The magnitude and frequency of the glottal excitation were controlled physically by two additional sliders. This arrangement allowed the user to play with five sliders to vary the articulatory structures as well as the source acoustic parameters, exploring the variation of sounds.

The second version of the project was another interface for articulatory speech synthesis, named SOUND STREAM II (Saha et al. 2018), involving four-DOF mechanical control of a two-dimensional, mid-sagittal tongue through a biomechanical toolkit called ArtiSynth and a sound synthesis engine called JASS. As a demonstration of the project, the user learnt to produce a range of JASS vocal sounds by varying the shape and position of the ArtiSynth tongue in 2D space through a set of four muscle excitors modelled using force-based sensors. This variation was computed in terms of area functions in ArtiSynth and communicated to the JASS audio synthesizer, coupled with a two-mass glottal excitation model, to complete the end-to-end gesture-to-sound mapping.

The goal of the third project was to develop a formant-based vowel sound synthesizer (Liu et al. 2019) using the CyberGlove as an input device to map continuous hand gestures (wrist flexion and extension; finger abduction and adduction) to English vowels. The interface enabled the user to control wrist and finger movements (in a 1D + 1D control space) in order to synthesize a continuous vowel space (using the first and second formants) easily and intuitively. This was motivated by the implementation of another adaptive speech interface, named Glove-Talk II (Fels and Hinton 1998), which achieved a neural-network-based mapping between continuous hand gestures and the control parameters of a formant-based speech synthesizer.
3. Data collection
We use a CyberGlove II, manufactured by Immersion Inc., with 18 sensors (3 on each finger and 3 on the wrist), to capture hand movements. During the experiment, the arm of the participant is kept fixed at all times. The participant is asked to perform wrist flexion and extension in order to decrease and increase the radius 'R', as well as radial and ulnar deviation in order to change the angle 'θ', as shown in Fig. 1. This is captured by the wrist sensors. Further, the tip sensor of each of the 5 fingers is used to collect data on modulation of the shape of the tongue articulator.

[Figure 1: The proposed Pink Trombone control paradigm]

In the base position, the wrist is kept horizontal with respect to the arm, which is mapped to the mid-radius location in the vowel triangle in Fig. 1. Wrist flexion implies an increase in jaw opening and tongue flattening, whereas wrist extension implies a decrease in jaw opening and tongue elevation. The radial and ulnar wrist deviations correspond respectively to retraction and protrusion of the tongue. The wrist is mainly used to make the front and back as well as the high and low vowel sounds.

Now, in order to accommodate a few other changes in the tongue shape, we use the finger movements. In the base tongue shape, the fingers remain stationary in a gesture of gripping a ball, as shown in Fig. 1. The fingers then undergo extension accordingly, in order to change the tongue shape in a desired manner. For example, elevating the pinky finger to the correct degree can bring about alveolar fricatives like /s/ and /z/, while a particular elevation of the ring finger can produce palato-alveolar fricatives, the correct elevation of the thumb can produce velar plosives like /k/ and /g/, and so on.
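As a rough illustration of the wrist-to-control-space mapping, the sketch below converts raw wrist-sensor bend values into the polar parameters (R, θ) of Fig. 1. The sensor calibration ranges and the linear mapping are our assumptions for illustration only, not the calibrated values used in the actual system.

```python
import numpy as np

# Hypothetical calibration ranges for the two wrist sensors (raw glove units).
FLEXION_RANGE = (40.0, 220.0)    # wrist flexion/extension sensor (assumed)
DEVIATION_RANGE = (60.0, 200.0)  # radial/ulnar deviation sensor (assumed)

def normalize(value, lo, hi):
    """Clamp a raw sensor reading into [0, 1]."""
    return float(np.clip((value - lo) / (hi - lo), 0.0, 1.0))

def wrist_to_polar(flexion_raw, deviation_raw,
                   r_range=(0.2, 1.0), theta_range=(-np.pi / 4, np.pi / 4)):
    """Map raw wrist sensor values to the (R, theta) control space of Fig. 1.

    Flexion/extension shrinks or grows the radius R; radial/ulnar deviation
    sweeps the angle theta. A linear map is assumed for simplicity.
    """
    f = normalize(flexion_raw, *FLEXION_RANGE)
    d = normalize(deviation_raw, *DEVIATION_RANGE)
    R = r_range[0] + f * (r_range[1] - r_range[0])
    theta = theta_range[0] + d * (theta_range[1] - theta_range[0])
    return R, theta

# Example: a mid-flexion, slightly ulnar-deviated wrist posture.
print(wrist_to_polar(130.0, 150.0))
```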
4. Computation of area function
The coordinates of the upper palate remain pre-specified and static. The sensor recordings from the wrist and the five fingers are used to fit a tongue spline model that changes dynamically with the changing tongue shape and position. The difference between the coordinates on the spline surface and the fixed palate is used as the diameter to compute the 1D oral-tract area-function values, as described in the literature (Mathur and Story 2003). These area functions quantify the vocal tract shape.
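The sketch below illustrates this step under simplifying assumptions: the sensor readings are treated as (x, y) samples of the tongue surface, a smoothing spline absorbs noise and outliers, and each cross-section is approximated as a circle whose diameter is the palate-to-tongue distance. The actual midsagittal-distance-to-area conversion in Mathur and Story (2003) is more elaborate.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def area_function(tongue_x, tongue_y, palate_y, n_sections=44, smooth=0.01):
    """Compute a 1D area function from noisy tongue sensor points.

    tongue_x, tongue_y : sensor-derived tongue surface samples (front to back)
    palate_y           : callable giving the fixed palate height at position x
    n_sections         : number of tube sections to sample along the tract
    """
    # A smoothing spline absorbs sensor noise and outliers.
    spline = UnivariateSpline(tongue_x, tongue_y, s=smooth)
    x = np.linspace(min(tongue_x), max(tongue_x), n_sections)
    # The palate-to-tongue distance acts as the local tract diameter.
    diameter = np.maximum(palate_y(x) - spline(x), 1e-4)  # avoid closure/negatives
    # Circular cross-section assumption: A = pi * (d / 2)^2.
    return np.pi * (diameter / 2.0) ** 2

# Toy example: flat palate at y = 1.0, slightly arched and noisy tongue.
xs = np.linspace(0.0, 1.0, 18)
ys = 0.5 + 0.2 * np.sin(np.pi * xs) + 0.01 * np.random.randn(18)
areas = area_function(xs, ys, palate_y=lambda x: np.ones_like(x))
print(areas.round(3))
```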
5. Articulatory speech synthesis engine
The area functions are spatially sampled such that the vocal tract can be modelled as a series of adjoining cylindrical tube elements. Physics-based articulatory speech synthesizers model the acoustic wave propagation through these elements. We feed the area-function values to the articulatory speech synthesis engine named Pink Trombone (Thapen 2017), an online voice synthesis web application that presents an interactive mid-sagittal view of the human vocal tract and can be manipulated by users to simulate various vocal sounds. Among numerous methods, digital-waveguide-based acoustic wave solvers can compute acoustic wave propagation precisely and with improved time performance. This interface maps the 1D area functions to the acoustic space dynamically by forming a representation of such wave propagation. It also uses a glottal wave derived from the LF model to provide the acoustic energy.
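To make the waveguide idea concrete, the sketch below computes the reflection coefficients that a Kelly-Lochbaum-style digital waveguide derives from adjacent tube areas, together with one scattering junction. This is the generic textbook formulation, not Pink Trombone's actual source code.

```python
import numpy as np

def reflection_coefficients(areas):
    """Reflection coefficients at the junctions between adjacent tube sections.

    For sections with areas A[i] and A[i+1], the Kelly-Lochbaum junction
    reflects pressure waves with k = (A[i] - A[i+1]) / (A[i] + A[i+1]).
    """
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

def scatter(f_in, b_in, k):
    """One scattering junction for forward/backward travelling pressure waves."""
    f_out = (1 + k) * f_in - k * b_in   # transmitted forward + reflected backward wave
    b_out = k * f_in + (1 - k) * b_in   # reflected forward + transmitted backward wave
    return f_out, b_out

areas = [2.5, 2.0, 1.2, 0.8, 1.5, 3.0]   # example area function (cm^2)
print(reflection_coefficients(areas).round(3))
```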
6. Continuous vocal sound generation
The user can continuously manipulate wrist and finger movements to attain the desired tongue shape and position, which dynamically changes the area function. The ability of the Pink Trombone to continuously map the varying area functions allows the generation of a connected chain of speech sounds forming continuous utterances. This interface enables the user to generate all the vowels, oral stop consonants, affricates, and fricatives, as well as approximants. Learning to control the fingers properly over time will enable the user to continuously synthesize meaningful words and sentences from these components.
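Putting the pieces together, a minimal control loop might look like the following. Here, `read_glove`, `fit_tongue_spline`, and `send_to_synth` are hypothetical placeholders for the glove driver, the spline and area-function code of Section 4, and the bridge to the Pink Trombone, respectively; the update rate is likewise assumed.

```python
import time

def control_loop(read_glove, fit_tongue_spline, send_to_synth, rate_hz=60):
    """Glove-to-sound pipeline: sensors -> tongue spline -> area function -> synth.

    All three callables are placeholders for the actual glove driver, the
    area-function computation (Section 4), and the Pink Trombone bridge.
    """
    period = 1.0 / rate_hz
    while True:
        wrist, fingers = read_glove()              # 18 raw sensor values
        areas = fit_tongue_spline(wrist, fingers)  # 1D area function (Section 4)
        send_to_synth(areas)                       # updates the waveguide (Section 5)
        time.sleep(period)                         # hold a steady control rate
```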
7. Discussion and Conclusion
This work aims at developing a better physical user interface through a multi-degree-of-freedom control arrangement of the human wrist and fingers. It can easily be extended to control the vocal fold parameters (such as pitch), the lip movements, and nasalisation by using more DOF control strategies. This would incorporate more features into the interface and make it sound more natural.

However, one fundamental question that still remains is how to quantify the usability, effectiveness, and efficiency of the proposed interface; in other words, how to design a Fitts' task with varying levels of difficulty that will help us find an equivalent index of difficulty, or the information rate in bits/second, required by the proposed hand-gesture control paradigm. We have observed that making fricatives like /s/ and /z/ is a much harder task (requiring precise placement of the tongue tip at a particular distance vertically below the hard palate) than making stop consonants like /t/ and /d/ (where the tongue tip or body has to strike anywhere within a wider range of positions directly on the hard palate). Taking these considerations into account, we will continue to investigate how difficult it is for the wrist and finger trajectories to make similar sounds, and thereby gain better insights into designing proper experiments for a user study.
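For reference, such a Fitts-style analysis would build on the standard Shannon formulation of the index of difficulty. The sketch below computes it for hypothetical target distances and widths (e.g., the precision of tongue-tip placement), purely as an illustration of the intended evaluation, not as measured data.

```python
import math

def fitts_index_of_difficulty(distance, width):
    """Shannon formulation of Fitts' index of difficulty, in bits."""
    return math.log2(distance / width + 1.0)

def throughput(distance, width, movement_time_s):
    """Information rate in bits/second for one pointing condition."""
    return fitts_index_of_difficulty(distance, width) / movement_time_s

# Hypothetical numbers: a narrow fricative target vs. a wide stop target.
print(fitts_index_of_difficulty(10.0, 0.5))  # /s/-like precise placement: ~4.39 bits
print(fitts_index_of_difficulty(10.0, 4.0))  # /t/-like broad contact:     ~1.81 bits
```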
8. Acknowledgements
This work was funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
9. References
Fels, Sidney S. and Geoffrey E. Hinton (1998). "Glove-Talk II: a neural-network interface which maps gestures to parallel formant speech synthesizer controls". In: IEEE Transactions on Neural Networks.
Liu et al. (2019). "Mapping a Continuous Vowel Space to Hand Gestures".
Mathur, Siddharth and Brad H. Story (2003). "Vocal tract modeling: Implementation of continuous length variations in a half-sample delay Kelly-Lochbaum model". In: Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2003). IEEE, pp. 753–756.
Saha, Pramit, Debasish Ray Mohapatra, Praneeth SV, and Sidney Fels (2018). "Sound-Stream II: Towards Real-Time Gesture Controlled Articulatory Sound Synthesis". In: arXiv preprint arXiv:1811.08029.
Thapen, Neil (2017). Pink Trombone. https://dood.al/pinktrombone/