A comparative study of two-dimensional vocal tract acoustic modeling based on Finite-Difference Time-Domain methods
AA comparative study of two-dimensional vocal tract acoustic modeling basedon Finite-Difference Time-Domain methods
Debasish Ray Mohapatra , Victor Zappi , Sidney Fels Electrical and Computer Engineering Department, University of British Columbia, Canada College of Arts, Media and Design, Northeastern University, USA [email protected], [email protected], [email protected]
Abstract
The two-dimensional (2D) numerical approaches for vo-cal tract (VT) modelling can afford a better balance betweenthe low computational cost and accurate rendering of acous-tic wave propagation. However, they require a high spatio-temporal resolution in the numerical scheme for a precise es-timation of acoustic formants at the simulation run-time ex-pense. We have recently proposed a new VT acoustic modellingtechnique, known as the 2.5D Finite-Difference Time-Domain(2.5D FDTD), which extends the existing 2D FDTD approachby adding tube depth to its acoustic wave solver. In this work,first, the simulated acoustic outputs of our new model are shownto be comparable with the 2D FDTD and a realistic 3D FEMVT model at a low spatio-temporal resolution. Next, a radiationmodel is developed by including a circular baffle around the VTas head geometry. The transfer functions of the radiation modelare analyzed using five different vocal tract shapes for vowelsounds /a/, /e/, /i/, /o/ and /u/.
Keywords: computational acoustics, vocal tract, FDTD, artic-ulatory speech synthesis
1. Introduction
Various articulatory models have been developed to approxi-mate the sophisticated upper vocal tract geometry and captureits acoustic features using a physics-based acoustic wave solver.The 3D acoustic analysis (Takemoto, Mokhtari, and Kitamura2010; Vampola et al. 2015) can precisely characterize the vocaltract’s spectral features while offering better geometrical flexi-bility. However, such models are not pragmatic for designingreal-time speech synthesizers due to their high computationalcost. In contrast to 3D, the 1D vocal tract models are computa-tionally lightweight. Nevertheless, their over-simplified repre-sentation of a tube structure substantially eliminates its geomet-rical details. Hence, the acoustic analysis of 1D models is notaccurate at higher formants.As an alternative, the 2D acoustic wave solvers (Arnelaand Guasch 2014; Speed, Murphy, and Howard 2009) repre-sent the complex vocal tract geometry as a 2D contour (a mid-sagittal cut of a 3D VT geometry) and capture the wave inter-action only along the mid-sagittal plane. Due to the dimension-ality reduction, the 2D models are lightweight compare to 3Dones. The most well-known 2D acoustic analysis methods forvocal tract modelling are the finite element method (FEM) andFDTD. Since 2D models do not lump off-plane wave interac-tion, direct use of cross-sectional areas from the vocal tract MRIimages to represent 2D contours inside a simulation grid yieldserroneous acoustic output. Alternatively, a nonlinear area func- tion transformation method needs to be implemented to tune theacoustic features of 2D models that can match the more com-plex 3D ones. However, this procedure kills the performanceboost of 2D models. Consequently, we have proposed a newvocal tract acoustic model (2.5D FDTD) that improves the 2Dmodel by capturing the wave interaction across the mid-sagittalplane. The next section briefly discusses the numerical imple-mentation of a 3D tube using 2D and 2.5D FDTD schemes.
2. Methodology
We implemented the FDTD numerical analysis technique to dis-cretize the 2D acoustic wave equation on a staggered Yee grid.The 2D Yee scheme consists of a rectangular grid, define acous-tic parameters (pressure ( p ( n ) ) and velocity ( v ( n ) x , v ( n ) y )) at eachgrid point, a squared 2D cell (Allen and Raghuvanshi 2015; Yee1966), as shown in Figure 1 . However, the employed techniquefor acoustic simulation does not facilitate dynamic boundaryconditions. Hence, a new scalar field ≤ β ( x, y, t ) ≤ wasintroduced to the solver, which could transit smoothly between β = 1 (air) and β = 0 (boundary). At β = 0 , a prescribed ve-locity v = v b was enforced to handle the boundary condition.Specifically, per each time step ( n ), the wave solver updatespressure component p ( n ) and velocity components ( v ( n ) x , v ( n ) y )at each grid point, by solving the below linear wave equationsin the time domain (Zappi et al. 2016): ∂p∂t + (1 − β ) p = − ρc ( ∂v x ∂x + ∂v y ∂y ) (1) β ∂ v ∂t + (1 − β ) v = − β ∇ pρ + (1 − β ) v b (2)The 2.5D FDTD follows the above rationale but improvesupon it by incorporating new impedance terms in its 2D acous-tic wave solver, known as tube depth . The tube’s depth ( D ) ,i.e., its continuous extension along the z axis (with x and y axisbeing the dimension of starting 2D scheme), is derived from thecross-sectional area of the tube and sampled at each point of thescheme as shown in Figure 1 . Then the resulting depths aremapped to their respective acoustic parameters ( p , v x , v y ) asdemonstrated by Mohapatra, Zappi, and Fels (2019). The in-clusion of tube depth allows the wave solver to apprehend waveinteractions across the mid-sagittal plane, in turn eliminatingthe nonlinear cross-sectional area functions transformation re-quirement. Therefore, it is expected that the time complexityof a 2.5D FDTD tube model will be better than the existing 2Dmodels. Basically, with the help of following discrete updaterules, we adopt a time marching algorithm (see Algorithm 1 )to sample the acoustic parameters at every grid point: a r X i v : . [ c s . S D ] F e b v x v y p ഥ𝐷 D (x) D (y) Figure 1: Representation of acoustic parameters ( p , v x , v y ) andtube depth ( ¯ D , D ( x ) and D ( y ) ) inside the 2D Yee scheme p ( n +1) = ¯ Dp ( n ) − ρc ∆ t (cid:101) ∇ · V ( n ) ¯ D (3) v ( n +1) = β v ( n ) − β ∆ t (cid:101) ∇ p ( n +1) /ρ + ∆ t (1 − β ) v b β + ∆ t (1 − β ) (4)with: V = ( D ( x ) v x , D ( y ) v y ) (5) Algorithm 1
FDTD Time-Marching algorithm
Input:
VT area function a ( x ) , audio time t
1: Initialize physical constants: air density ρ , sound speed c , boundary admit-tance coefficient µ .2: Initialize the simulation sampling rate R .3: Set the temporal resolution ( ∆ t ) and grid resolution ( ∆ s ) with R and CFLcondition.4: Create a optimized grid having size ( M × N ) .5: Set the total number of time steps T of the simulator using t and ∆ t .6: Define boundary cells with a ( x ) and normal acoustic impedance Z .7: Set depth values ( ¯ D , D ( x ) and D ( y ) ) for each grid cell from a ( x ) .8: Define source excitation cells9: Initialize source excitation velocity v e .10: Initialize acoustic components ( p , v x and v y ) for each grid cell.11: for n = 1 ...T do for i = 1 ...M do for j = 1 ...N do
14: Update p n +1 ( i, j ) with v nx ( i, j ) , v ny ( i, j ) and D ( i, j ) (Eq. 3and Eq.5)15: if ( i, j ) = excitation cell then v nx ← v nx + v ne and v ny ← v ny + v ne end if if ( i, j ) = boundary cell then v nb ← p n ( i, j ) /Z end if
21: Update v n +1 x ( i, j ) with p n +1 ( i, j ) and v nb (Eq. 4)22: Update v n +1 y ( i, j ) with p n +1 ( i, j ) and v nb (Eq. 4)23: end for end for end for To simulate vocal tract wall reflection, we adopted the local re-active boundary approach as proposed here (Yokota, Sakamoto,and Tachibana 2002; Takemoto, Mokhtari, and Kitamura 2010).The method estimates wall losses (locally reacting soft walls)and enforces boundary condition by using v w , the normal par-ticle velocity going into or coming from a wall. v w = p w Z n ˆ n (6)where p w is the current pressure value in an air cell, located infront of the wall cell. And Z n , the normal acoustic impedancecan be computed by using the normal sound absorption coeffi-cient α n as follows, (a) (b) Figure 2: (a) Simplified head geometry representation with cir-cular mouth aperture (b) Snapshot of acoustic wave propagationinside the vocal tract tube and emanating from mouth apertureat time instant t = 6 . ms for vowel /a/. Z n = ρc √ − α n − √ − α n (7)The boundary absorption coefficient α n can be derivedfrom the boundary admittance coefficient µ ∈ (0 , . Duringthe simulation, the solver combines equation (6) with equation(4) by setting v b = v w for all the velocity vectors which arepositioned in between an air cell and wall cell. However, thismethod does not comprehend frequency dependent wall losses. For acoustic analysis and speech production, articulatory mod-els require a source excitation function just above the glottal-end to input acoustic energy into the tube (Mohapatra and Fels2018). In this work, we are only characterizing the acoustic be-haviour of a vocal tract tube through its transfer function anal-ysis. Hence, a band-passed velocity pulse having frequencyrange 2Hz-20kHz was injected near the glottis for various vocaltract shapes, and the corresponding transfer functions (impulseresponse) were obtained at the mouth-end.
The outward wave propagation at the mouth-end (lips) con-tributes to acoustic energy losses in the vocal tract. It is oneof the essential dissipation mechanisms in voice production.Hence, a nonzero radiation model of a vocal tract influencesthe synthesized audio output by lowering the formant frequen-cies and increasing their bandwidths. It might be challengingto build a model that can precisely attain the radiation effects,considering a realistic human head geometry. However, it hasalready been demonstrated that the inclusion of a spherical baf-fle (a simplified approximation of the actual head geometry)around the 3D vocal tract model offers promising results forvowel production (Arnela, Guasch, and Alías 2013). Since the2D model uses a mid-sagittal cut of a 3D vocal tract, the spheri-cal baffle can be replaced by a circular baffle for the 2.5D ra-diation model as shown in
Figure 2(a) . Moreover, to illus-trate acoustic waves’ propagation emanating from the mouthtoward infinity (
Figure 2(b) ), the computational domain can beextended out of the vocal tract by using perfectly matched lay-ers (PMLs). PMLs help in eliminating spurious wave reflectionat the domain boundaries.We create a fixed-sized circular baffle with a diameter of0.20m around the 2.5D vocal tract contour inside the simulationdomain. The circular baffle center has been adjusted to fit inthe entire vocal tract contour while connecting its exit cross-section. Inside the 2.5D scheme, the region enclosed by thevocal tract and baffle boundary is set to air cells. In order toigure 3: 2.5D vocal tract tube representation for vowel /a/.absorb outgoing waves and approximate infinite space, layersof PML are added at all sides of the domain boundary.
3. Experimental Setup
A comparative study was carried out for two different scenar-ios to characterize the acoustic features of the 2.5D vocal tractmodel: (1) numerical simulation with computational domainended at the mouth aperture by imposing Dirichlet boundaryconditions (open-end boundary condition) and (2) numericalsimulation with free-field radiation, which includes head geom-etry and PML layers (radiation condition). These conditionswere examined using the transfer functions analysis approachfor five different vocal tract shapes(vowel sounds: /a/, /e/, /i/,/o/ and /u/). The 2.5D vocal tract shapes were derived usingStory’s area function dataset (Story 2008), as it is the standardone for the vocal tract acoustic analysis. We have only con-sidered the vocal tract geometry as a straight tube with circularcross-sections for this study as shown in
Figure 3 .For open-end boundary conditions, we use a precise 3DFEM vocal tract model as the reference for the transfer functionanalysis. A virtual microphone is placed 3mm inside the mouthopening to record pressure variations. In the case of free-fieldradiation, we compare the transfer functions of the 2.5D modelwith Story’s calculated formants, obtained from the frequency-domain analysis of vocal tracts (Sondhi and Schroeter 1987).And the microphone is positioned 3mm outside of the mouth tocapture the radiating acoustic waves. Courant-Friedrichs-Lewy(CFL) stability condition in two dimensions: ∆ t ≤ ∆ s/ √ c isimposed, where c is the speed of sound.The simulation generates ms of synthesized audio for ev-ery vowel sounds. During the simulation, we set the followingphysical parameters fixed: air density ρ = 1 . kg/m , bound-ary admittance µ = 0 . and sound speed c = 350 m/s.We implement both the 2D and 2.5D vocal tract models in theMATLAB environment. The simulation runs on a workstationequipped with an Intel Core i7-8700K processor. The customcode for the 2.5D vocal tract model is publicly available here .
4. Model Validation
We follow the below steps to analyze the acoustic behaviour ofthe 2.5D model: (A)
Transfer function analysis with open-end boundary con-dition for the 2D FDTD and 2.5D FDTD vocal tractmodels (vowel sound: /a/, /i/ and /u/). (B)
Model performance evaluation (simulation run-rime). https://github.com/Debasishray19/Talking-Tube (C) Transfer function analysis for the 2.5D vocal tract modelwith free-field radiation (vowel sound: /a/, /e/, /i/. /o/ and/u/).For Step (A) and (B), we simulate each vowel sound un-der three different spatial grid resolution for both 2D and 2.5DFDTD methods: low ( ∆ s = 0 . mm), mid ( ∆ s = 0 . mm)and high ( ∆ s = 0 . mm). The high spatial resolution shouldproduce precise acoustic output as it yields vocal tract geome-try with greater details. However, it requires a larger simulationdomain size. The grid resolution values are empirically derived.Step (A) and (B) examine the improvement in a 2.5D vocal tractmodel over its 2D simulation due to the addition of tube depth.Step (C) investigates the change in transfer function with theinclusion of a circular baffle for the radiation model.
5. Result
We applied Fast Fourier Transformation (FFT) to the recordedpressure waves to obtain their transfer function. Formantswere extracted by considering local maxima in transfer func-tion curves. Here we are only analyzing the first three formantssince they are mainly responsible for distinguishing vowelsounds (Vampola et al. 2015). The formants for the 3D FEMmodel can be found here (Arnela Coll 2015).
Transfer function analysis for the 2D and 2.5D simulation withopen-end boundary condition .Resolution ( ∆ s ) F1 F2 F3low (% Error) 660(-5.16)700(0.57) 1040(-2.61)1040(-2.61) 2960(-2.33)3020(-0.36)mid (% Error) 660(-5.16)680(-2.29) 1040(-2.61)1060(-0.74) 3000(-1.02)3060(0.95)high(% Error) 700(0.57)680(-2.29) 1060(-0.74)1040(-2.62) 3020(-0.36)3040(0.29)Table 1: Positional errors of the first three formants computedfor vowel sound /a/, with respect to 3D FEM model.Resolution ( ∆ s ) F1 F2 F3low (% Error) 240(-8.74)260(-1.14) 2120(0.42)2160(2.32) 3020(0.33)3060(1.66)mid (% Error) 260(-1.14)260(-1.14) 2140(1.37)2140(1.37) 3000(-0.33)3040(0.99)high(% Error) 260(-1.14)260(-1.40) 2140(1.37)2160(2.32) 3020(0.33)3040(0.99)Table 2: Positional errors of the first three formants computedfor vowel sound /i/, with respect to 3D FEM model.Resolution ( ∆ s ) F1 F2 F3low (% Error) 260(0.38)260(0.38) 700(-7.52)720(-4.88) 2260(-0.17)2300(1.59)mid (% Error) 220(-15.0)260(0.38) 600(-20.7)700(-7.52) 2260(-0.17)2300(1.59)high(% Error) 260(0.38)260(0.38) 700(-7.52)720(-4.88) 2280(0.70)2300(1.59)Table 3: Positional errors of the first three formants computedfor vowel sound /u/, with respect to 3D FEM model. Colormap: For 2D simulation, For 2.5D simulation. All the formant frequencies are in Hz. .2. Step B: Model Performance
The computational cost of FDTD vocal tract model dependsupon the spatial grid resolution and simulation domain size. Weimplement an algorithm that enforces an optimized domain sizefor FDTD models. The table below presents the simulation do-main size and time duration corresponding to each grid resolu-tion for vowel /u/.Resolution ( ∆ s ) Domain Size( width × height ) Duration (Approx.)( in seconds )low ( . mm ) × . mm ) ×
85 4 . e+ high( . mm ) ×
127 1 . e+ Table 4: Simulation run-time at different grid resolutions.
The table below presents transfer function analysis for the 2.5Dvocal tract models with the free-field radiation. We have onlyconsidered the low grid resolution ( ∆ s = 0 . mm) for simula-tion of vocal tract radiation models.Vowel F1 F2 F3/a/ − .
78% 10208 .
28% 30001 . /e/ − .
75% 19201 .
74% 2320 − . /i/ − .
00% 2080 − .
75% 2960 − . /o/ − .
80% 760 − .
04% 2340 − . /u/ − .
19% 7202 .
56% 2180 − . Table 5: Formant frequencies and the percentage positionalerrors of the first three formants for the vocal tract radiationmodel.
6. Discussion & Conclusion
The transfer function analysis is a standard practice to validateacoustic features of area-based articulatory models. In the firststep of our evaluation, the 2.5D FDTD simulation produces po-sitional errors below at a low grid resolution. Moreover,this characteristic is consistent across all three vowels. How-ever, the positional error is high for the 2D FDTD simulationat a low grid resolution in most cases. Table 1, 2 and 3 showthat the acoustic outputs of the 2D FDTD model can be tunedby increasing its computational grid resolution whereas this ap-proach doesn’t make much difference for the 2.5D FDTD simu-lation.
Table 4 shows for a real-time speech synthesizer the lowgrid resolution is always desirable.In the case of free-field of radiation,
Table 5 presents thefirst three formants and their positional errors. The positionalerror of the first formant is high across all the vowels. And thesecond and third formants produce relatively lesser error val-ues. However, these positional errors are determined relative toStory’s 1D vocal tract model, an oversimplified representationof the actual vocal tract. Hence, as future work, the 2.5D radia-tion model’s acoustic characteristics need to be compared witha 3D vocal tract radiation model.Additionally, our current 2.5D model should only be seen assimple approximations of the voice radiation mechanism. Nev- ertheless, the influence of few head details like the nose andcross-section of the mouth aperture is worth studying. As wellas, our current 2.5D FDTD model does not include boundarylosses for the tube depth. All these current limitations need tobe incorporated for the future development of the 2.5D vocaltract acoustic model.
7. Acknowledgements
This work is supported by the Natural Sciences and EngineeringResearch Council (NSERC).
8. References
Allen, Andrew and Nikunj Raghuvanshi (2015). “Aerophones in flat-land: Interactive wave simulation of wind instruments”. In:
ACMTransactions on Graphics (TOG)
The Journal of the Acoustical Society of America
TheJournal of the Acoustical Society of America arXiv preprintarXiv:1811.07435 .Mohapatra, Debasish Ray, Victor Zappi, and Sidney Fels (2019). “Anextended two-dimensional vocal tract model for fast acoustic simu-lation of single-axis symmetric three-dimensional tubes”. In: arXivpreprint arXiv:1909.09585 .Sondhi, Man and Juergen Schroeter (1987). “A hybrid time-frequencydomain articulatory speech synthesizer”. In:
IEEE Transactions onAcoustics, Speech, and Signal Processing
Tenth Annual Conference ofthe International Speech Communication Association .Story, Brad H (2008). “Comparison of magnetic resonance imaging-based vocal tract area functions obtained from the same speakerin 1994 and 2002”. In:
The Journal of the Acoustical Society ofAmerica
The Journal of theAcoustical Society of America
Logopedics PhoniatricsVocology
IEEETransactions on antennas and propagation
Acoustical science and technology