Sound Field Translation and Mixed Source Model for Virtual Applications with Perceptual Validation
Lachlan Birnie∗§, Thushara Abhayapala∗, Vladimir Tourbabin†, Prasanga Samarasinghe∗
∗Audio & Acoustic Signal Processing Group, The Australian National University, Canberra, Australia
†Facebook Reality Labs, Redmond, Washington, USA
§This research is supported by an Australian Government Research Training Program (RTP) Scholarship and Facebook Reality Labs.
Abstract—Non-interactive and linear experiences like cinema film offer high-quality surround sound audio to enhance immersion; however, the listener's experience is usually fixed to a single acoustic perspective. With the rise of virtual reality, there is a demand for recording and recreating real-world experiences in a way that allows the user to interact and move within the reproduction. Conventional sound field translation techniques take a recording and expand it into an equivalent environment of virtual sources. However, the finite sampling of a commercial higher order microphone produces an acoustic sweet-spot in the virtual reproduction. As a result, the technique still restricts the listener's navigable region. In this paper, we propose a method for listener translation in an acoustic reproduction that incorporates a mixture of near-field and far-field sources in a sparsely expanded virtual environment. We perceptually validate the method through a Multiple Stimulus with Hidden Reference and Anchor (MUSHRA) experiment. Compared to the planewave benchmark, the proposed method offers both improved source localizability and robustness to spectral distortions at translated positions. A cross-examination with numerical simulations demonstrated that the sparse expansion relaxes the inherent sweet-spot constraint, leading to the improved localizability for sparse environments. Additionally, the proposed method is seen to better reproduce the intensity and binaural room impulse response spectra of near-field environments, further supporting the strong perceptual results.
Index Terms—Sound field navigation, translation, virtual-reality reproduction, binaural synthesis, MUSHRA, higher order microphone.
I. INTRODUCTION
Virtual reality devices will provide a novel framework for people to interact with each other at a higher social bandwidth through immersive audio and visual reproductions of the real world [1], [2]. For example, in the future a person may be able to experience a live concert or orchestral performance through a virtual reproduction in their own home [3]. To complete the immersive experience, the listener/viewer should be allowed to explore and interact with the virtual reproduction [4]. Subsequently, methods to accurately record and model the perceptual change in visual and auditory information as the user translates are required to maintain the original experience. Camera arrays have been used to capture visual information at multiple points-of-view for use in virtual reproductions [5]. Similarly, microphones distributed about an environment can record the spatial auditory scene from multiple points-of-view [6]. However, hardware and feasibility restrictions limit the continuous space that can be recorded, and as a result, during reproduction the listener is usually stuck in the fixed acoustic perspective of the microphone [7].

Recently, there have been two key approaches towards extending the auditory range that a listener can navigate inside a virtual sound field reproduction: an interpolation-based approach [8] and an extrapolation-based approach [9]. Interpolation approaches utilize a grid of higher order microphones distributed about the acoustic space, and interpolate the sound to the listener during reproduction [10], [11]. Better coloration and localization performance is expected from interpolation than extrapolation [8]. However, interpolation may not be feasible for all real-world scenarios due to the large spatial, hardware, and synchronization costs associated with constructing a microphone grid [12]. Furthermore, listeners are typically confined to the interior region of the grid [13], and sound sources within the grid may cause comb-filtering spectral distortions [14]. Methods that alleviate these drawbacks and allow the listener to translate beyond the grid have been developed; however, they usually require additional localization and separation of direct sound field components [15], [16].

On the other hand, the extrapolation-based approach expands a single higher order microphone's recording outwards to the translated listener's position [17]. As a result, extrapolation overcomes many of the hardware and spatial drawbacks of the interpolation approach. Because a single microphone is utilized, the audio and visual capture system can occupy a single seat in the audience of a live event, which causes less obstruction and allows for more impromptu recordings.

Many extrapolation-based sound field translation methods have been developed to allow listener navigation in virtual reproduction, such as Ambisonic [17], [18], harmonic re-expansion [19], discrete source [20], and point-source distribution [21] methods. One of the most popular extrapolation-based methods, which we consider to be the benchmark in this paper, is the planewave method [22]. In this method, the recorded acoustic environment is expanded into a secondary distribution of virtual planewave sources [23]. The secondary virtual environment constructs a sound field that is extrapolated from the microphone and is equivalent to the recording.
The listener can perceptually move about the reproduction by translating the secondary environment's sound field [20], [22].

In practice, however, most extrapolation-based approaches, including the planewave method, are constrained by the higher order microphone used for recording [17]. Hardware limitations result in an approximate and truncated sound field recording that is confined to a finite region [24], where the truncated recording is restricted by both the upper frequency band and the microphone radius [25]. As a result, the listener can only navigate within a small acoustic sweet-spot of a few centimeters, which is defined by the commercial microphone's size [26]. Attempting to navigate beyond this inherent sweet-spot region, even after extrapolation, results in spectral distortions [27]–[29], degraded source localization [17], [30], and a poor perceptual listening experience.

Studies have shown that compensating for near-field effects contributes to better sound field reproduction [31]. Some reproduction methods have been able to model near-field sources with the use of prior knowledge of the source position or additional source localization processing [16], [32]–[34]. However, the planewave translation method is limited by its far-field virtual source model, which makes the reproduction of near-field propagation difficult [35].

In this paper we propose an alternative secondary source model for an extrapolated virtual reproduction that enables a mixture of near-field and far-field propagation. The method is built upon the benchmark planewave method, which we review in Section II. We expand the truncated recording of a commercial higher order microphone [26], [36], [37] into a mixture of secondary virtual sources that are distributed in both the near-field and far-field (Section III), to create a more perceptually accurate reproduction. In addition to the source mixture, we also propose using an L1-norm regularization [38] to sparsely expand the recording into the equivalent virtual environment (Section III-C), as it has been shown to help extrapolate mode-limited sound fields [39]–[41].

We initially proposed the near-field far-field source mixture in [42] without any substantial verification of the method's perceptual performance. In this paper, we study the perceptual aspects of the source mixture through a perceptual listening test with human subjects and investigate the results against numerical simulations of the extrapolated pressure and intensity fields. The perceptual evaluation presented in Section IV utilizes a MUltiple Stimulus with Hidden Reference and Anchor (MUSHRA) [43], [44] framework adapted for use in a virtual environment to provide listeners with both an auditory and a visual reference of the real-world environment [12], [33]. We compare four translation methods with differing virtual source models and expansion techniques. We test the methods for the reproduction of human speech and music against the metrics of source localizability and robustness to spectral distortions. We will show that the proposed method offers greater perceptual accuracy and a more immersive experience for listeners moving throughout an expanded virtual reproduction. In Section V, we conduct a simulation study and show that the proposed method's perceptual performance is likely due to its ability to better reconstruct the near-field pressure and intensity of the original environment. We give our concluding remarks and suggestions for future work in Section VI.
II. PROBLEM FORMULATION AND THE PLANEWAVE SOUND FIELD TRANSLATION METHOD
In this section, we formulate the problem of reconstructing a recorded real-world experience such that a listener is able to perceptually move through the acoustic reproduction. We first present the process of recording a general sound field with a commercial higher order microphone. We then review the planewave sound field translation method presented in [22], which segments the reproduction into three parts. First, a virtual acoustic environment is built from a superposition of planewave sources. Second, planewave driving signals are estimated from the recording to model an equivalent acoustic environment. Third, a listener is placed inside the virtual equivalent environment and binaural signals are rendered as they move. We provide a discussion on the perceptual shortcomings of this planewave translation method at the end.
A. Sound Field Capture
Consider a real-world acoustic environment that contains multiple sound sources, for example, a musical performance with many instruments. Let the origin o denote the center of the environment's listening space, such as a seat in the middle of the audience. Each sound source is positioned at z = (r, θ, φ) with respect to o, where θ ∈ [0, π] is the elevation angle downwards from the z-axis, and φ ∈ [0, 2π) is the azimuth angle counterclockwise from the x-axis. For a listener in the audience at position d, the true sound they experience in the real world can be described by

{}^{(\mathrm{real})}P_{\{l,r\}}(k, d) = \sum_{\mu=1}^{U} H_{\{l,r\}}(k, z_\mu; d) \, s_\mu(k),   (1)

where ^(real)P_{l,r}(k, d) is the pressure at the listener's left and right ear, H_{l,r}(k, z; d) is the transfer function between each source and the listener's ears, or simply the Head-Related Transfer Function (HRTF) when the listener is in a free-field space without any reflections, s_µ(k) is the sound signal of the µ-th source, µ = (1, ..., U), k = 2πf/c is the wave number, f is the frequency, and c is the speed of sound. From here on, we assume H to be the free-field HRTF for simplicity.

The aim is to record and reproduce the real-world auditory experience of (1) for every possible listening position. The homogeneous sound field that encompasses every arbitrary listening position x, where |x| < |z|, can be expressed through a spherical harmonic decomposition of [45]

P(k, x) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \alpha_{nm}(k) \, j_n(k|x|) \, Y_{nm}(\hat{x}),   (2)

where |·| ≡ r, ^· ≡ (θ, φ), n and m are index terms denoting spherical harmonic order and mode, respectively, j_n(·) are the spherical Bessel functions of the first kind, Y_nm(·) are the set of spherical harmonic basis functions, and α_nm(k) are the sound field's coefficients, which completely describe the source-free acoustic environment centered about o when α_nm(k) is known for all n ∈ [0, ∞).

In practice, the real-world acoustic environment can be recorded with an N-th order microphone by estimating the sound field's α_nm(k) coefficients for a finite set of n ∈ [0, N]. Consider an N-th order microphone centered at o, such as a spherical [26] (or planar [46], [47]) microphone array. The microphone array consists of q = (1, ..., Q) pressure sensors that enclose the spherical acoustic region (listening space) of radius |x_Q| to be recorded. The sound field within this region can be estimated with [48]

\alpha_{nm}(k) \approx \frac{1}{b_n(k|x_Q|)} \sum_{q=1}^{Q} w_q \, P(k, x_q) \, Y^*_{nm}(\hat{x}_q), \quad n \in [0, N],   (3)

where w_q are a set of suitable sampling weights [49], and b_n(·) is the rigid baffle equation [45].

However, commercial N-th order microphones can only record a small acoustic region (|x_Q| of a few centimeters [26]) due to the hardware complexity and size constraint trade-offs with the spatial sampling Nyquist theorem [24]. The microphone's truncation order is restricted by the limited number of sensors, such that Q ≥ (N + 1)^2. Furthermore, the microphone's recording region and frequency range are balanced by the N = ⌈k|x_Q|⌉ rule [50]. These two microphone properties define a maximum |x_Q| inside which the sound field is effectively of order ≤ N.
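The order-limited capture in (3) maps directly onto a short numerical routine. The following is a minimal sketch (not the authors' implementation), assuming a rigid spherical baffle for b_n(·) and SciPy's spherical harmonic convention; the sensor positions, weights, radius, and order in the usage example are arbitrary placeholders.

```python
import numpy as np
from scipy.special import sph_harm, spherical_jn, spherical_yn

def h_n(n, z):
    # spherical Hankel function of the first kind
    return spherical_jn(n, z) + 1j * spherical_yn(n, z)

def b_n(n, ka):
    # rigid-sphere baffle term evaluated at the array surface (one common convention)
    dj = spherical_jn(n, ka, derivative=True)
    dh = spherical_jn(n, ka, derivative=True) + 1j * spherical_yn(n, ka, derivative=True)
    return spherical_jn(n, ka) - (dj / dh) * h_n(n, ka)

def capture_alpha(p_q, theta_q, phi_q, w_q, k, r_q, N):
    """Estimate alpha_nm for n <= N from Q sensor pressures p_q, as in (3)."""
    alpha = {}
    for n in range(N + 1):
        for m in range(-n, n + 1):
            # scipy's sph_harm is called as sph_harm(m, n, azimuth, colatitude)
            Y = sph_harm(m, n, phi_q, theta_q)
            alpha[(n, m)] = np.sum(w_q * p_q * np.conj(Y)) / b_n(n, k * r_q)
    return alpha

# Illustrative usage with random placeholder sensor data (values are arbitrary).
rng = np.random.default_rng(1)
Q, N, k, r_q = 36, 4, 2 * np.pi * 1000 / 343.0, 0.04
theta_q = np.arccos(rng.uniform(-1, 1, Q))      # colatitude of each sensor
phi_q = rng.uniform(0, 2 * np.pi, Q)            # azimuth of each sensor
p_q = rng.normal(size=Q) + 1j * rng.normal(size=Q)
alpha = capture_alpha(p_q, theta_q, phi_q, np.full(Q, 4 * np.pi / Q), k, r_q, N)
```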
Beyond |x_Q|, the reconstructed sound field requires higher orders > N, which are unknown, resulting in truncation error that degrades perceptual accuracy.

When reconstructing (1) from the recording, the left and right ear signals for the listener can be reassembled in the spherical harmonic domain by [51], [52]

{}^{(\mathrm{mic})}P_{\{l,r\}}(k, o) = \sum_{n=0}^{N} \sum_{m=-n}^{n} H^{nm}_{\{l,r\}}(k) \, \alpha_{nm}(k),   (4)

where H^{nm}_{l,r}(k) are the spherical harmonic decomposition coefficients of the HRTF H_{l,r}(k, z; o). In the reproduction (4), truncation forces the listener to the fixed auditory perspective of the microphone at o. If the listener attempts to move, they would immediately translate beyond the |x_Q| boundary and begin to experience spectral distortions, degraded source localization performance, and a loss in perceptual immersion.

The objective of this paper is to relax this sweet-spot spatial constraint when reconstructing the sound field of a commercial microphone recording, and to build an equivalent virtual environment that allows a listener to move about the acoustic space with sustained perceptual immersion. For the remainder of this section we review the planewave sound field translation method that we consider to be the baseline method for enabling listener navigation.

B. Planewave Distribution
The planewave sound field translation method aims to construct a virtual acoustic environment that is perceptually equivalent to the real-world recording. The building block of this virtual environment is the planewave source, whose sound field is modeled as

P(k, x) = \frac{e^{-ik\hat{y} \cdot x}}{4\pi},   (5)

where ŷ denotes the planewave's incident direction. It is known that any acoustic free field can be modeled by an infinite superposition of planewaves [23]. Therefore, the equivalent virtual environment is constructed from a spherical distribution of virtual planewave sources, expressed as

{}^{(\mathrm{pw})}P(k, x) = \int \psi(k, \hat{y}; o) \, \frac{e^{-ik\hat{y} \cdot x}}{4\pi} \, d\hat{y},   (6)

where ψ(k, ŷ; o) denotes the driving function of the planewave distribution as observed at o. If the driving function is modeled correctly then the planewave distribution can re-create the acoustic environment, such that ^(pw)P(k, x) = ^(real)P(k, x). To achieve this, the driving function needs to be estimated/expanded from the recorded α_nm(k) coefficients, which we describe next.

C. Planewave Expansion
The sound field about o due to a single virtual planewave can be expressed by the decomposition of [45]

\frac{e^{-ik\hat{y} \cdot x}}{4\pi} = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} (-i)^n \, Y^*_{nm}(\hat{y}) \, j_n(k|x|) \, Y_{nm}(\hat{x}).   (7)

Additionally, the driving function centered at o can also be expressed in terms of a harmonic decomposition, given as

\psi(k, \hat{y}; o) = \sum_{n'=0}^{\infty} \sum_{m'=-n'}^{n'} \beta_{n'm'}(k) \, Y_{n'm'}(\hat{y}),   (8)

where β_n'm'(k) are the spherical harmonic decomposition coefficients of ψ(k, ŷ; o), which describe the sound field about the planewave distribution. Substituting both (7) and (8) into (6) gives the planewave distribution's sound field in spherical harmonics, as

{}^{(\mathrm{pw})}P(k, x) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \underbrace{(-i)^n \, \beta_{nm}(k)}_{\alpha_{nm}(k)} \, j_n(k|x|) \, Y_{nm}(\hat{x}).   (9)

From (9), the relationship between the β_nm(k) coefficients and the recorded α_nm(k) coefficients can be extracted. Rearranging this relationship as β_nm(k) = i^n α_nm(k) expresses a planewave distribution that is equivalent to the recorded environment. Substituting this relationship back into (8) gives a closed-form expansion for a planewave driving function that matches the recording,

\psi(k, \hat{y}; o) = \sum_{n=0}^{N} \sum_{m=-n}^{n} i^n \, \alpha_{nm}(k) \, Y_{nm}(\hat{y}).   (10)

Synthesizing a virtual environment with this driving function through (6) produces a sound field that is equivalent to the recording. However, the recording (3) is only an approximation of the real environment, and therefore (10) is also approximate, such that ^(pw)P(k, x) ≡ ^(mic)P(k, x) ≈ ^(real)P(k, x).
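As a quick illustration of (10), the closed-form planewave driving signals follow from the recorded coefficients by a weighted spherical harmonic synthesis. A minimal sketch, assuming the α_nm dictionary produced by the capture sketch above:

```python
import numpy as np
from scipy.special import sph_harm

def planewave_driving(alpha, N, theta_l, phi_l):
    """Closed-form driving signals psi(k, y_hat_l; o) of (10) at L discrete directions."""
    psi = np.zeros(len(theta_l), dtype=complex)
    for n in range(N + 1):
        for m in range(-n, n + 1):
            # i^n * alpha_nm * Y_nm(y_hat_l), accumulated over all (n, m) up to order N
            psi += (1j ** n) * alpha[(n, m)] * sph_harm(m, n, phi_l, theta_l)
    return psi
```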
D. Planewave Auralization

Fig. 1: Illustration of the equivalent virtual planewave distribution. The listener's perspective is fixed at the distribution center o, where a phase shift applied to the driving function translates the sound field about the listener.

A listener inside the planewave distribution is immersed within a spatial reproduction of the real-world acoustic environment. The binaural signals for the listener at the distribution center can be presented by exchanging their HRTF into (6), giving

{}^{(\mathrm{pw})}P_{\{l,r\}}(k, o) = \int \psi(k, \hat{y}; o) \, H_{\{l,r\}}(k, \hat{y}; o) \, d\hat{y}.   (11)

Furthermore, the planewave distribution allows the listener to move perceptually about the reproduction. The sound heard by the listener who is translated to x = [o + d] ≡ d can be derived from (6) as

{}^{(\mathrm{pw})}P(k, [o + d]) = \int \psi(k, \hat{y}; o) \, \frac{e^{-ik\hat{y} \cdot [o + d]}}{4\pi} \, d\hat{y} = \int \psi(k, \hat{y}; o) \, e^{-ik\hat{y} \cdot d} \, \frac{e^{-ik\hat{y} \cdot o}}{4\pi} \, d\hat{y}.   (12)

It is observed from (12) that the translation in space differs only by a phase shift in the planewave driving function. Therefore, applying the translational phase shift of [35]

\psi(k, \hat{y}; d) = \psi(k, \hat{y}; o) \times e^{-ik\hat{y} \cdot d},   (13)

to the binaural signals in (11) allows the listener to dynamically move their acoustic perspective by

{}^{(\mathrm{pw})}P_{\{l,r\}}(k, d) = \int \psi(k, \hat{y}; d) \, H_{\{l,r\}}(k, \hat{y}; o) \, d\hat{y}.   (14)

In practice, the virtual planewave distribution (6) can be realized with a discrete set of known sources,

{}^{(\mathrm{pw})}P(k, x) \approx \sum_{\ell=1}^{L} w_\ell \, \psi(k, \hat{y}_\ell; o) \, \frac{e^{-ik\hat{y}_\ell \cdot x}}{4\pi},   (15)

where ℓ = (1, ..., L) indexes each virtual planewave, L is the total number of sources, and w_ℓ are a set of suitable sampling weights. Similarly, the dynamic binaural signals can be realized from the discrete distribution with

{}^{(\mathrm{pw})}P_{\{l,r\}}(k, d) \approx \sum_{\ell=1}^{L} w_\ell \, \psi(k, \hat{y}_\ell; d) \, H_{\{l,r\}}(k, \hat{y}_\ell; o).   (16)

We illustrate this planewave approach to sound field translation in Fig. 1. The reproduction is expressed by many discrete planewave signals that are known continuously throughout the virtual environment. Therefore, the planewave method does not explicitly limit the amount the listener can translate. However, (16) uses ψ(k, ŷ; o), which is estimated through (3) and (10). As a result, the N-th order truncation inherently remains, and the listener's movement is still implicitly limited.
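The translation step (13) and the discrete rendering (16) amount to a per-direction phase shift followed by an HRTF-weighted sum. A hedged sketch, where hrtf(k, ŷ) is an assumed lookup returning the left and right HRTF values at one frequency (an illustrative interface, not a real library API):

```python
import numpy as np

def planewave_binaural(psi, w_l, y_hat_l, d, k, hrtf):
    """Discrete rendering (16): apply the phase shift (13), then sum HRTF-weighted planewaves."""
    shift = np.exp(-1j * k * (y_hat_l @ d))        # e^{-ik y_hat . d} for each direction
    p_left = p_right = 0.0 + 0.0j
    for l in range(len(psi)):
        H_left, H_right = hrtf(k, y_hat_l[l])      # HRTF perspective stays at the origin o
        p_left += w_l[l] * psi[l] * shift[l] * H_left
        p_right += w_l[l] * psi[l] * shift[l] * H_right
    return p_left, p_right
```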
E. Discussion

The virtual planewave expansion enables listener translation; however, some shortcomings are still exhibited in the listener's perception:
• As mentioned, the planewave method inherits truncation artifacts through an over-approximation of (10), and the listener's movement remains inherently restricted inside the virtual reproduction. As the listener translates further away from the recording's sweet-spot, they begin to experience spectral distortions, a loss in source localization, and poorer perceptual accuracy.
• The planewave expansion has difficulties in synthesizing near-field sound sources due to its far-field source model.
• The planewave auralization (16) fixes the HRTF perspective to the virtual distribution center, H_{l,r}(k, ŷ_ℓ; o), and performs translation by phase shifting the sound field with (13). However, the HRTF propagation vectors ŷ_ℓ; o remain un-shifted, and as a result, the HRTF models un-translated head reflections as the listener moves.

In the next section we propose an alternative sound field translation model to address the above shortcomings.

III. MIXEDWAVE SOUND FIELD TRANSLATION METHOD
In this section, we define a virtual source that models both a near-field and a far-field propagation, which we will refer to as a mixedwave source. We then build a virtual distribution of mixedwave sources and expand a real-world recording into an equivalent sound field. Additionally, we also propose a sparse method for expanding a virtual source distribution that alleviates some of the spatial restrictions imposed by the truncated recording.
A. Mixture of Near-Field and Far-Field Sources
Here, we define the virtual source that will be the building block for our proposed method. Consider a near-field point-source at y, where the driving signal of the source with respect to itself is denoted ψ̇(k, y). We can express the driving function observed at a position x with [45]

\psi(k, y; x) = \dot{\psi}(k, y) \, \frac{e^{ik\|y - x\|}}{\|y - x\|}.   (17)

Evaluating (17) when x = o gives the driving function observed by a receiver/microphone, as

\psi(k, y; o) = \dot{\psi}(k, y) \, \frac{e^{ik|y|}}{|y|}.   (18)

Rearranging (18) gives an expression for the source signal in terms of the source's distance and the driving function observed by the receiver,

\dot{\psi}(k, y) = \psi(k, y; o) \, |y| \, e^{-ik|y|}.   (19)

Substituting (19) back into (17) provides the driving function observed at any arbitrary point x in terms of the function observed by the receiver/microphone, expressed as

\psi(k, y; x) = \underbrace{\psi(k, y; o) \, |y| \, e^{-ik|y|}}_{\dot{\psi}(k, y)} \, \frac{e^{ik\|y - x\|}}{\|y - x\|}.   (20)

We note that the |y| e^{-ik|y|} term can be seen to have redefined the point-source from being a function with respect to itself to being a function with respect to o. This allows us to observe the source distribution at o with a microphone and estimate the sound at any translated position x.

Additionally, the constant term has the property of [25]

\lim_{|y| \to \infty} |y| \, e^{-ik|y|} \, \frac{e^{ik\|y - x\|}}{4\pi\|y - x\|} = \frac{e^{-ik\hat{y} \cdot x}}{4\pi},   (21)

which allows a mixture of near-field and far-field virtual source distributions to be modeled with this building block. We define this building block as the mixedwave source,

P(k, x) = |y| \, e^{-ik|y|} \, \frac{e^{ik\|y - x\|}}{4\pi\|y - x\|}.   (22)

In the spherical harmonic domain,

|y| \, e^{-ik|y|} \, \frac{e^{ik\|y - x\|}}{4\pi\|y - x\|} = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} ik \, |y| \, e^{-ik|y|} \, h_n(k|y|) \, Y^*_{nm}(\hat{y}) \, j_n(k|x|) \, Y_{nm}(\hat{x}),   (23)

where h_n(·) is the spherical Hankel function of the first kind. We note that the spherical Hankel functions also have

\lim_{|y| \to \infty} ik \, |y| \, e^{-ik|y|} \, h_n(k|y|) = (-i)^n,   (24)

to correspond with (21). We can observe from (24) that when the mixedwave source is placed in the far-field, the definition of (23) will match that of the planewave source (7). This property then allows for a near-field sound propagation to be modeled by a mixedwave distribution with a small radius, and a far-field sound propagation to be modeled by a mixedwave distribution with a large radius. We will use this near-field and far-field distribution of mixedwave sources as the basis of our proposed sound field translation method next.
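The far-field behavior in (21) and (24) can be checked numerically: as the mixedwave source radius grows, its field (22) converges to the planewave field (5). A small illustrative sketch (the frequency, observation point, and radii are arbitrary test values):

```python
import numpy as np

def mixedwave_field(k, y, x):
    """Mixedwave source field (22) for a source at y observed at x."""
    r_y = np.linalg.norm(y)
    r_yx = np.linalg.norm(y - x)
    return r_y * np.exp(-1j * k * r_y) * np.exp(1j * k * r_yx) / (4 * np.pi * r_yx)

k = 2 * np.pi * 500 / 343.0                 # 500 Hz, c = 343 m/s
y_hat = np.array([1.0, 0.0, 0.0])           # source direction
x = np.array([0.2, 0.1, 0.0])               # observation point near the origin
planewave = np.exp(-1j * k * (y_hat @ x)) / (4 * np.pi)
for R in (2.0, 20.0, 2000.0):               # near-field through far-field radii
    print(R, abs(mixedwave_field(k, R * y_hat, x) - planewave))   # difference shrinks with R
```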
B. Mixedwave Method for Sound Field Translation

Following the planewave translation method, our proposed mixedwave translation method is also broken into three parts.
1) Mixedwave Distribution:
We propose constructing a virtual equivalent sound field from two concentric spherical distributions of mixedwave sources. The first virtual sphere is placed in the near-field with a radius of R^(nf), and the second sphere is placed at R^(ff) in the far-field, such that

{}^{(\mathrm{mw})}P(k, x) = \int \psi(k, R^{(\mathrm{nf})}\hat{y}; o) \, R^{(\mathrm{nf})} e^{-ikR^{(\mathrm{nf})}} \, \frac{e^{ik\|R^{(\mathrm{nf})}\hat{y} - x\|}}{4\pi\|R^{(\mathrm{nf})}\hat{y} - x\|} \, d\hat{y} + \int \psi(k, R^{(\mathrm{ff})}\hat{y}; o) \, R^{(\mathrm{ff})} e^{-ikR^{(\mathrm{ff})}} \, \frac{e^{ik\|R^{(\mathrm{ff})}\hat{y} - x\|}}{4\pi\|R^{(\mathrm{ff})}\hat{y} - x\|} \, d\hat{y},   (25)

where ψ(k, Rŷ; o), R ∈ {R^(nf), R^(ff)}, are the driving functions of the two mixedwave distributions centered at o.
2) Mixedwave Expansion:
Following the procedure in Section II-C, we can decompose the ψ(k, Rŷ; o) driving function into spherical harmonic aperture coefficients β_n'm'(k, R), expressed as

\psi(k, R\hat{y}; o) = \sum_{n'=0}^{\infty} \sum_{m'=-n'}^{n'} \beta_{n'm'}(k, R) \, Y_{n'm'}(\hat{y}).   (26)

We substitute both (26) and (23) into (25) to extract the relationship between β_nm(k) and α_nm(k), given as

\beta_{nm}(k, R) = \frac{\alpha_{nm}(k)}{ikR \, e^{-ikR} \, h_n(kR)}.   (27)

Finally, we substitute (27) back into (26) to derive a closed-form expansion for the mixedwave driving functions in terms of the recorded coefficients,

\psi(k, R\hat{y}; o) = \sum_{n=0}^{N} \sum_{m=-n}^{n} \frac{\alpha_{nm}(k)}{ikR \, e^{-ikR} \, h_n(kR)} \, Y_{nm}(\hat{y}).   (28)

We use a set of real-world recorded coefficients α_nm(k) with (28) to estimate the driving functions of the near-field and far-field virtual distributions, such that ^(mw)P(k, x) ≡ ^(mic)P(k, x) ≈ ^(real)P(k, x).
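A sketch of the mixedwave closed-form expansion (28), applied separately for R^(nf) and R^(ff); the division by ikR e^{-ikR} h_n(kR) mirrors the reconstructed relationship (27). The helper names follow the earlier sketches and are assumptions, not the paper's code.

```python
import numpy as np
from scipy.special import sph_harm, spherical_jn, spherical_yn

def h_n(n, z):
    return spherical_jn(n, z) + 1j * spherical_yn(n, z)

def mixedwave_driving(alpha, N, k, R, theta_l, phi_l):
    """Closed-form driving signals (28) for one mixedwave sphere of radius R."""
    psi = np.zeros(len(theta_l), dtype=complex)
    for n in range(N + 1):
        scale = 1.0 / (1j * k * R * np.exp(-1j * k * R) * h_n(n, k * R))
        for m in range(-n, n + 1):
            psi += scale * alpha[(n, m)] * sph_harm(m, n, phi_l, theta_l)
    return psi

# Example radii matching Section IV: near-field at 2 m, far-field at 20 m.
# psi_nf = mixedwave_driving(alpha, N, k, 2.0, theta_l, phi_l)
# psi_ff = mixedwave_driving(alpha, N, k, 20.0, theta_l, phi_l)
```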
3) Mixedwave Auralization:
Consider a listener inside the virtual mixedwave distribution at the translated position x = [o + d] ≡ d, |d| < R^(nf), as shown in Fig. 2. We render the left and right binaural signals by applying the mixedwave driving function to the HRTF based on the listener's translated position, given as [17]

{}^{(\mathrm{mw})}P_{\{l,r\}}(k, d) = \int \psi(k, R\hat{y}; o) \, H_{\{l,r\}}(k, R\hat{y}; d) \, d\hat{y},   (29)

where Rŷ; d denotes the propagation direction of the mixedwave source with respect to d, which is given by (y − d). We note that this is possible for the mixedwave distribution due to the finite positions of each source, unlike the infinite definitions of planewave sources.

Fig. 2: Illustration of the equivalent virtual mixedwave sound field. The listener is translated to d, and the vectors (y_ℓ; d) are updated with the HRTF to auralize an immersive reproduction.

Once again, a set of discrete sources can be used to practically realize the virtual mixedwave distributions, expressed as

{}^{(\mathrm{mw})}P(k, x) \approx \sum_{\ell=1}^{L} w_\ell \, \psi(k, R^{(\mathrm{nf})}\hat{y}_\ell; o) \, R^{(\mathrm{nf})} e^{-ikR^{(\mathrm{nf})}} \, \frac{e^{ik\|R^{(\mathrm{nf})}\hat{y}_\ell - x\|}}{4\pi\|R^{(\mathrm{nf})}\hat{y}_\ell - x\|} + \sum_{\ell=1}^{L} w_\ell \, \psi(k, R^{(\mathrm{ff})}\hat{y}_\ell; o) \, R^{(\mathrm{ff})} e^{-ikR^{(\mathrm{ff})}} \, \frac{e^{ik\|R^{(\mathrm{ff})}\hat{y}_\ell - x\|}}{4\pi\|R^{(\mathrm{ff})}\hat{y}_\ell - x\|},   (30)

where the near-field and far-field distributions each contain L sources. Similarly, we realize the mixedwave auralization within the discrete virtual distributions by

{}^{(\mathrm{mw})}P_{\{l,r\}}(k, d) = \sum_{\ell=1}^{2L} w_\ell \, \psi(k, y_\ell; o) \, H_{\{l,r\}}(k, y_\ell; d),   (31)

where |y_ℓ| = R^(nf) for ℓ ∈ [1, L], |y_ℓ| = R^(ff) for ℓ ∈ [L+1, 2L], and y_ℓ; d is the propagation direction of the ℓ-th mixedwave source with respect to the translated listener. Unlike the planewave method, the maximum distance a listener can translate within the mixedwave environment is restricted by R^(nf). However, we suspect that R^(nf) can be selected to match the size of a small real-world room that is recorded.
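The auralization (31) differs from the planewave case only in that the HRTF direction is recomputed from each finite source position and the translated listener. A hedged sketch with the same assumed hrtf lookup interface as before:

```python
import numpy as np

def mixedwave_binaural(psi, w_l, y_l, d, k, hrtf):
    """Discrete auralization (31): the HRTF direction follows the translated listener at d."""
    p_left = p_right = 0.0 + 0.0j
    for l in range(len(psi)):
        v = y_l[l] - d                          # propagation vector of source l w.r.t. the listener
        H_left, H_right = hrtf(k, v / np.linalg.norm(v))
        p_left += w_l[l] * psi[l] * H_left
        p_right += w_l[l] * psi[l] * H_right
    return p_left, p_right
```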
C. Sparse Expansion Methods

The closed-form expansion constructs a virtual environment that is equivalent to the original recording. However, the expansion distributes energy ψ(k, y_ℓ; o) throughout all virtual sources. This causes an over-approximation of the truncated recording's underlying spatial artifacts. As a result, the amount a listener can translate before experiencing a loss in immersion is still inherently restricted by the recording's truncation. Furthermore, it is believed that modeling fewer virtual sources from propagation directions that are similar to the original environment will lead to better perceptual immersion [42]. For these reasons, we propose a sparse constrained expansion method for constructing our virtual mixedwave environment.

The coefficients α_nm(k) observed at the center of a virtual distribution can be expressed in matrix form as

A\psi = \alpha,   (32)

where α = [α_{00}(k), α_{1(-1)}(k), ..., α_{NN}(k)]^T are the recorded coefficients, ψ = [ψ(k, y_1; o), ..., ψ(k, y_L; o)] are the L equivalent virtual source driving signals, and A is the (N+1)^2 by L expansion matrix. The entries of A are given by (-i)^n Y^*_{nm}(ŷ_ℓ) for a planewave expansion (10), and by ik|y_ℓ| e^{-ik|y_ℓ|} h_n(k|y_ℓ|) Y^*_{nm}(ŷ_ℓ), with the number of columns doubling to 2L for the two source distributions, for a mixedwave expansion (28). We assume L > (N+1)^2 for the under-determined case.

We construct a sparse source distribution by solving the linear regression problem (32) using Iteratively Reweighted Least Squares (IRLS) [38]. In brief, the IRLS approach replaces the ℓ_p-objective function

\min_{\psi} \|\psi\|_p^p \quad \text{subject to} \quad A\psi = \alpha,   (33)

with a weighted ℓ_2-norm,

\min_{\psi} \sum_{i=1}^{L} w_i |\psi_i|^2 \quad \text{subject to} \quad A\psi = \alpha,   (34)

where w_i = |ψ_i^{(ν−1)}|^{p−2} are the weights computed from the previous iterate ψ^{(ν−1)}. The next iterate is given by

\psi^{(\nu)} = Q_\nu A^T \left( A Q_\nu A^T \right)^{-1} \alpha,   (35)

where Q_ν is the diagonal matrix with entries 1/w_i = |ψ_i^{(ν−1)}|^{2−p}. Other regularization techniques can also be utilized, such as the Least-Absolute Shrinkage and Selection Operator (Lasso) [53], [54], and we direct the reader to [55] for further information in regards to compressive sensing.
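The IRLS iteration (33)-(35) is a few lines per frequency bin. The sketch below assumes p = 1 and adds a small epsilon to the reweighting for numerical stability; both are implementation choices not specified in the text (see [38] for the reference algorithm).

```python
import numpy as np

def irls_expand(A, alpha, p=1.0, n_iter=30, eps=1e-8):
    """Sparse expansion of (32) via IRLS, iterating the update (35)."""
    psi = np.linalg.pinv(A) @ alpha                 # minimum-norm initialization
    for _ in range(n_iter):
        q = np.abs(psi) ** (2.0 - p) + eps          # diagonal of Q_nu, i.e. 1/w_i
        AQ = A * q                                  # equals A @ diag(q), without forming diag(q)
        psi = q * (A.conj().T @ np.linalg.solve(AQ @ A.conj().T, alpha))
    return psi
```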
D. Discussion

Continuing our discussion on the planewave method's shortcomings in Sec. II-E, we give the following comments:
• Sparsely expanding the virtual source distribution (32) with IRLS is expected to further enhance the perceptual immersion for a listener, as they should experience more localized virtual sources. Additionally, the sparsity relaxes the spatial sweet-spot restriction and over-approximation issue stemming from the closed-form expansion used by the planewave method. These properties are demonstrated by experiment in Section IV and by simulation in Section V.
• The mixedwave distribution can easily synthesize near-field sound sources. The modified point-source (22) can model a spherical-wave propagation by simply positioning the mixedwave source in the near-field.
• The mixedwave auralization (31) translates the HRTF with the listener. As a result, the propagation vectors in H_{l,r}(k, y_ℓ; d) are updated with d to render changes in head reflection, similar to virtual higher-order Ambisonics [17]. Intuitively, this is expected to result in greater perceptual immersion.

We examine the perceptual advantages of the sparse expansion and the mixedwave source model against the planewave benchmark experimentally in the next section.

IV. PERCEPTUAL EXPERIMENT
Our aim is to maintain the immersion for a listener inside an acoustic reproduction. Therefore, it is of crucial importance, foremost, that we evaluate the proposed method against the planewave benchmark in a perceptual listening experiment. This section outlines the perceptual experiment system we implemented and presents the statistical results at the end.
A. Experiment Methodology
1) Compared Methods:
We conducted a MUSHRA perceptual experiment to compare four translation methods. In total the experiment presented six signals:
• Reference / hidden reference: Signals of the true free-field transfer function between a real-world point-source and the translated listener, given by (1).
• Anchor: Signals of the truncated recording that is fixed spatially to the microphone's position (4). Sound field rotation is still rendered, but no translation is processed. This is similar to the three-degrees-of-freedom anchor used in [33].
• Benchmark / planewave closed-form (PW-CF): Signals rendered from a virtual planewave distribution (16) that is expanded through the closed-form expression (10).
• Planewave IRLS (PW-IRLS): Signals of an IRLS (Section III-C) sparsely expanded planewave distribution.
• Mixedwave closed-form (MW-CF): Signals rendered from a virtual mixedwave distribution (31) that is expanded through the closed-form expression (28).
• Proposed method / mixedwave IRLS (MW-IRLS): Signals of an IRLS sparsely expanded mixedwave distribution.

The experiment comprised four tests with two scoring metrics, source localization and basic audio quality, and two sound-signals, speech and music. The source localization test asked listeners to score the perceived direction of the sound-source, the source width, and the sound field sparseness with respect to both a visual reference and the reference signal. The basic audio quality test asked listeners to score against the reference for spectral distortions and other audible processing artifacts. In total, the scores of 17 participants were collected for the speech sound-source, and 11 scores for the music sound-source. The recording microphone was shown in the virtual environment, and listeners were informed that the further they translate, the greater the differences they should perceive between methods. We asked the listeners to score while accounting for each method's performance over a · m square reproduction space.
2) Experiment System:
We used an Oculus Rift along with a pair of Beyerdynamic DT 770 Pro headphones to track the listener and provide a visual reference of the true sound source. We used the HRTFs of the FABIAN head and torso simulator [56] from the HUTUBS dataset [57], [58] for auralization. The HRTFs were rotated for each test signal by multiplying the HRTF coefficients with Wigner-D functions [59]. Signals were processed with a fixed frame size and overlap, and at a sampling frequency chosen to meet the hardware constraints and computational costs of the real-time experiment.
3) Virtual Environment:
We simulated the real-world auditory experience with a single free-field point-source in order to generate a true experiment reference signal for the listener at every position. We constructed a virtual environment with o placed at the center, and the XY-plane raised above the ground to align with a listener's head while sitting. We modeled the true sound-source with a static point-source at (1, ·, ·) m. By true, we signify that the sound field generated by this point-source is denoted as the real-world auditory experience we record and reproduce.

Additionally, we also simulated the process of recording the truncated sound field of the true point-source. We used a rigid 36-sensor spherical microphone array centered at o. Microphone sensors were distributed at Fliege positions [60] with a small radius chosen to best represent a commercial microphone [26]. Recordings were generated by convolving the sound-source's signal with the microphone's impulse response. The α_nm(k) coefficients were extracted with (3) before being expanded into virtual distributions.

The planewave distribution consisted of L = 36 virtual sources at Fliege positions [60]. This selection was made as a trade-off with computation complexity. However, adding more planewaves is not expected to improve source localization performance, as the distribution already over-samples the order-limited recording [9], [28], [30]. Similarly, the mixedwave distribution consisted of two sets of L = 36 virtual sources at the same Fliege positions. The first set was distributed in the near-field at R^(nf) = 2 m, and the second was placed at R^(ff) = 20 m in the far-field.
4) Experiment Auralization:
The reference was rendered by convolving (in the frequency domain) the sound-source signal with the true source-to-listener HRTF. For the anchor (4), the signals were rendered by multiplying α_nm(k) with the spherical HRTF coefficients directly [52]. The planewave method signals were rendered with the convolution of the HRTF at o and the phase-shifted driving function (16). The phase shift was updated with the Oculus head position to render perceptual translation. For the mixedwave methods, the HRTFs were reconstructed between each source and the listener's translated position. Binaural signals were then rendered with the convolution of the mixedwave driving function and the y_ℓ-to-d HRTF (31).
B. Experiment Results
1) Box Plot:
Figure 3 shows the perceptual scores of the translation methods across all four tests. A Lilliefors test showed that our collected scores met the requirement for a normal distribution, and a Tukey-Kramer multiple comparison test was used to determine statistical significance. We discuss the results of these scores through an analysis of variance (ANOVA) examination.
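For reproducibility, the statistical pipeline described above can be assembled from standard SciPy/statsmodels routines. The sketch below uses synthetic placeholder scores purely to show the call sequence; it is not the authors' analysis script and the numbers carry no meaning.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.diagnostic import lilliefors
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
methods = ["PW-CF", "PW-IRLS", "MW-CF", "MW-IRLS"]
scores = {m: rng.normal(60 + 5 * i, 12, size=17) for i, m in enumerate(methods)}  # placeholder data

for m in methods:                                   # normality check per method
    stat, p = lilliefors(scores[m])
    print(m, "Lilliefors p =", round(p, 3))

F, p = f_oneway(*scores.values())                   # one-factor ANOVA across methods
print("ANOVA: F =", round(F, 2), "p =", round(p, 4))

data = np.concatenate(list(scores.values()))
labels = np.repeat(methods, 17)
print(pairwise_tukeyhsd(data, labels, alpha=0.05))  # Tukey-Kramer pairwise comparisons
```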
2) One-factor ANOVA results:
We used a one-factor ANOVA to determine if any of the translation methods performed significantly differently in each of the perception tests. For speech localization (Fig. 3a), both MW-CF and MW-IRLS showed a significant improvement in score (F(3, ·) = 6.·, p < 0.0·) compared to the PW-CF benchmark. Similar results (F(3, ·) = 16.·, p < 0.0·) are shown for speech quality (Fig. 3b), where the mixedwave methods were found to be significantly different to PW-IRLS in addition to the benchmark. In the music sound-source tests (Fig. 3c and Fig. 3d), only MW-IRLS showed significantly improved means over the benchmark, while MW-CF did not. However, MW-CF was still observed to perform well for music localization in Fig. 3c and music quality in Fig. 3d, as indicated by the significant median scores.
3) Two-factor ANOVA results:
We performed a two-factor ANOVA to compare the effects of source-type (planewave and mixedwave) and expansion-type (closed-form and IRLS). In all four tests (p ≤ 0.0·), mixedwave source distributions were found to score higher means than planewave distributions. A significant difference in expansion-type was only found in the speech sound-source tests, with IRLS showing better scores. For music localization (F(1, ·) = 3.·, p = 0.·) and music quality (F(1, ·) = 1.·, p = 0.·), no significant difference was found between closed-form and sparse expansions. Lastly, no interaction effects (p ≥ 0.0·) between virtual source-type and expansion-type were found.

Fig. 3: Box plots of the perceptual experiment scores: (a) source localization with the speech sound-source, (b) basic audio quality with the speech sound-source, (c) source localization with the music sound-source, and (d) basic audio quality with the music sound-source. Each box bounds the interquartile range (IQR), with the center bar indicating the median score and the whiskers extending to a maximum of 1.5 × IQR. The v-shaped notches in each box refer to the confidence interval; when the notches of two boxes do not overlap, it can be concluded that the true medians differ.
4) Summary and discussion:
The proposed MW-IRLS method showed an improvement against the PW-CF benchmark in the perceptual criteria of source localization and audio quality for both a speech and a music source. Furthermore, MW-CF also received higher mean scores when reconstructing human speech, and higher median scores for music. When comparing virtual expansion-types, the IRLS expansion was seen to have better quality robustness and localizability for a speech source, but not for a music source. This may be explained by the IRLS matching the sparseness of the single human's speech, but not the natural sound of music, which is normally generated by multiple sound-sources. Nonetheless, this paper focuses on the modeling of secondary virtual sources. No interaction effect between the source model and expansion-type was found. This indicates that the strong perceptual results achieved by the mixedwave methods were not dependent on the expansion-type, and are instead an outcome of the near-field and far-field virtual source mixture. In the next section, we conduct a simulation analysis on the sound fields used in this experiment to gain further insight into properties that may have influenced these strong perceptual results.
V. SIMULATION ANALYSIS
In this section we simulate the same virtual environments that were used in the perception test (Section IV-A3). We examine their pressure and intensity fields to identify factors that may correlate with perceptual performance.
A. Error Metrics
We define the pressure error (PE) and intensity magnitude error (IME) between the true and reproduced sound fields as

\mathrm{PE} = \frac{|P - \tilde{P}|}{|P|} \times 100, \qquad \mathrm{IME} = \frac{\|I - \tilde{I}\|}{\|I\|} \times 100,   (36)

where I = Re(P V*), and V* is the conjugated sound field velocity. The intensity direction error (IDE), which is denoted as the acute angle between the true and reproduced intensity fields [61], is expressed as

\mathrm{IDE} = \frac{100}{\pi} \arccos\!\left( \frac{I \cdot \tilde{I}}{\|I\| \, \|\tilde{I}\|} \right).   (37)

Additionally, for intensity fields, we also illustrate the true and reproduced intensity unit vector difference, I/||I|| − Ĩ/||Ĩ||.
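The metrics (36) and (37) reduce to a few vector operations per evaluation point. A minimal sketch, assuming the percentage scaling implied by the PE [%] and IME [%] axes of the later figures:

```python
import numpy as np

def pressure_error(P_true, P_rep):
    """PE of (36), in percent."""
    return np.abs(P_true - P_rep) / np.abs(P_true) * 100.0

def intensity_magnitude_error(I_true, I_rep):
    """IME of (36), in percent; I_true and I_rep are real 3-vectors."""
    return np.linalg.norm(I_true - I_rep) / np.linalg.norm(I_true) * 100.0

def intensity_direction_error(I_true, I_rep):
    """IDE of (37): acute angle between true and reproduced intensity, scaled to percent."""
    cosang = np.dot(I_true, I_rep) / (np.linalg.norm(I_true) * np.linalg.norm(I_rep))
    return np.arccos(np.clip(cosang, -1.0, 1.0)) / np.pi * 100.0
```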
B. Pressure and Intensity Fields

Fig. 4: (a) True pressure field and (b) intensity field at · Hz in the XY-plane with the point-source at (1, ·, ·) m, where the intensity magnitude is given by the color-map.

Fig. 5: Truncated measurement of (a) the pressure field and (b) the intensity field at · Hz in the XY-plane for the point-source at (1, ·, ·) m, where (c) is the PE and (d) is the IDE.

Fig. 6: PW-CF reproduction of (a) the pressure field and (b) the intensity field at · Hz in the XY-plane, where (c) is the reproduction PE and (d) is the reproduction IDE.

Fig. 7: MW-IRLS reproduction of (a) the pressure field and (b) the intensity field at · Hz in the XY-plane, where (c) is the reproduction PE and (d) is the reproduction IDE.

Figure 4 shows the pressure and intensity fields of the true sound-source at (1, ·, ·) m that we recorded and reproduced virtually in the perception experiment. The truncated recording of this true sound-source is shown in Fig. 5. Immediately we observe the effects of truncation in the recorded pressure field (Fig. 5a), where a distinct near-field pattern is no longer visible. As expected, the recording is seen to be localized spatially within the microphone array, illustrated by the sweet-spot within the PE (Fig. 5c). Similarly, the recorded intensity is also seen to be concentrated about the sweet-spot. Beyond the sweet-spot, truncation error degrades the pressure and intensity accuracy, leading to the perceptual artifacts we wish to resolve by extrapolating a virtual source environment.

In Fig. 6, we observe that the PW-CF reproduction exhibits the same sweet-spot behavior as the truncated recording, where once again the reproduced pressure and intensity are localized to the microphone's region (Fig. 6c). A similar result is also obtained by the MW-CF method (not shown), supporting that the sweet-spot is caused by the closed-form expansion over-approximating the truncated recording.
Fig. 8: Average pressure error over a spherical surface of varying radius at four frequencies for the measured and reproduced sound fields.
Fig. 9: Average intensity magnitude error over a · m spherical surface plotted against frequency.

The PW-CF intensity field is also seen to be non-uniform throughout the virtual environment. It is expected that this may be a dominant factor contributing to the PW-CF's perceptual evaluation.

Figure 7 shows better results for the MW-IRLS reproduction. As intended, the IRLS expansion is seen to relax the sweet-spot constraint (Fig. 7a and c). Similar results are also observed for the PW-IRLS method (not shown), indicating that sparse expansions are able to extend the region of reproduction accuracy. This is believed to aid the perceptual stability of the reproduction as the listener translates further from the original recording position. Furthermore, the MW-IRLS intensity field (Fig. 7b) is shown to have improved uniformity, which leads to better IDE results (Fig. 7d). This uniformity is expected to have contributed to the strong perceptual results achieved by the MW-IRLS method.

C. Pressure and Intensity Error
Fig. 10: Average intensity direction error over a · m spherical surface plotted against frequency.

Fig. 11: BRIR spectral difference between the true (reference) and reproduced (PW-CF, MW-IRLS) signals rendered at the translated position (0, 0.·, ·) m.

We present the averaged PE at various translation distances in Fig. 8. A clear difference in performance is observed at the lower frequencies, where the two IRLS expansions (PW-IRLS and MW-IRLS) are seen to better reproduce the pressure field throughout a · m region. This result corroborates the prior sweet-spot observations, where the IRLS expansions are able to relax spatial constraints. On the other hand, the closed-form expansions are shown to match the PE of the recording, further illustrating that the PW-CF and MW-CF methods over-approximate the truncation artifacts.

All methods are observed to have poor IME at the translation of · m in Fig. 9. At higher frequencies, both MW-CF and MW-IRLS have lower error than their planewave counterparts. However, the IME is still poor, and it is difficult to know if this behavior contributed to the perceptual results. Additionally, large spikes in error are found when the microphone's truncation order increases between the ⌈k|x_Q|⌉ frequency bands. It may be possible to smooth the activation of each band to further improve perceptual stability.

The IDE shows clearer results at the · m translation in Fig. 10. The MW-IRLS reproduction is seen to strongly match the direction of the true sound-source's intensity across the full frequency range. This intensity alignment is expected to have contributed to the perceptual results of the MW-IRLS method. This is in contrast to the PW-CF benchmark, which is seen to follow the recording's poor IDE at lower frequencies.

D. BRIR Response
We measure the reproduction system's response by recording, expanding, and auralizing a sine-sweep signal with the planewave and mixedwave translation methods. This provides the binaural room impulse response (BRIR) of the translated listener in the virtual environment. Figure 11 gives the BRIR spectral difference of the PW-CF and MW-IRLS compared to the reference at the translated position of (0, 0.·, ·) m to the left. The BRIR spectral results show the most substantial difference between the PW-CF and MW-IRLS methods thus far. Below · Hz, the MW-IRLS is observed to have little spectral deviation from the reference BRIR. This suggests that the MW-IRLS accurately reconstructs the sound heard by a translated listener in the true environment. The PW-CF BRIR, however, is seen to deviate significantly from the reference. Therefore, the BRIR spectral differences support that the MW-IRLS offers greater perceptual accuracy, which was observed in the perceptual experiment results.

The BRIR includes the effects of the HRTF processing that is applied to each sound field translation method. This may explain why there is such a large disparity between the planewave and mixedwave BRIR. The mixedwave method adapts the HRTF with the listener's movement, while the planewave method maintains a constant HRTF perspective and shifts the truncated reproduction about the listener. It is the difference between these two HRTF implementations that may have the most significant effect on the BRIR differences and the perceptual experiment results.
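The Fig. 11 measure is simply the magnitude difference, in dB, between the reference and reproduced BRIR spectra. A short sketch (the FFT length and any smoothing are assumptions, as they are not specified in the text):

```python
import numpy as np

def brir_spectral_difference_db(brir_ref, brir_test, n_fft=4096):
    """Per-frequency magnitude difference (dB) between a reference and a reproduced BRIR."""
    H_ref = np.fft.rfft(brir_ref, n_fft)
    H_test = np.fft.rfft(brir_test, n_fft)
    return 20.0 * np.log10(np.abs(H_test) / np.abs(H_ref))
```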
VI. CONCLUSION
Virtual reality technology enhances acoustic real-world reproductions by allowing listeners to perceptually move about the environment. At this time, however, the benchmark planewave method for sound field translation is still limited by inherited microphone constraints. Furthermore, the planewave source model is restricted to the far-field, which results in the listener's HRTF perspective being fixed during translation. As a result, immersion in the planewave environment is degraded by poor source localizability and audible spectral distortions.

We have proposed an alternative source model for translation that enables a sparse virtual environment to contain a mixture of near-field and far-field sources. We compared this proposed mixedwave method against the planewave benchmark through a perceptual MUSHRA experiment and cross-examined the results with numerical simulations. For human speech reproduction, the mixedwave source model improved both localizability and audio quality. Both the closed-form and IRLS expanded mixedwave reproductions were found to provide a more immersive experience. Similar results were also found for a wider band music sound source.

The IRLS expansion was shown to help enlarge the reproduction sweet-spot, and accordingly it scored better perceptually than the closed-form expansion for the speech sound. Additionally, the proposed method showed better intensity direction matching than the benchmark, further corroborating the perceptual results. Finally, we illustrated that the mixedwave's Ambisonic-like binaural rendering allows for greater perceptual accuracy due to lower BRIR spectral error.

We note that this paper focuses on the sole effects of modeling a near-field far-field mixture for translation. As such, we studied an over-simplified acoustic environment in order to make clearer comparisons. We leave the considerations involved with implementing the proposed method as future work. Accounting for acoustic reflections, diffuse sound, multiple sound sources, source directivity, and the methods to separate and process them in a virtual mixedwave environment are left as an open problem. Furthermore, it may be desirable to reproduce the recorded experience along with synthetic sounds, and this should be explored alongside future applications developed for virtual reality.
VII. THANKS
The authors would like to thank Zamir Ben-Hur for guidance in developing the perceptual test, and Shawn Featherly for development of the perceptual test Unity application.

REFERENCES
[1] P. Dodds, S. Amengual Garí, W. Brimijoin, and P. Robinson, "Auralization systems for simulation of augmented reality experiences in virtual environments," in Audio for Virtual, Augmented and Mixed Realities: Proc. ICSA 2019; 5th Intl. Conf. on Spatial Audio, 2019, pp. 29–34.
[2] Y. Suzuki et al., "3d spatial sound systems compatible with human's active listening to realize rich high-level kansei information," Interdisciplinary Information Sciences, vol. 18, no. 2, pp. 71–82, 2012.
[3] J. G. Tylka and E. Y. Choueiri, "Models for evaluating navigational techniques for higher-order ambisonics," in Proc. Meetings on Acoust. ASA, 2017, vol. 30, p. 050009.
[4] S. Amengual Garí, C. Schissler, R. Mehra, S. Featherly, and P. Robinson, "Evaluation of real-time sound propagation engines in a virtual reality framework," in Proc. Intl. Audio Eng. Soc. Conf. on Immersive and Interactive Audio. Audio Engineering Society, 2019.
[5] M. Ziegler et al., "Immersive virtual reality for live-action video using camera arrays," IBC, Amsterdam, Netherlands, 2017.
[6] D. Rivas Méndez, C. Armstrong, J. Stubbs, M. Stiles, and G. Kearney, "Practical recording techniques for music production with six-degrees of freedom virtual reality," in Audio Eng. Soc. Conv. 145. Audio Engineering Society, 2018.
[7] C. D. Salvador, S. Sakamoto, J. Trevino, and Y. Suzuki, "Spatial accuracy of binaural synthesis from rigid spherical microphone array recordings," Acoust. Sci. Technol., vol. 38, no. 1, pp. 23–30, 2017.
[8] J. G. Tylka and E. Y. Choueiri, "Fundamentals of a parametric method for virtual navigation within an array of ambisonics microphones," J. Audio Eng. Soc., vol. 68, no. 3, pp. 120–137, 2020.
[9] J. G. Tylka and E. Y. Choueiri, "Performance of linear extrapolation methods for virtual sound field navigation," J. Audio Eng. Soc., vol. 68, no. 3, pp. 138–156, 2020.
[10] N. Mariette and B. Katz, "Sounddelta: large scale multi-user audio augmented reality," in Proc. of the EAA Symposium on Auralization, 2009, pp. 15–17.
[11] J. G. Tylka and E. Y. Choueiri, "Domains of practical applicability for parametric interpolation methods for virtual sound field navigation," J. Audio Eng. Soc., vol. 67, no. 11, pp. 882–893, 2019.
[12] E. Patricio, A. Ruminski, A. Kuklasinski, L. Januszkiewicz, and T. Zernicki, "Toward six degrees of freedom audio recording and playback using multiple ambisonics sound fields," in Audio Eng. Soc. Conv. 146. Audio Engineering Society, 2019.
[13] P. Samarasinghe, T. Abhayapala, and M. Poletti, "Wavefield analysis over large areas using distributed higher order microphones," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 3, pp. 647–658, 2014.
[14] J. G. Tylka and E. Choueiri, "Soundfield navigation using an array of higher-order ambisonics microphones," in Proc. Intl. Audio Eng. Soc. Conf. on Audio for Virtual and Augmented Reality. Audio Engineering Society, 2016.
[15] Y. Wang and K. Chen, "Translations of spherical harmonics expansion coefficients for a sound field using plane wave expansions," J. Acoust. Soc. Amer., vol. 143, no. 6, pp. 3474–3478, 2018.
[16] O. Thiergart, G. Del Galdo, M. Taseska, and E. Habets, "Geometry-based spatial sound acquisition using distributed microphone arrays," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 12, pp. 2583–2594, 2013.
[17] J. G. Tylka and E. Choueiri, "Comparison of techniques for binaural navigation of higher-order ambisonic soundfields," in Audio Eng. Soc. Conv. 139. Audio Engineering Society, 2015.
[18] M. Noisternig, A. Sontacchi, T. Musil, and R. Holdrich, "A 3d ambisonic based binaural sound reproduction system," in Proc. of 24th Intl. Audio Eng. Soc. Conf. on Multichannel Audio, The New Reality. Audio Engineering Society, 2003.
[19] D. Menzies and M. Al-Akaidi, "Ambisonic synthesis of complex sources," J. Audio Eng. Soc., vol. 55, no. 10, pp. 864–876, 2007.
[20] T. Pihlajamaki and V. Pulkki, "Synthesis of complex sound scenes with transformation of recorded spatial sound in virtual reality," J. Audio Eng. Soc., vol. 63, no. 7/8, pp. 542–551, 2015.
[21] E. Fernandez-Grande, "Sound field reconstruction using a spherical microphone array," J. Acoust. Soc. Amer., vol. 139, no. 3, pp. 1168–1178, 2016.
[22] F. Schultz and S. Spors, "Data-based binaural synthesis including rotational and translatory head-movements," in Proc. of 52nd Intl. Audio Eng. Soc. Conf. on Sound Field Control-Engineering and Perception. Audio Engineering Society, 2013.
[23] R. Duraiswami, Z. Li, D. N. Zotkin, E. Grassi, and N. A. Gumerov, "Plane-wave decomposition analysis for spherical microphone arrays," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. IEEE, 2005, pp. 150–153.
[24] M. A. Poletti, "Three-dimensional surround sound systems based on spherical harmonics," J. Audio Eng. Soc., vol. 53, no. 11, pp. 1004–1025, 2005.
[25] D. B. Ward and T. D. Abhayapala, "Reproduction of a plane-wave sound field using an array of loudspeakers," IEEE Trans. Speech Audio Process., vol. 9, no. 6, pp. 697–707, 2001.
[26] MH Acoustics, "Em32 eigenmike microphone array release notes (v17.0)," 25 Summit Ave, Summit, NJ 07901, USA, 2013.
[27] N. Hahn and S. Spors, "Modal bandwidth reduction in data-based binaural synthesis including translatory head-movements," in Proc. German Annu. Conf. Acoust. (DAGA), 2015, pp. 1122–1125.
[28] N. Hahn and S. Spors, "Physical properties of modal beamforming in the context of data-based sound reproduction," in Audio Eng. Soc. Conv. 139. Audio Engineering Society, 2015.
[29] A. Kuntz and R. Rabenstein, "Limitations in the extrapolation of wave fields from circular measurements," in Eur. Signal Process. Conf. IEEE, 2007, pp. 2331–2335.
[30] F. Winter, F. Schultz, and S. Spors, "Localization properties of data-based binaural synthesis including translatory head-movements," in Proceedings of the Forum Acusticum, Krakow, Poland, 2014, vol. 31.
[31] J. Daniel, "Spatial sound encoding including near field effect: Introducing distance coding filters and a viable, new ambisonic format," in Proc. of 23rd Intl. Audio Eng. Soc. Conf. on Signal Processing in Audio Recording and Reproduction. Audio Engineering Society, 2003.
[32] K. Wakayama, J. Trevino, H. Takada, S. Sakamoto, and Y. Suzuki, "Extended sound field recording using position information of directional sound sources," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. IEEE, 2017, pp. 185–189.
[33] A. Plinge, S. Schlecht, O. Thiergart, T. Robotham, O. Rummukainen, and E. Habets, "Six-degrees-of-freedom binaural audio reproduction of first-order ambisonics with distance information," in Proc. Intl. Audio Eng. Soc. Conf. on Audio for Virtual and Augmented Reality. Audio Engineering Society, 2018.
[34] M. Kentgens, A. Behler, and P. Jax, "Translation of a higher order ambisonics sound scene based on parametric decomposition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 151–155.
[35] D. Menzies and M. Al-Akaidi, "Nearfield binaural synthesis and ambisonics," J. Acoust. Soc. Amer., vol. 121, no. 3, pp. 1559–1563, 2007.
[36] Zylia Sp. z o.o., "ZYLIA ZM-1 microphone," accessed: Feb. 2020.
[37] VisiSonics Corporation, "Visisonics 5/64 audio/visual camera," https://visisonics.com/564avcamera/, accessed: Feb. 2020.
[38] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," in IEEE Intl. Conf. Acoust., Speech, Signal Process. IEEE, 2008, pp. 3869–3872.
[39] S. Emura, "Sound field estimation using two spherical microphone arrays," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. IEEE, 2017, pp. 101–105.
[40] Y. Hu, P. N. Samarasinghe, T. D. Abhayapala, and G. Dickins, "Modeling characteristics of real loudspeakers using various acoustic models: Modal-domain approaches," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. IEEE, 2019, pp. 561–565.
[41] Y. Maeno, Y. Mitsufuji, and T. D. Abhayapala, "Mode domain spatial active noise control using sparse signal representation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. IEEE, 2018, pp. 211–215.
[42] L. Birnie, T. Abhayapala, P. Samarasinghe, and V. Tourbabin, "Sound field translation methods for binaural reproduction," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. IEEE, 2019, pp. 140–144.
[43] ITU Radiocommunication Assembly, "ITU-R BS.1534-3: Method for the subjective assessment of intermediate quality level of audio systems," October 2015.
[44] Z. Ben-Hur, D. L. Alon, B. Rafaely, and R. Mehra, "Loudness stability of binaural sound with spherical harmonic representation of sparse head-related transfer functions," EURASIP J. Audio, Speech, Music Process., vol. 2019, no. 1, p. 5, 2019.
[45] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustic Holography, Academic Press, London, UK, 1999.
[46] H. Chen, T. D. Abhayapala, and W. Zhang, "Theory and design of compact hybrid microphone arrays on two-dimensional planes for three-dimensional soundfield analysis," J. Acoust. Soc. Amer., vol. 138, no. 5, pp. 3081–3092, 2015.
[47] VisiSonics Corporation, "Visisonics audio/visual planar array," https://visisonics.com/audio-visual-planar-array/, accessed: Feb. 2020.
[48] T. D. Abhayapala and D. B. Ward, "Theory and design of high order sound field microphones using spherical microphone array," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. IEEE, 2002, vol. 2, pp. II-1949.
[49] B. Rafaely, "Analysis and design of spherical microphone arrays," IEEE Trans. Speech Audio Process., vol. 13, no. 1, pp. 135–143, 2005.
[50] R. A. Kennedy, P. Sadeghi, T. D. Abhayapala, and H. M. Jones, "Intrinsic limits of dimensionality and richness in random multipath fields," IEEE Trans. Signal Process., vol. 55, no. 6, pp. 2542–2556, 2007.
[51] D. N. Zotkin, R. Duraiswami, and N. A. Gumerov, "Regularized hrtf fitting using spherical harmonics," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. IEEE, 2009, pp. 257–260.
[52] W. Zhang, T. D. Abhayapala, R. A. Kennedy, and R. Duraiswami, "Insights into head-related transfer function: Spatial dimensionality and continuous representation," J. Acoust. Soc. Amer., vol. 127, no. 4, pp. 2347–2357, 2010.
[53] G. N. Lilis, D. Angelosante, and G. B. Giannakis, "Sound field reproduction using the lasso," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 8, pp. 1902–1912, 2010.
[54] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Stat. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[55] E. J. Candès and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, 2008.
[56] A. Lindau, T. Hohn, and S. Weinzierl, "Binaural resynthesis for comparative studies of acoustical environments," in Audio Eng. Soc. Conv. 122. Audio Engineering Society, 2007.
[57] F. Brinkmann et al., "A cross-evaluated database of measured and simulated hrtfs including 3d head meshes, anthropometric features, and headphone impulse responses," J. Audio Eng. Soc., vol. 67, no. 9, pp. 705–718, 2019.
[58] F. Brinkmann et al., "The hutubs head-related transfer function (hrtf) database," [online] http://dx.doi.org/10.14279/depositonce-8487, accessed: Feb. 2020.
[59] B. Rafaely and M. Kleider, "Spherical microphone array beam steering using wigner-d weighting," IEEE Signal Process. Lett., vol. 15, pp. 417–420, 2008.
[60] J. Fliege and U. Maier, "The distribution of points on the sphere and corresponding cubature formulae," IMA J. Numer. Anal., vol. 19, no. 2, pp. 317–334, 1999.
[61] M. Shin, P. A. Nelson, F. M. Fazi, and J. Seo, "Velocity controlled sound field reproduction by non-uniformly spaced loudspeakers,"