Deep Sound Field Reconstruction in Real Rooms: Introducing the ISOBEL Sound Field Dataset
Miklas Strøm Kristoffersen,¹,² Martin Bo Møller, Pablo Martínez-Nuevo, and Jan Østergaard
¹Research Department, Bang & Olufsen a/s, Struer, Denmark
²AI and Sound Section, Department of Electronic Systems, Aalborg University, Aalborg, Denmark
Knowledge of loudspeaker responses is useful in a number of applications where a sound system is located inside a room that alters the listening experience depending on position within the room. Acquisition of sound fields for sound sources located in reverberant rooms can be achieved through labor-intensive measurements of impulse response functions covering the room, or alternatively by means of reconstruction methods which can potentially require significantly fewer measurements. This paper extends evaluations of sound field reconstruction at low frequencies by introducing a dataset with measurements from four real rooms. The ISOBEL Sound Field dataset is publicly available and aims to bridge the gap between synthetic and real-world sound fields in rectangular rooms. Moreover, the paper advances on a recent deep learning-based method for sound field reconstruction using a very low number of microphones, and proposes an approach for modeling both magnitude and phase response in a U-Net-like neural network architecture. The complex-valued sound field reconstruction demonstrates that the estimated room transfer functions are of high enough accuracy to allow for personalized sound zones with contrast ratios comparable to ideal room transfer functions using 15 microphones below 150 Hz.
The following article has been submitted to the Journal of the Acoustical Society of America. After it is published, it will be found at http://asa.scitation.org/journal/jas.
I. INTRODUCTION
The response of a sound system in a room primarily varies with the room itself, the position of the loudspeakers, and the listening position. In order to deliver the intended sound system behavior to listeners, it is necessary to know about and compensate for this effect. Applications include, among others, room equalization (Cecchi et al., 2018; Karjalainen et al., 2001; Radlovic et al., 2000), virtual reality sound field navigation (Tylka and Choueiri, 2015), source localization (Nowakowski et al., 2017), and spatial sound field reproduction over predefined or dynamic regions of space, also referred to as sound zones (Betlehem et al., 2015; Møller and Østergaard, 2020). An approach to achieve this is to measure the loudspeaker response at the desired listening locations and adjust the sound system accordingly. However, measuring impulse responses on a sufficiently fine-grained grid in an entire room quickly becomes time-consuming, extensive manual labor that is not desirable. Instead, methods have been developed for the purpose of estimating impulse responses in a room based on a limited number of actual measurements. These methods are also referred to as sound field reconstruction and virtual microphones. The task of reconstructing room impulse responses in positions that have not been measured directly is an active research field which has been explored in several studies (Ajdler et al., 2006; Antonello et al., 2017; Fernandez-Grande, 2019; Mignot et al., 2014; Verburg and Fernandez-Grande, 2018; Vu and Lissek, 2020).

Machine learning, and in particular deep learning, is currently receiving widespread attention across scientific domains; as an example within room acoustics, it has been used to estimate acoustical parameters of rooms (Genovese et al., 2019; Yu and Kleijn, 2021). In recent work, deep learning-based methods were introduced for sound field reconstruction in reverberant rectangular rooms (Lluís et al., 2020). This data-driven approach is able to learn sound field magnitude characteristics from large volumes of simulated data without prior information about room characteristics, such as room dimensions and reverberation time. The method is computationally efficient and works with irregularly and arbitrarily distributed microphones for which there is no requirement of knowing absolute locations in Euclidean space, in contrast to previous solutions. Furthermore, the reconstruction proves to work with a very low number of microphones, making real-world implementation feasible. To assess the issue of real-world sound field reconstruction, the method is evaluated using measurements in a single room (Lluís et al., 2020). However, it is still unknown how much knowledge is transferred from the simulated to the real environment, as well as how well the model generalizes to different real rooms. This is a general problem in deep learning applications that rely on labor-intensive data collection, which is our motivation for publishing an open access dataset of real-world sound fields in a diverse set of rooms.

This paper studies sound field reconstruction at low frequencies in rectangular rooms with a low number of microphones. The main contributions are:

• This paper introduces a sound field dataset, which is publicly available for development and evaluation of sound field reconstruction methods in four real rooms. It is our hope that the ISOBEL Sound Field dataset will help the community in benchmarking and comparing state-of-the-art results.

• We assess the real-world performance of deep learning-based sound field magnitude reconstruction trained on simulated sound fields. For this purpose, we consider low frequencies, since low-frequency room modes can significantly alter the listening experience. Furthermore, we are interested in using a very low number of microphones.

• Moreover, we extend the deep learning-based sound field reconstruction to cover complex-valued inputs, i.e., both the magnitude and the phase of a sound field. Evaluation is performed in both simulated and real rooms, where a performance gap is observed. We argue why complex sound field reconstruction may have more difficulties in transferring useful knowledge from synthetic to real data.

• Lastly, we demonstrate the application of complex-valued sound field reconstruction within the field of sound zone control. Specifically, it is shown that sound fields reconstructed from as little as five microphones pose as valuable inputs to acoustic contrast control.

The paper is organized as follows: Section II introduces the concept of sound field reconstruction. Details of measurements from real rooms are presented in Section III. In Section IV, we focus on the problem of reconstructing the magnitude of sound fields, while Section V extends the model to complex-valued sound fields. Finally, Section VI investigates the application of sound zones through sound field reconstruction.
II. SOUND FIELD RECONSTRUCTION
Our approach towards the sound field reconstruction problem is based on the observation that the acoustic pressure in a room can be described using a three-dimensional regular grid of points defining a three-dimensional discrete function. The approach, specifically for the purpose of magnitude reconstruction, was introduced in (Lluís et al., 2020). First, let $R = [0, l_x] \times [0, l_y] \times [0, l_z]$ denote a rectangular room, where $l_x, l_y, l_z > 0$ are the room dimensions, and let $\mathcal{D}_o$ denote a regular grid of points in $R$. However, for the sake of simplicity, we reduce the three-dimensional problem to a two-dimensional reconstruction on horizontal planes. The two-dimensional grid at a constant height $z_o$ is defined as

$$\mathcal{D}_o := \left\{ \left( \frac{i\, l_x}{I-1}, \frac{j\, l_y}{J-1}, z_o \right) \right\}_{i,j} \tag{1}$$

for $z_o \in [0, l_z]$, $i = 0, \ldots, I-1$, $j = 0, \ldots, J-1$, and integers $I, J \geq 2$. Note, though, that the dataset collected for this study, which we will introduce in Section III, does in fact contain multiple horizontal planes at different heights. We keep the investigation of three-dimensional reconstruction for future work, and frame the core challenge of this paper as the estimation of sound pressure in two-dimensional horizontal planes.

The function that we seek to reconstruct on this grid is the Fourier transform of the sound field in a frequency band that covers the low frequencies. The complex-valued frequency-domain sound field calculated using the Fourier transform is given by

$$s(\mathbf{r}, \omega) := \int_{\mathbb{R}} p(\mathbf{r}, t)\, e^{-j\omega t}\, \mathrm{d}t \tag{2}$$

where $\omega \in \mathbb{R}$ is a given excitation frequency, and $p(\mathbf{r}, t)$ denotes the spatio-temporal sound field with $\mathbf{r} \in R$. We refer to the real and imaginary parts of the sound field using $s_{\mathrm{Re}}(\mathbf{r}, \omega)$ and $s_{\mathrm{Im}}(\mathbf{r}, \omega)$, respectively. Note that $s$ is defined as the magnitude of the Fourier transform in (Lluís et al., 2020). Instead, for magnitude reconstruction, we introduce the magnitude of the sound field

$$|s(\mathbf{r}, \omega)| := \left| \int_{\mathbb{R}} p(\mathbf{r}, t)\, e^{-j\omega t}\, \mathrm{d}t \right| \tag{3}$$

for $\omega \in \mathbb{R}$ and $\mathbf{r} \in R$.

The procedure for reconstructing $s(\mathbf{r}, \omega)$ on $\mathcal{D}_o$ takes its starting point from actual observations of the sound field in select positions of the grid. We refer to the collected set of these available sample points as $\mathcal{S}_o$, which we further define to be a subset of the full grid, that is, $\mathcal{S}_o \subseteq \mathcal{D}_o$. The cardinality $|\mathcal{S}_o|$ of the set $\mathcal{S}_o$ is the number of available sample points, which we will also refer to as the number of microphones $n_{\mathrm{mic}}$ in later experiments. We define the samples available to the reconstruction algorithm as

$$\{ s(\mathbf{r}, \omega) \}_{\mathbf{r} \in \mathcal{S}_o \subseteq \mathcal{D}_o}. \tag{4}$$

An important aspect of these definitions is that the grid is unitless and positions can be defined in relative terms. That is, when sampling a point in the grid, only the relative position within the grid, and hence the room, needs to be known. This allows us to relax the data collection compared to alternative methods that require absolute locations. Another important element to consider is that the sampling pattern of $\mathcal{S}_o$ can form any arrangement within $\mathcal{D}_o$ as long as $1 \leq |\mathcal{S}_o| \leq |\mathcal{D}_o|$. As an example, this means that sampled points can be irregularly distributed spatially in a room.

Situations may arise where the sound field resolution, as defined by $l_x$, $I$, $l_y$, and $J$, is too coarse. As an example, consider rooms that are either very long, wide, or in general large. Another example includes applications where fine-grained variations within a sound field are of importance. To compensate for this effect, we allow the reconstruction to base its output on another grid than $\mathcal{D}_o$. Such a domain will typically be an upsampling of the original grid, but it can similarly be defined with other transformations, e.g., downsampling. Specifically, we define the grid as

$$\mathcal{D}_o^{L,P} := \left\{ \left( \frac{i\, l_x}{IL-1}, \frac{j\, l_y}{JP-1}, z_o \right) \right\}_{i,j} \tag{5}$$

where $i = 0, \ldots, IL-1$, $j = 0, \ldots, JP-1$, and $L, P$ must be chosen such that $IL, JP \in \mathbb{Z}^+$. Note that a value larger than one for either $L$ or $P$ results in an upsampling in the respective dimension.

The task of the sound field reconstruction is then to estimate the sound field on the grid $\mathcal{D}_o^{L,P}$ based on the sampled points $\mathcal{S}_o$. In particular, the objective of the reconstruction algorithm is to learn parameters $w$ given

$$g_w : \mathbb{C}^{|\mathcal{S}_o| K} \to \mathbb{C}^{|\mathcal{D}_o^{L,P}| K} \tag{6}$$
$$\{ s(\mathbf{r}, \omega_k) \}_{\mathbf{r} \in \mathcal{S}_o,\, \omega_k \in \Omega} \mapsto \{ \hat{s}(\mathbf{r}, \omega_k) \}_{\mathbf{r} \in \mathcal{D}_o^{L,P},\, \omega_k \in \Omega}$$

where $g_w$ is an estimator and $\Omega = \{\omega_k\}_{k=1}^{K}$ is the set of frequencies at which the sound field will be reconstructed. The remainder of the paper describes the procedure for learning the parameters $w$ using deep learning-based methods.
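To make the unitless grid and irregular sampling pattern concrete, the following NumPy sketch builds the relative coordinates of $\mathcal{D}_o^{L,P}$ and draws a random sampling mask; the function name and defaults are illustrative and not from the original implementation.

```python
import numpy as np

def make_grid_and_mask(I=8, J=8, L=4, P=4, n_mic=15, seed=0):
    """Sketch: relative grid coordinates of D_o^{L,P} (Eq. (5)) and an
    irregular sampling pattern S_o of n_mic points on that grid."""
    # Unitless coordinates: x_i = i/(IL-1), y_j = j/(JP-1); multiply by
    # l_x, l_y to obtain metric positions in a specific room.
    xi = np.arange(I * L) / (I * L - 1)
    yj = np.arange(J * P) / (J * P - 1)
    # Draw n_mic grid points without replacement as the observed set S_o.
    rng = np.random.default_rng(seed)
    idx = rng.choice(I * L * J * P, size=n_mic, replace=False)
    mask = np.zeros((I * L, J * P), dtype=bool)
    mask[np.unravel_index(idx, mask.shape)] = True
    return xi, yj, mask
```

Because only relative positions enter, the same mask can be reused for rooms of any dimensions, which is exactly what relaxes the data collection requirements discussed above.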
A. Evaluation Metrics

The successfulness of the estimator is quantitatively judged using the normalized mean square error (NMSE) at each frequency point in $\{\omega_k\}_{k=1}^{K}$

$$\mathrm{NMSE}_k = \frac{\sum_{\mathbf{r} \in \mathcal{D}_o^{L,P}} | s(\mathbf{r}, \omega_k) - \hat{s}(\mathbf{r}, \omega_k) |^2}{\sum_{\mathbf{r} \in \mathcal{D}_o^{L,P}} | s(\mathbf{r}, \omega_k) |^2}. \tag{7}$$

The NMSE provides an average error over all positions in the grid between reconstructed and original sound fields for a single room at a single frequency. We also introduce an average NMSE, which is the NMSE performance averaged over all frequencies of interest as well as over all realizations from $M$ trials, e.g., multiple rooms:

$$\mathrm{MNMSE} = \frac{1}{MK} \sum_{m=1}^{M} \sum_{k=1}^{K} \frac{\sum_{\mathbf{r} \in \mathcal{D}_o^{L,P}} | s_m(\mathbf{r}, \omega_k) - \hat{s}_m(\mathbf{r}, \omega_k) |^2}{\sum_{\mathbf{r} \in \mathcal{D}_o^{L,P}} | s_m(\mathbf{r}, \omega_k) |^2}. \tag{8}$$

This measure serves as an overall indication of the accuracy of a model, whereas the NMSE$_k$ allows deeper insight into model behavior at different frequencies. Note that the $M$ trials are specific to each experiment and will be described accordingly.
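A direct translation of Eqs. (7) and (8) into NumPy could look as follows; the array shapes and names are our own convention, not taken from the paper's code.

```python
import numpy as np

def nmse_k(s_true, s_hat):
    """NMSE per frequency, Eq. (7). s_true, s_hat: complex arrays of
    shape (n_positions, K); returns a length-K vector (linear scale)."""
    num = np.sum(np.abs(s_true - s_hat) ** 2, axis=0)
    den = np.sum(np.abs(s_true) ** 2, axis=0)
    return num / den

def mnmse(s_true_trials, s_hat_trials):
    """MNMSE, Eq. (8): mean of NMSE_k over the K frequencies and the M
    trials. Inputs: lists of (n_positions, K) arrays, one per trial."""
    per_trial = [nmse_k(t, h) for t, h in zip(s_true_trials, s_hat_trials)]
    return float(np.mean(per_trial))

# The figures and tables report NMSE in dB:
# nmse_db = 10 * np.log10(nmse_k(s_true, s_hat))
```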
III. THE ISOBEL SOUND FIELD DATASET

A major contribution of this paper is the ISOBEL Sound Field dataset, which is released as open access alongside the manuscript. The intended purpose is to use the measurements from real rooms for the evaluation of sound field reconstruction in a diverse set of rooms. Note that the room-wide measurements of room impulse responses have several other use-cases that will not be further investigated in this paper, but we encourage use outside sound field reconstruction as well. This section details the dataset and the measurement procedure.

The dataset consists of measurements from four different rooms, as specified in Table I and depicted in Fig. 1. The data collection is an extension of the real room measured in (Lluís et al., 2020), which is included in the ISOBEL Sound Field dataset as Room B for simple access to all measured rooms. The rooms are located at Aalborg University, Aalborg, Denmark, and Bang & Olufsen a/s, Struer, Denmark. The rooms have significantly different acoustic properties and also vary in size. Two types of measurements are conducted in each room: 1) reverberation time; 2) sound field. However, only the sound field measurements are released as part of the dataset.

The reverberation times are measured in conformity with ISO 3382-2 (ISO 3382-2:2008, 2008) and calculated based on the resulting impulse responses using backward integration and least-squares best fit evaluation of the decay curves. The reverberation times reported in the table are the arithmetic averages of 1/3 octave T estimates in the frequency range 50-316 Hz.

The sound field measurements are performed on a 32 by 32 grid with sample points distributed uniformly along the length and width of each room. That is, a total of 1024 positions are measured in each room if possible, but in some cases it is not feasible to measure all positions due to e.g. obstacles. The horizontal grids are measured at four different heights: 1, 1.3, 1.6, and 1.9 meters above the floor. This is achieved using the microphone rig depicted in Fig. 1. Two 10 inch loudspeakers are used to acquire sound fields from two different source positions in each room. Both loudspeakers are placed on the floor, one in a corner and one in an arbitrary position. The sound sources are kept in the same position, while the microphones are moved around the room to record impulse responses. For each microphone position in the grid, the two sources play logarithmic sine sweeps in the frequency range 0.1-24,000 Hz followed by a quiet tail (Farina, 2000). We use a sampling frequency of 48,000 Hz. The equipment includes, among others, four G.R.A.S. 40AZ prepolarized free-field microphones connected to four G.R.A.S. 26CC CCP standard preamplifiers and an RME Fireface UFX+ sound card. The four microphones are level calibrated at 1,000 Hz using a Brüel & Kjær sound calibrator type 4231 prior to the measurements.
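For reference, an exponential (logarithmic) sine sweep of the kind described by Farina (2000) can be generated as in the sketch below; the sweep duration and quiet-tail length are our assumptions, as the text above does not state them.

```python
import numpy as np

def log_sweep(f1=0.1, f2=24_000.0, T=10.0, fs=48_000):
    """Exponential sine sweep after Farina (2000). f1, f2: start/stop
    frequencies in Hz; T: sweep duration in seconds (assumed here);
    fs: sampling frequency, matching the 48,000 Hz used above."""
    t = np.arange(int(T * fs)) / fs
    R = np.log(f2 / f1)
    # Instantaneous phase sweeps exponentially from f1 to f2.
    sweep = np.sin(2 * np.pi * f1 * T / R * (np.exp(t * R / T) - 1.0))
    tail = np.zeros(int(2.0 * fs))  # quiet tail to capture the room decay
    return np.concatenate([sweep, tail])
```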
TABLE I. Room characteristics in the ISOBEL Sound Field dataset. The reverberation times are the arithmetic averages of 1/3 octave T estimates in the frequency range 50-316 Hz.

Room        Dim. [m]              Size [m²/m³]   T [s]
Room B      4.16 x 6.46 x 2.30     27/ 62        0.39
VR Lab      6.98 x 8.12 x 3.03     57/172        0.37
List. Room  4.14 x 7.80 x 2.78     32/ 90        0.80
Prod. Room  9.13 x 12.03 x 2.60   110/286        0.77
FIG. 1. Left: Rig with four microphones. Rooms from top left to bottom right: Room B, VR Lab, Listening Room, and Product Room.
IV. SOUND FIELD MAGNITUDE RECONSTRUCTION
In the previous sections, we have introduced the problem of reconstructing sound fields on two-dimensional grids in rectangular rooms, and introduced a real-world dataset specifically for the evaluation of estimators solving such a problem. In recent work, (Lluís et al., 2020) showed that the problem fits within the context of deep learning-based methods for image reconstruction, specifically the tasks of inpainting (Bertalmio et al., 2000; Liu et al., 2018) and super-resolution (Dong et al., 2016; Ledig et al., 2017). These can be paralleled to the tasks of filling in the grid points that are not measured in the sound fields, $\mathcal{D}_o^{L,P} \setminus \mathcal{S}_o$, as well as upsampling the grid resolution to achieve fine-grained variations in sound fields. One realization is that these methods are designed to work with real-valued images. To accommodate this, (Lluís et al., 2020) propose to reconstruct only the magnitude of the sound field, i.e., $|s(\mathbf{r}, \omega)|$, using a U-Net-like architecture (Ronneberger et al., 2015).

To this end, the sampled grids are defined as tensors together with masks specifying which positions are measured (Lluís et al., 2020). As an example, $\{|s(\mathbf{r}, \omega_k)|\}_{\mathbf{r} \in \mathcal{D}_o^{L,P}, k}$ can be constructed as a tensor of the form $\mathbf{S}_{\mathrm{mag}} \in \mathbb{R}^{IL \times JP \times K}$. The network is trained using a large number of simulated realizations of rooms, as will be described in the following section. For the experiments, we are interested in assessing the ability of the model to generalize to a wide range of real rooms.
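As an illustration of this tensor-plus-mask input format, a minimal sketch (with hypothetical names) might zero out unobserved grid points and carry the mask as a parallel tensor:

```python
import numpy as np

def pack_magnitude_input(s_mag, mask):
    """Sketch of the input packing. s_mag: (IL, JP, K) array with
    |s(r, w_k)| on the full grid; mask: (IL, JP) boolean, True at the
    measured points S_o. Unobserved entries are zeroed, and the mask is
    broadcast to one channel per frequency slice for the network."""
    K = s_mag.shape[-1]
    M = np.repeat(mask[..., None].astype(s_mag.dtype), K, axis=-1)
    S = s_mag * M  # irregular "holes" at grid points that were not measured
    return S, M
```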
A. Simulation of Sound Fields for Training Data

Green's function can be used to approximate sound fields in rectangular rooms that are lightly damped (Jacobsen and Juhl, 2013). The function provides a solution as an infinite summation of room modes in the three dimensions of a room, $x$, $y$, and $z$. It is defined as follows

$$G(\mathbf{r}, \mathbf{r}_0, \omega) \approx -\frac{1}{V} \sum_{N} \frac{\psi_N(\mathbf{r})\, \psi_N(\mathbf{r}_0)}{(\omega/c)^2 - (\omega_N/c)^2 - j\omega/(\tau_N c^2)} \tag{9}$$

where $\sum_N = \sum_{n_x=0}^{\infty} \sum_{n_y=0}^{\infty} \sum_{n_z=0}^{\infty}$ denotes, for compactness, summation across modal orders in the three dimensions of the room, and similarly the triplet of integers $(n_x, n_y, n_z)$ is represented by $N$. Furthermore, $V$ denotes the volume of the room, $\omega_N$ represents the angular resonance frequency of the mode associated with a specific $N$, the shape of the mode is denoted $\psi_N(\cdot)$, $\tau_N$ is the time constant of the mode, and $c$ is the speed of sound. Assuming rigid boundaries, the shape is determined using the expression (Jacobsen and Juhl, 2013)

$$\psi_N(\mathbf{r}) = \Lambda_N \cos\left(\frac{n_x \pi x}{l_x}\right) \cos\left(\frac{n_y \pi y}{l_y}\right) \cos\left(\frac{n_z \pi z}{l_z}\right). \tag{10}$$

Here, $\Lambda_N = \sqrt{\epsilon_x \epsilon_y \epsilon_z}$ are constants used for normalization, with $\epsilon_0 = 1$ and $\epsilon_1 = \epsilon_2 = \cdots = 2$. Using Sabine's equation, the absorption coefficient is calculated and used to determine the time constant of each mode. This is done by assuming that the surfaces of a room have a uniform distribution of absorption.
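A compact sketch of Eqs. (9) and (10) is given below. It truncates the modal sum at a fixed order and, as a simplification, uses a single time constant for all modes instead of the per-mode constants derived via Sabine's equation in the text; all parameter defaults are illustrative.

```python
import itertools
import numpy as np

def greens_modal(r, r0, f, room=(4.16, 6.46, 2.30), T60=0.39,
                 c=343.0, n_max=12):
    """Modal-sum Green's function, Eqs. (9)-(10), for a lightly damped
    rectangular room with rigid walls. tau = T60 / (3 ln 10) is a single
    shared time constant (a simplification of the per-mode values)."""
    lx, ly, lz = room
    V = lx * ly * lz
    w = 2 * np.pi * f
    tau = T60 / (3 * np.log(10))
    G = 0.0 + 0.0j
    for nx, ny, nz in itertools.product(range(n_max + 1), repeat=3):
        eps = np.prod([1.0 if m == 0 else 2.0 for m in (nx, ny, nz)])
        wn = c * np.pi * np.sqrt((nx / lx) ** 2 + (ny / ly) ** 2
                                 + (nz / lz) ** 2)
        def psi(p):  # Eq. (10): mode shape at point p = (x, y, z)
            return np.sqrt(eps) * (np.cos(nx * np.pi * p[0] / lx)
                                   * np.cos(ny * np.pi * p[1] / ly)
                                   * np.cos(nz * np.pi * p[2] / lz))
        denom = (w / c) ** 2 - (wn / c) ** 2 - 1j * w / (tau * c ** 2)
        G += psi(r) * psi(r0) / denom
    return -G / V

# Example: pressure response at 60 Hz between two points in Room B.
# p = greens_modal((1.0, 1.5, 1.0), (0.1, 0.1, 0.0), f=60.0)
```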
In the following experiments, two sets of training data are used. The first dataset is introduced in (Lluís et al., 2020) and consists of 5,000 rectangular rooms. The room dimensions are sampled randomly in accordance with the recommendations for listening rooms in ITU-R BS.1116-3 (ITU-R BS.1116-3, 2015). The dataset uses a constant reverberation time T of 0.6 s and only includes room modes in the $x$ and $y$ dimensions, i.e., $n_z = 0$.

The second dataset consists of 20,000 rectangular rooms. Room dimensions are uniformly sampled with $V \sim \mathcal{U}(50, \cdot)$, $l_x \sim \mathcal{U}(3.{\cdot}, \cdot)$, $l_z \sim \mathcal{U}(1.{\cdot}, \cdot)$, and $l_y = V/(l_x l_z)$. Compared to the first dataset, the room dimensions span a larger range and allow us to represent e.g. the Product Room, which is not included in the original training data. The dataset uses reverberation times $T$ sampled from $\mathcal{U}(0.{\cdot}, \cdot)$ and includes room modes in the $x$-, $y$-, and $z$-dimensions.

For both datasets, a grid $\mathcal{D}_o^{L,P}$ is defined with $I = J = 8$ and $L = P = 4$, which effectively divides a sound field into 32x32 uniformly-spaced microphone positions. Using this grid, the magnitude of the sound field is reconstructed at 1/12 octave center-frequency resolution in the range [30, 300] Hz. Simulations are specified to include all room modes with a resonance frequency below 400 Hz, which means that there is a total of $K = 40$ frequency slices.

FIG. 2. NMSE in dB of U-Net-based magnitude reconstruction in the four measured rooms with $n_{\mathrm{mic}} = 15$ using the original pretrained model presented in (Lluís et al., 2020).

FIG. 3. NMSE in dB of U-Net-based magnitude reconstruction in the four measured rooms with $n_{\mathrm{mic}} = 15$ using the model presented in (Lluís et al., 2020) trained using the extended dataset.
B. Experiments on the ISOBEL Sound Field Dataset

The U-Net-like architecture has shown promising results on simulated data and on measurements from a single real room (Lluís et al., 2020). In the following experiments, we expose the model to the ISOBEL Sound Field dataset. We include results from the original model, as well as a model built around a similar architecture but using the extended training data with a larger range of room dimensions and reverberation characteristics. We investigate the performance of the model trained with the two different simulated datasets in the four rooms included in the real-world dataset. Special attention is paid to the number of available samples, i.e., the number of microphones $n_{\mathrm{mic}}$. We are mainly interested in settings with a very low number of microphones. In particular, we show results for 5, 15, and 25 microphones in the rooms with a total of 32 × 32 = 1024 available positions. In each room, a total of 40 different and randomly sampled realizations of microphone positions $\mathcal{S}_o$ are used for each value of $n_{\mathrm{mic}}$. We report the average performance across the 40 realizations, and use the source located in one of the corners of each room.

Fig. 2 and Fig. 3 show NMSE$_k$ results for 15 microphones of the model trained with the original and the extended datasets, respectively. It is clear that the model trained with the original dataset does not generalize well to all the rooms. This behavior is expected, since the training data are not designed to represent rooms that fall outside the recommendations for listening room dimensions. On the contrary, the extended training data are motivated by encompassing a wider selection of rooms, which also shows in the results for e.g. the Product Room. One important observation in this regard is that performance does not decrease in rooms that are already represented in the simulated data when more diverse simulated rooms are included, which can e.g. be seen from the performance in Room B. This result indicates that the capacity of the model is sufficient for generalizing to a wide range of diverse rooms and room
acoustic characteristics, given that the model is provided with ample training samples.

TABLE II. MNMSE in dB with M = 40 different and randomly sampled realizations of $\mathcal{S}_o$ for each room in the ISOBEL SF dataset. A lower score is better.

                       n_mic
Room        Model     5      15      25
Room B      Orig.  -6.33   -8.71   -9.62
            Ext.   -6.27   -8.84  -10.25
VR Lab      Orig.  -4.01   -5.08   -5.63
            Ext.   -4.12   -6.78   -8.05
List. Room  Orig.  -4.38   -6.92   -7.94
            Ext.   -5.00   -7.61   -8.44
Prod. Room  Orig.  -3.89   -4.91   -5.55
            Ext.   -5.18   -6.67   -7.73

Table II details the MNMSE results, which are the NMSE results averaged across the $K = 40$ frequencies and the $M = 40$ realizations of $\mathcal{S}_o$. The MNMSE results for $n_{\mathrm{mic}} = 15$ condense the NMSE$_k$ results shown in Figs. 2 and 3. The scores in the table reiterate the observations from the figures: performance is improved with the extended training data, for some rooms in particular, while performance is maintained in the other rooms. Interestingly, there seems to be a tendency of more pronounced improvements with a larger number of microphones. We attribute this effect to similar observations within classical methods, where, as the number of microphones increases, the relative improvement for reconstruction is higher at low frequencies as opposed to the high-frequency range (Ajdler et al., 2006; Lluís et al., 2020).

In summary, the deep learning-based model is confirmed to possess the ability to generalize to a diverse set of real rooms for sound field magnitude reconstruction. Based solely on training with simulated data, these promising results motivate further investigations, e.g., of reconstructing the complex-valued sound fields.

FIG. 4. Architecture of the U-Net-like convolutional neural network proposed for complex sound field reconstruction. $\mathbf{S}$ is the tensor with real and imaginary sound fields concatenated along the frequency dimension, $\mathbf{M}$ is the mask tensor, and $\hat{\mathbf{S}}$ is the reconstructed sound field tensor.

V. COMPLEX SOUND FIELD RECONSTRUCTION
We propose to extend the U-Net-based model to work with complex-valued room transfer functions (RTFs). Reconstruction of both the magnitude and phase of sound fields enables new opportunities, such as the application of sound zones, a topic which we investigate in Section VI.

The proposed model is based on the model designed to work with the magnitude of sound fields. Note that deep learning-based models that work directly on complex-valued inputs have been introduced, e.g., within Transformers (Kim et al., 2020; Yang et al., 2020), but in this paper we instead choose to process the sound fields such that the U-Net-based model receives real-valued inputs. Specifically, we present the model with the real and imaginary parts of sound fields separately. That is, where the magnitude-based model receives as input $\{|s(\mathbf{r}, \omega_k)|\}_{\mathbf{r} \in \mathcal{D}_o^{L,P}, k}$ in the tensor form $\mathbf{S}_{\mathrm{mag}} \in \mathbb{R}^{IL \times JP \times K}$, the complex-based model instead receives a concatenation of the real and imaginary sound fields. Specifically, using the real sound field $\{s_{\mathrm{Re}}(\mathbf{r}, \omega_k)\}_{\mathbf{r} \in \mathcal{D}_o^{L,P}, k}$ with the tensor form $\mathbf{S}_{\mathrm{Re}} \in \mathbb{R}^{IL \times JP \times K}$, and similarly the imaginary sound field tensor $\mathbf{S}_{\mathrm{Im}} \in \mathbb{R}^{IL \times JP \times K}$, we define the concatenated input

$$\mathbf{S} := [\mathbf{S}_{\mathrm{Re}}\ \mathbf{S}_{\mathrm{Im}}], \tag{11}$$

where $\mathbf{S} \in \mathbb{R}^{IL \times JP \times 2K}$ is the resulting tensor with real and imaginary sound fields concatenated along the frequency dimension. Note that the complex-valued sound field is easily recovered from this tensor form. In addition, we define a mask tensor $\mathbf{M}$ computed from $\mathcal{S}_o$ and $\mathcal{D}_o^{L,P}$.

We follow the pre- and postprocessing steps as described in (Lluís et al., 2020), which entail completion, scaling, upsampling, mask generation, and rescaling based on linear regression. These steps are, however, adjusted such that they operate on a tensor that has doubled in size from $K$ to $2K$ in the third dimension. Furthermore, we have observed significant improvements by changing the min-max scaling of the input to a max scaling that takes into account both real and imaginary parts for each frequency slice. Specifically:

$$s_{\mathrm{Re},s}(\mathbf{r}, \omega_k) := \frac{s_{\mathrm{Re}}(\mathbf{r}, \omega_k)}{\max_{\mathbf{r} \in \mathcal{S}_o} \left( |s_{\mathrm{Re}}(\mathbf{r}, \omega_k)|, |s_{\mathrm{Im}}(\mathbf{r}, \omega_k)| \right)} \tag{12}$$

$$s_{\mathrm{Im},s}(\mathbf{r}, \omega_k) := \frac{s_{\mathrm{Im}}(\mathbf{r}, \omega_k)}{\max_{\mathbf{r} \in \mathcal{S}_o} \left( |s_{\mathrm{Re}}(\mathbf{r}, \omega_k)|, |s_{\mathrm{Im}}(\mathbf{r}, \omega_k)| \right)} \tag{13}$$

for each $\omega_k$. Note that this alters the scaling operation from working in the range [0,1] to working in [-1,1]. The motivation for doing so is that values can be negative, in contrast to the real values from the magnitude. By using max scaling, we ensure that zero will not shift between realizations.
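In code, the max scaling of Eqs. (12) and (13) amounts to one scale factor per frequency slice, computed over the observed points only; a NumPy sketch with our own naming:

```python
import numpy as np

def max_scale(S_re, S_im, mask):
    """Per-frequency max scaling of Eqs. (12)-(13). S_re, S_im:
    (IL, JP, K) real/imaginary parts; mask: (IL, JP) boolean, True at
    measured points. The scale is the largest magnitude of either part
    over the observed points S_o, so zero stays at zero and the scaled
    values land in [-1, 1]."""
    obs_re = np.abs(S_re[mask])   # shape (n_mic, K)
    obs_im = np.abs(S_im[mask])
    scale = np.maximum(obs_re.max(axis=0), obs_im.max(axis=0))  # (K,)
    return S_re / scale, S_im / scale
```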
The architecture of the proposed neural network, as illustrated in Fig. 4, is based on a U-Net (Ronneberger et al., 2015). We employ partial convolutions (PConv) as proposed for image inpainting in (Liu et al., 2018). In the encoding part of the U-Net, we use a stride of two in the partial convolutions in order to halve the feature maps, while doubling the number of kernels in each layer. The decoder part acts oppositely, upsampling the feature maps and reducing the number of kernels to reach an output tensor $\hat{\mathbf{S}}$ with dimensions matching the input tensor $\mathbf{S}$. We use ReLU as the activation function in the encoding part, and leaky ReLU with a slope coefficient of -0.2 in the decoder. We initialize the weights using the uniform Xavier method (Glorot and Bengio, 2010), initialize the biases as zero, and use the Adam optimizer (Kingma and Ba, 2014) with early stopping when performance on a validation set stops increasing. Due to the increased input and output sizes, we double the number of kernels in all layers compared to the U-Net for magnitude reconstruction. We also do not use a 1x1 convolution with sigmoid activation in the last layer, since the range of our output is not constrained to [0,1] but instead [-1,1]. We have not experienced any decrease in performance from not including this layer.
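For readers unfamiliar with partial convolutions, a minimal PyTorch sketch following Liu et al. (2018) is shown below. It uses a single-channel mask as a simplification and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PartialConv2d(nn.Module):
    """Sketch of a partial convolution (Liu et al., 2018): the kernel is
    applied to masked inputs, renormalized by the local mask coverage,
    and the mask is updated so holes shrink layer by layer."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride, padding=k // 2)
        # Fixed all-ones kernel that counts valid pixels under the window.
        self.register_buffer("ones", torch.ones(1, 1, k, k))
        self.k2 = float(k * k)

    def forward(self, x, mask):
        # mask: (N, 1, H, W) with 1 at valid positions, 0 in holes.
        valid = F.conv2d(mask, self.ones, stride=self.conv.stride[0],
                         padding=self.conv.padding[0])
        out = self.conv(x * mask)
        bias = self.conv.bias.view(1, -1, 1, 1)
        # Renormalize by coverage where any valid pixel exists; else zero.
        scale = self.k2 / valid.clamp(min=1.0)
        out = torch.where(valid > 0, (out - bias) * scale + bias,
                          torch.zeros_like(out))
        return out, (valid > 0).float()
```

In the encoder described above, such layers would be stacked with stride two and a doubling channel count, with the updated mask passed along to the next layer.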
A. Experiments

In this section, we assess the complex-valued sound field reconstruction. The simulated extended dataset introduced in Section IV A is used to train the model. It is important to note that NMSE scores are not directly comparable between magnitude and complex reconstruction, for which reason it is not possible to scrutinize differences between the two types of models. That is, the results presented in the following experiments will stand on their own, and only indicative parallels can be drawn to the results from magnitude reconstruction.

First, we test how the model performs on the simulated data associated with the training data, but held out specifically for evaluation. This test set consists of 190 simulated rooms, the validation set contains approximately 1,000 rooms, and the training set holds the remaining rooms from the 20,000 available rooms. In each room, three different realizations of $\mathcal{S}_o$ are used for each value of $n_{\mathrm{mic}}$. Results in terms of NMSE are shown in Fig. 5. Some tendencies are similar to those observed for magnitude reconstruction, such as improvements in performance with an increasing number of available microphones. At the same time, as frequency increases, performance degrades.

FIG. 5. NMSE in dB for complex reconstruction of simulated sound fields in the test set with 190 different rooms and three realizations of $\mathcal{S}_o$ in each room ($M = 570$ for each value of $n_{\mathrm{mic}}$). The solid lines indicate average NMSE$_k$ shown with 95% confidence intervals. Colors indicate different values of $n_{\mathrm{mic}}$ in the range [5, 55].

Next, we evaluate the complex reconstruction model on the ISOBEL Sound Field dataset. The approach is similar to the experiment in Section IV B, except for the use of the complex-valued sound fields instead of the magnitude. As can be seen from the results in Fig. 6,
performances in the real rooms are not comparable to those from simulated data. Moreover, although it is not possible to compare directly, performance seems worse than what is achieved with the magnitude-based reconstruction in the same rooms, see Fig. 3. That is, the complex reconstruction model does not transfer useful knowledge as successfully from the simulation-based training to the real world. Given that the network is able to reconstruct sound fields which are close to the fields included in the training data, it is indicated that the complex simulations are a poorer match for the real rooms than the magnitude simulations. Two apparent differences are the identical boundary conditions at all surfaces and the perfectly rectangular geometry assumed in the simulations, neither of which holds in the real rooms. To provide insight into how the network behaves relative to rooms which do not match the training data set, we now present the following simulations.

FIG. 6. Average NMSE$_k$ in dB of complex reconstruction in the four measured rooms with $n_{\mathrm{mic}} = 15$.

FIG. 7. NMSE in dB for complex reconstruction of models trained (rows) and tested (columns) on the simulated Listening Room datasets: no added uncertainty, $l_x + \mathcal{U}(-0.25, 0.25)$ m, and $l_x + \mathcal{U}(-1, 1)$ m. Four realizations of $\mathcal{S}_o$ are used in each of the 11 test rooms ($M = 44$). The solid lines indicate average NMSE$_k$ shown with 95% confidence intervals. Colors indicate different $n_{\mathrm{mic}}$ values, i.e., $n_{\mathrm{mic}} = 5$ (blue), 15 (orange), 25 (green), 35 (red), 45 (purple), and 55 (brown).

B. Discussion of Experiments
Several optimizations and fine-tuning approaches have been investigated for the complex reconstruction in real rooms without achieving notable improvements. Instead, we take another approach and show what happens to the model when it is exposed to data that are not represented in the training data. To this end, we are interested in assessing the performance of room-specialized models. That is, if the room dimensions and reverberation time are known, how well will a model trained specifically for that room perform? For this, we introduce new datasets, each with 824 realizations for training, 165 for validation, and 11 for testing. Each simulated realization has a randomly positioned source. In total, three such datasets are generated according to the procedure described in Section IV A. The first dataset assumes that the room characteristics are known perfectly; we use the parameters of the Listening Room. The second and third datasets introduce uncertainty in the room dimensions. In particular, we alter the length and width of rooms, while keeping the aspect ratio ($l_x/l_y$) of the room constant. We accomplish this by uniformly sampling an error, which is added to the length of a room, and correct the width to achieve the original aspect ratio. The two datasets sample errors in the ranges [-0.25, 0.25] m and [-1, 1] m, respectively.

The results for the three models evaluated on each of the test sets are shown in Fig. 7. The first column shows how the three models perform on the dataset with no added uncertainties. Even with small variations on the 0.25 m scale, performance rapidly degrades with increasing frequency. On the diagonal, training data match test data, and once again high frequencies see a significant performance decrease with increasing uncertainty. In general, the models do not perform well on datasets with more variation than what is included in their own training data, which can be seen in the three upper-right figures.

Further experiments showed that the three models do not generalize to the real-world measurements of the Listening Room. This result indicates that the simplifications imposed during the simulations of rooms cause the simulated sound fields to not represent the exact real rooms we intend them to. That is, a model trained with simulated data generated using the exact parameters of a real room will not be able to reconstruct the sound field accurately in the real room. As suggested by our results, neither will a model trained with a ±0.25 m uncertainty in the room dimensions. Moreover, the simulations assume a uniform distribution of absorption. Thus, we hypothesize that the model does not see representative data during training, analogous to not having the correct room dimensions represented in the training data.
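The dimension perturbation used for the second and third datasets can be sketched as follows (illustrative names):

```python
import numpy as np

def perturb_room(lx, ly, err_range=0.25, rng=None):
    """Sketch of the dataset perturbation: add a uniform error to the
    room length and rescale the width so the aspect ratio l_x / l_y is
    preserved. err_range is 0.25 or 1.0 m for the two datasets."""
    rng = rng or np.random.default_rng()
    aspect = lx / ly
    lx_new = lx + rng.uniform(-err_range, err_range)
    ly_new = lx_new / aspect
    return lx_new, ly_new
```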
VI. THE SOUND ZONES APPLICATION

One potential application for the sound field reconstruction presented in this paper is in the process of setting up sound zones. Sound zones generally refer to the scenario where multiple loudspeakers are used to reproduce individual audio signals to individual people within a room (Betlehem et al., 2015). To control the sound field at the location of the listeners in the room, it is necessary to know the RTFs between each loudspeaker and locations sampling the listening regions. If the desired locations of the sound zones change over time, it becomes labor intensive to measure all the RTFs in situ. As an alternative, a small set of RTFs could be measured and used to extrapolate the RTFs at the positions of interest.
1. Setup
For this example, we will explore the scenario where sound is reproduced in one zone (the bright zone) and suppressed in another zone (the dark zone).

The question posed in a sound zones scenario is how the output of the available loudspeakers should be adjusted to achieve the desired scenario. A simple formulation of this problem in the frequency domain is typically denoted acoustic contrast control and relies on maximizing the ratio of mean square pressure in the bright zone relative to the dark zone (Choi and Kim, 2002). This ratio is termed the acoustic contrast and can be expressed as

$$\mathrm{Contrast}(\omega) := \frac{\| \mathbf{H}_B(\omega)\, \mathbf{q}(\omega) \|^2}{\| \mathbf{H}_D(\omega)\, \mathbf{q}(\omega) \|^2} \tag{14}$$

where $\mathbf{H}_B(\omega) \in \mathbb{C}^{M \times L}$ is a matrix of RTFs from $L$ loudspeakers to $M$ microphone positions in the bright zone and $\mathbf{H}_D(\omega) \in \mathbb{C}^{M \times L}$ are the RTFs from the loudspeakers to points in the dark zone. The adjustment of the loudspeaker responses $\mathbf{q}(\omega) \in \mathbb{C}^{L}$ can be determined as the eigenvector of $(\mathbf{H}_D^H(\omega) \mathbf{H}_D(\omega) + \lambda_D \mathbf{I})^{-1} \mathbf{H}_B^H(\omega) \mathbf{H}_B(\omega)$ which corresponds to the maximal eigenvalue (Elliott et al., 2012), where $\cdot^H$ denotes the Hermitian transpose. In this investigation, the regularization parameter is chosen as

$$\lambda_D = 0.{\cdot}\, \big\| \mathbf{H}_D^H(\omega)\, \mathbf{H}_D(\omega) \big\|. \tag{15}$$

This choice is made to scale the regularization relative to the maximal singular value of $\mathbf{H}_D^H(\omega) \mathbf{H}_D(\omega)$, thereby controlling the condition number of the inverted matrix.
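A NumPy sketch of the acoustic contrast control solution at a single frequency is given below; the regularization fraction `reg` stands in for the constant that is truncated in Eq. (15) above and is therefore an assumption.

```python
import numpy as np

def acoustic_contrast_weights(HB, HD, reg=0.01):
    """Acoustic contrast control (Choi and Kim, 2002; Elliott et al.,
    2012) at one frequency. HB, HD: (M, L) complex RTF matrices for the
    bright and dark zones. Returns the loudspeaker weights q and the
    achieved contrast, Eq. (14), in dB."""
    A = HB.conj().T @ HB                  # bright-zone energy matrix
    B = HD.conj().T @ HD                  # dark-zone energy matrix
    lam = reg * np.linalg.norm(B, 2)      # Eq. (15): relative to the max singular value
    # q is the dominant eigenvector of (B + lam I)^{-1} A.
    C = np.linalg.solve(B + lam * np.eye(B.shape[0]), A)
    vals, vecs = np.linalg.eig(C)
    q = vecs[:, np.argmax(vals.real)]
    contrast = (np.linalg.norm(HB @ q) / np.linalg.norm(HD @ q)) ** 2
    return q, 10 * np.log10(contrast)
```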
2. Sparse Reconstruction Method
An alternative method for estimating the RTFs at positions of interest can be obtained from a sparse reconstruction problem inspired by (Fernandez-Grande, 2019). Here, the sound pressure observed at the physical microphone locations is modeled as a combination of impinging plane waves

$$\underbrace{\begin{bmatrix} s(\mathbf{r}_1, \omega) \\ \vdots \\ s(\mathbf{r}_M, \omega) \end{bmatrix}}_{\mathbf{s}(\omega)} = \underbrace{\begin{bmatrix} \phi_1(\mathbf{r}_1) & \cdots & \phi_N(\mathbf{r}_1) \\ \vdots & \ddots & \vdots \\ \phi_1(\mathbf{r}_M) & \cdots & \phi_N(\mathbf{r}_M) \end{bmatrix}}_{\mathbf{\Phi}} \underbrace{\begin{bmatrix} b_1(\omega) \\ \vdots \\ b_N(\omega) \end{bmatrix}}_{\mathbf{b}(\omega)} \tag{16}$$

where $s(\cdot, \cdot)$ is defined in (2), $\phi_n(\mathbf{r}_m) = e^{j \mathbf{k}_n^T \mathbf{r}_m}$ is the candidate plane wave, propagating with wave number $\mathbf{k}_n \in \mathbb{R}^3$, to observation point $\mathbf{r}_m \in R$, and $b_n(\omega) \in \mathbb{C}$ is the complex weight of the $n$th candidate plane wave. The candidate plane waves can be obtained by sampling the wave number domain in a cubic grid. Note that the eigenfunctions of the room used in Green's function can be expanded into a number of plane waves whose propagation directions in the wave number domain equal the characteristic frequency of the eigenfunction ($\|\mathbf{k}_n\| = \omega/c$). This fact was used in (Fernandez-Grande, 2019) to regularize the sparse reconstruction problem as

$$\min_{\mathbf{b}(\omega)}\; \| \mathbf{s}(\omega) - \mathbf{\Phi}\, \mathbf{b}(\omega) \|_2^2 + \lambda\, \| \mathbf{L}(\omega)\, \mathbf{b}(\omega) \|_1 \tag{17}$$

where $\lambda \in \mathbb{R}^+$ and $\mathbf{L}(\omega) \in \mathbb{R}^{N \times N}$ is a diagonal matrix whose diagonal elements express the distance between the characteristic frequency associated with the $n$th candidate plane wave and the angular excitation frequency $\omega$ as $\big|\, \|\mathbf{k}_n\| - \omega/c \,\big|$.

Note that the sparse reconstruction model is not directly comparable to the proposed sound field reconstruction. This is due to the sparse reconstruction relying on knowledge of the absolute locations of the microphone observations. The proposed algorithm, on the other hand, only requires the relative microphone locations on a unitless observation grid.
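As a sketch of how a problem of the form of Eq. (17) could be solved, the following uses ISTA (proximal gradient descent) with a weighted complex soft threshold; the cited work's actual solver may differ, and all names and defaults are illustrative.

```python
import numpy as np

def plane_wave_reconstruction(s, mics, k_grid, w, c=343.0,
                              lam=1e-2, iters=500):
    """Weighted sparse plane-wave fit, Eqs. (16)-(17), via ISTA.
    s: (M,) complex pressures; mics: (M, 3) absolute positions;
    k_grid: (N, 3) candidate wave number vectors; w: angular frequency."""
    Phi = np.exp(1j * mics @ k_grid.T)                 # phi_n(r_m) = e^{j k_n^T r_m}
    d = np.abs(np.linalg.norm(k_grid, axis=1) - w / c) # diagonal of L(w)
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2           # 1 / Lipschitz constant
    b = np.zeros(k_grid.shape[0], dtype=complex)
    for _ in range(iters):
        z = b - step * (Phi.conj().T @ (Phi @ b - s))  # gradient step
        thresh = lam * step * d                        # weighted soft threshold
        mag = np.abs(z)
        b = np.where(mag > thresh,
                     (1 - thresh / np.maximum(mag, 1e-12)) * z, 0)
    return b  # estimate anywhere: s_hat(r) = exp(1j * r @ k_grid.T) @ b
```

Note how the last line uses the absolute position r, which is exactly the requirement that the deep learning-based method avoids.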
3. Experiments
For the experiments, we use the simulated Listening Room from the previous section, with eight loudspeakers placed at the corners of the floor and halfway between the corners. We have two predefined zones in the middle of the room, which are the bright and dark zone, respectively. We then sample random positions in the 32 by 32 x,y-grid 1 m above the floor and use those observations to estimate the RTFs within the zones.

We compare the sparse reconstruction method to the deep learning-based model trained in the previous section. Specifically, the room-specialized models are used. The resulting performance is evaluated in terms of the acoustic contrast over 50 random microphone samplings for each number of microphones. In Fig. 8, the results are based on evaluations using the true RTFs when the loudspeaker weights are determined using either the true RTFs, estimated RTFs based on the model trained with the simulated room with no added uncertainties, or estimates based on the sparse reconstruction.
FIG. 8. Contrast results for the dataset with no added uncertainty to the simulated Listening Room (50 different observation masks). (blue): Perfectly known TFs. (black): Deep learning model. (red): Sparse reconstruction. (dashed): ± ...

It is observed that the deep learning-based model performs better than the sparse reconstruction below 150 Hz for 5 and 15 microphones. Above 150 Hz, both models struggle to provide sufficiently accurate RTFs to create sound zones.

In Fig. 9, the model specialized for the Listening Room with $l_x + \mathcal{U}(-1.0, 1.0)$ m is compared to the sparse
reconstruction. As expected, the resulting performance is reduced for this model. However, it is observed that there is still a benefit when using 5 microphones. At 15 microphones, on the other hand, the performance is comparable for both methods. These results indicate that sound zones could be created based on sound fields extrapolated from very few microphone positions. However, at this stage, it requires models which are specialized to the particular room or a narrow range of rooms. Alternatively, it would be required to increase the number of microphones to improve the accuracy of the estimated RTFs.

FIG. 9. Contrast results for the simulated Listening Room with $l_x + \mathcal{U}(-1.0, 1.0)$ m (50 different observation masks). (blue): Perfectly known TFs. (black): Deep learning model. (red): Sparse reconstruction. (dashed): ± ...
VII. CONCLUSION
In this paper, deep learning-based sound field reconstruction is evaluated using a new set of extensive measurements
from real rooms, which are released alongside the paper. The focus of the work is threefold: examining the performance of simulation-based learning of magnitude reconstruction in real rooms, extending the reconstruction to complex-valued sound fields, and showing a sound zone application taking advantage of the reconstructed sound fields. Experiments in each of the three directions indicate promising aspects of data-driven sound field reconstruction, even with a low number of arbitrarily placed microphones.

In the future, it would be of interest to investigate whether transfer learning can help bridge the discrepancies between simulated and real data. With the addition of more rooms, some could be used in the training phase. Furthermore, three-dimensional reconstruction can be achieved using available convolutional models designed specifically to solve three-dimensional problems.

ACKNOWLEDGMENTS
This work is part of the ISOBEL Grand Solutions project, and is supported in part by the Innovation Fund Denmark (IFD) under File No. 9069-00038A.
The data are collected under the Interactive Sound Zones for Better Living (ISOBEL) project, which aims to develop interactive sound zone systems, responding to the need for sound exposure control in dynamic real-world contexts, adapted to and tested in healthcare and homes. The ISOBEL Sound Field dataset can be accessed at https://doi.org/10.5281/zenodo.4501339.

Further details of the experimental setup and protocol, e.g. equipment, are available in the measurement reports included with the dataset. See footnote 2.

Room B has measurements at a single height: 1 meter above the floor.

The use case with multiple individual audio signals can be realized using superposition of this solution and one where the roles of bright and dark zone are reversed.

Ajdler, T., Sbaiz, L., and Vetterli, M. (2006). "The Plenacoustic Function and Its Sampling," IEEE Transactions on Signal Processing (10), 3790–3804.
Antonello, N., Sena, E. D., Moonen, M., Naylor, P. A., and van Waterschoot, T. (2017). "Room Impulse Response Interpolation Using a Sparse Spatio-Temporal Representation of the Sound Field," IEEE/ACM Transactions on Audio, Speech, and Language Processing (10), 1929–1941.
Bertalmio, M., Sapiro, G., Caselles, V., and Ballester, C. (2000). "Image inpainting," in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, ACM Press/Addison-Wesley Publishing Co., USA, pp. 417–424.
Betlehem, T., Zhang, W., Poletti, M. A., and Abhayapala, T. D. (2015). "Personal Sound Zones: Delivering interface-free audio to multiple listeners," IEEE Signal Processing Magazine (2), 81–91.
Cecchi, S., Carini, A., and Spors, S. (2018). "Room Response Equalization—A Review," Applied Sciences (1), 16.
Choi, J., and Kim, Y. (2002). "Generation of an acoustically bright zone with an illuminated region using multiple sources," Journal of the Acoustical Society of America (4), 1695–1700.
Dong, C., Loy, C. C., He, K., and Tang, X. (2016). "Image Super-Resolution Using Deep Convolutional Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence (2), 295–307.
Elliott, S. J., Cheer, J., Choi, J., and Kim, Y. (2012). "Robustness and regularization of personal audio systems," IEEE Transactions on Audio, Speech, and Language Processing (7), 2123–2133.
Farina, A. (2000). "Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique," in Proceedings of the Audio Engineering Society Convention 108.
Fernandez-Grande, E. (2019). "Sound field reconstruction in a room from spatially distributed measurements," pp. 4961–68.
Genovese, A. F., Gamper, H., Pulkki, V., Raghuvanshi, N., and Tashev, I. J. (2019). "Blind Room Volume Estimation from Single-channel Noisy Speech," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 231–235.
Glorot, X., and Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
ISO 3382-2:2008 (2008). "Acoustics — Measurement of room acoustic parameters — Part 2: Reverberation time in ordinary rooms," Standard.
ITU-R BS.1116-3 (2015). "Methods for the subjective assessment of small impairments in audio systems," Standard.
Jacobsen, F., and Juhl, P. M. (2013). Fundamentals of General Linear Acoustics (John Wiley & Sons).
Karjalainen, M., Makivirta, A., Antsalo, P., and Valimaki, V. (2001). "Low-frequency modal equalization of loudspeaker-room responses," in Audio Engineering Society Convention 111.
Kim, J., El-Khamy, M., and Lee, J. (2020). "T-GSA: Transformer with Gaussian-Weighted Self-Attention for Speech Enhancement," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6649–6653.
Kingma, D. P., and Ba, J. (2014). "Adam: A Method for Stochastic Optimization," arXiv:1412.6980 [cs].
Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., and Shi, W. (2017). "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C., Tao, A., and Catanzaro, B. (2018). "Image Inpainting for Irregular Holes Using Partial Convolutions," in Computer Vision – ECCV 2018, edited by V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Lecture Notes in Computer Science, Springer International Publishing, Cham, pp. 89–105.
Lluís, F., Martínez-Nuevo, P., Møller, M. B., and Shepstone, S. E. (2020). "Sound field reconstruction in rooms: Inpainting meets super-resolution," The Journal of the Acoustical Society of America (2), 649–659.
Mignot, R., Chardon, G., and Daudet, L. (2014). "Low Frequency Interpolation of Room Impulse Responses Using Compressed Sensing," IEEE/ACM Transactions on Audio, Speech, and Language Processing (1), 205–216.
Møller, M. B., and Østergaard, J. (2020). "A Moving Horizon Framework for Sound Zones," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 256–265.
Nowakowski, T., de Rosny, J., and Daudet, L. (2017). "Robust source localization from wavefield separation including prior information," The Journal of the Acoustical Society of America (4), 2375–2386.
Radlovic, B. D., Williamson, R. C., and Kennedy, R. A. (2000). "Equalization in an acoustic reverberant environment: Robustness results," IEEE Transactions on Speech and Audio Processing (3), 311–319.
Ronneberger, O., Fischer, P., and Brox, T. (2015). "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, edited by N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Lecture Notes in Computer Science, Springer International Publishing, Cham, pp. 234–241.
Tylka, J. G., and Choueiri, E. (2015). "Comparison of techniques for binaural navigation of higher-order ambisonic soundfields," in Audio Engineering Society Convention 139.
Verburg, S. A., and Fernandez-Grande, E. (2018). "Reconstruction of the sound field in a room using compressive sensing," The Journal of the Acoustical Society of America (6), 3770–3779.
Vu, T. P., and Lissek, H. (2020). "Low frequency sound field reconstruction in a non-rectangular room using a small number of microphones," Acta Acustica (2), 5.
Yang, M., Ma, M. Q., Li, D., Tsai, Y. H., and Salakhutdinov, R. (2020). "Complex Transformer: A Framework for Modeling Complex-Valued Sequence," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4232–4236.
Yu, W., and Kleijn, W. B. (2021). "Room Acoustical Parameter Estimation From Room Impulse Responses Using Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 436–447.