Direction of Arrival Estimation of Noisy Speech Using Convolutional Recurrent Neural Networks with Higher-Order Ambisonics Signals
Nils Poschadel, Robert Hupke, Stephan Preihs, Jürgen Peissig
Institute of Communications Technology, Leibniz University Hannover, Hannover, Germany
Email: {poschadel, hupke, preihs, peissig}@ikt.uni-hannover.de

Abstract—Training convolutional recurrent neural networks on first-order Ambisonics signals is a well-known approach for estimating the direction of arrival of speech and sound signals. In this work, we investigate whether increasing the Ambisonics order up to the fourth order further improves the estimation performance of convolutional recurrent neural networks. While our results on data based on simulated spatial room impulse responses show that the use of higher Ambisonics orders does have the potential to provide better localization results, no further improvement was observed from order two onwards on data based on real spatial room impulse responses. Rather, it appears to be crucial to extract meaningful features from the raw data: first-order features derived from the acoustic intensity vector were superior to pure higher-order magnitude and phase features in almost all scenarios.
Index Terms—Direction of arrival estimation, higher-order Ambisonics, convolutional recurrent neural network, spherical harmonics.
I. INTRODUCTION
Estimating the direction of arrival (DOA) of sound and speech is a key problem in acoustic signal processing. Neural networks have been shown to be superior to classical parametric approaches in this task, especially in reverberant, noisy, and low-SNR environments [1]–[4]. Recently, DOA estimation based on first-order Ambisonics (FOA) signals has received much attention [4]–[7]. Due to the flexibility and generalizability of the Ambisonics approach, it largely enables microphone-array-independent DOA estimation models.

Perotin et al. [4], [8], [9] investigated the effect of different parameters when training convolutional recurrent neural networks (CRNNs) on FOA data for the DOA estimation of noisy speech. They proposed using features derived from the sound intensity vector as input for the training, achieving greater accuracy in DOA estimation than with pure magnitude and phase information [8]. Furthermore, they showed that a regression approach is at least as suitable as a classification interpretation for single-source DOA estimation with diffuse interference, and that a CRNN trained on spherical coordinates performs worse than a network trained on Cartesian coordinates when using the mean squared error (MSE) or angular distance as loss function [9].

However, despite the increasing availability of higher-order Ambisonics (HOA) microphones, very little research has been conducted on the performance of DOA estimators based on HOA signals.

There are some results from other applications where the usage of HOA signals is advantageous over the use of FOA signals. Pointing experiments with subjects showed a positive influence of the order of the Ambisonics signal on perceptual localization accuracy in a loudspeaker reproduction of a sound field [10]. Similarly, the higher-order model of directional audio coding (HO-DirAC) achieved a higher reproduction accuracy than first-order DirAC in a perceptual evaluation [11], [12].
Investigations on spherical harmonic (SH) beamforming with unsupervised peak clustering [13] also showed an improvement in localization accuracy with increasing Ambisonics order. However, to our knowledge, this topic has not yet been investigated or quantified for state-of-the-art deep learning approaches to DOA estimation.

This work therefore is the first to apply the idea of CRNN-based DOA estimation to HOA signals and to investigate whether, and by how much, the additional spatial information contained in HOA signals can improve the estimation accuracy. We thereby compare our HOA models with FOA models based on both magnitude/phase spectrograms and spectrogram features derived from the acoustic intensity vector.

To the best of our knowledge, there is no sufficiently large dataset of HOA speech signals or impulse responses available. Therefore, we had to create a suitable dataset of noisy speech data with different Ambisonics orders, taking inspiration from the procedure used in [9] for creating a FOA dataset. Due to the way we parameterize the impulse response simulation, this dataset can be used not only for training deep learning models for DOA estimation: it also contains labels regarding room size and geometry as well as acoustic properties such as reverberation time and absorption/scattering coefficients, and will serve as the basis for a number of studies in the context of acoustic analysis based on HOA signals.

We present the details on the generation of our training, validation, and testing data in Sec. III after a brief introduction to the fundamentals of Ambisonics and SH in Sec. II. The configuration of the trained model and the metrics are described in Sec. IV. Finally, the results based on simulated and measured data are compared and discussed in Sec. V and summarized in Sec. VI.

II. AMBISONICS
Ambisonics is a 3D audio surround representation and rendering approach based on the spatial decomposition of the sound field in the orthonormal basis of SH [4], [14]. This section gives an overview of the mathematical principles of Ambisonics. This condensed description of the SH decomposition is based on the more detailed presentation in [15], [16].

In the following, the Cartesian (x, y, z) \in \mathbb{R}^3 and the spherical (r, \theta, \phi) = (r, \Omega) \in [0, \infty) \times [-\pi/2, \pi/2] \times (-\pi, \pi] coordinate systems are used. The x-, y- and z-axes point to the front, left and top, respectively. The angle \phi is the azimuth, which is zero at the frontal direction and increases counterclockwise; \theta is the elevation, which is zero at the horizontal plane and positive above; and r is the radius.

Consider a function f(\theta, \phi) = f(\Omega) \in L^2(S^2) on the unit 2-sphere S^2 := \{x \in \mathbb{R}^3 : \|x\| = 1\}. Then the SH decomposition of f is given by

f(\Omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} f_{nm} Y_n^m(\Omega),   (1)

where Y_n^m is the spherical harmonic of order n and degree m. The coefficients f_{nm} are calculated by

f_{nm} = \int_{\Omega \in S^2} f(\Omega) Y_n^{m*}(\Omega) \, d\Omega,   (2)

where \int_{\Omega \in S^2} d\Omega = \int_{-\pi}^{\pi} \int_{-\pi/2}^{\pi/2} \cos\theta \, d\theta \, d\phi. Equations (1) and (2) show that any square-integrable function on the unit 2-sphere can be approximated by a linear combination of the SH; the approximation even becomes exact for an infinite number of SH. In this paper, the ambiX format [14] is used for the (real) SH Y_n^m:

Y_n^m(\theta, \phi) = N_n^{|m|} P_n^{|m|}(\sin\theta) \cdot \begin{cases} \sin(|m|\phi), & \text{for } m < 0 \\ \cos(|m|\phi), & \text{for } m \geq 0 \end{cases}

with the Legendre functions P_n^m. To build the set of Ambisonics signals according to ambiX, the channels corresponding to the SH are ordered by the Ambisonics channel number ACN = n^2 + n + m and normalised by the SN3D normalisation

N_n^{|m|} = \sqrt{\frac{2 - \delta_m}{4\pi} \cdot \frac{(n - |m|)!}{(n + |m|)!}}.

In the special case of FOA, the channels 1–4 according to ACN are often referred to as W, Y, Z, X.
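The SH definition above can be sketched in Python. This is a minimal sketch under stated assumptions: the function names are ours, and SciPy's associated Legendre function includes the Condon-Shortley phase, which Ambisonics conventions omit, so it is cancelled here.

```python
import numpy as np
from scipy.special import lpmv, factorial

def sn3d(n, m):
    # SN3D normalization N_n^|m| as in the equation above
    # (note: some ambiX definitions omit the 1/(4*pi) factor).
    m = abs(m)
    delta = 1.0 if m == 0 else 0.0
    return np.sqrt((2.0 - delta) / (4.0 * np.pi)
                   * factorial(n - m) / factorial(n + m))

def sh_ambix(n, m, theta, phi):
    # Real SH Y_n^m; theta = elevation, phi = azimuth (ambiX convention).
    # scipy's lpmv includes the Condon-Shortley phase (-1)^m, which
    # the ambiX convention omits, so we cancel it explicitly.
    leg = (-1.0) ** abs(m) * lpmv(abs(m), n, np.sin(theta))
    trig = np.sin(abs(m) * phi) if m < 0 else np.cos(abs(m) * phi)
    return sn3d(n, m) * leg * trig

def acn(n, m):
    # Ambisonics channel number: ACN = n^2 + n + m
    return n * n + n + m
```

For example, `acn(1, -1)`, `acn(1, 0)`, `acn(1, 1)` give channels 1, 2, 3, i.e. the FOA channels Y, Z, X following W.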
III. DATA

A. Simulated SRIRs
The training, validation and testing data was generated from a set of spatial room impulse responses (SRIRs) simulated with the MCRoomSim toolbox [17] as Ambisonics signals up to fourth order in the ambiX format. The approach was inspired by the procedure described in [4]. Altogether, we generated 8000, 500, and 500 rooms with random dimensions of at least 3 m × 3 m × 2 m for the training, validation, and testing set, respectively. The acoustic properties of the walls (frequency-dependent scattering and absorption coefficients) were set to plausible, randomly chosen surfaces from the GRAP database [18]. For every room, one receiver was randomly positioned with a minimum distance of 1.5 m to the walls. Furthermore, one source was randomly positioned at 8 different locations such that the DOAs in the dataset are uniformly distributed. The distance from the source to the receiver was chosen randomly, ensuring that the source and the receiver are at least 1 m apart and that the source is at least 49 cm from a wall. With this setup, we simulated 64 000, 4000 and 4000 fourth-order Ambisonics SRIRs. Although the experiments in this paper were conducted using speech signals with a sampling rate of 16 kHz, the SRIRs were simulated with a sampling frequency of 48 kHz, so that the methods of this paper can be extended to general audio/music signals using the same database. After resampling, the SRIRs were convolved with a randomly chosen sentence from the TIMIT database [19]. This database contains a total of 6300 sentences, 10 sentences spoken by each of the 630 speakers (192 female, 438 male) from eight major dialect regions of the United States. The TIMIT database was split into training, testing and validation sets of 462 (136 female, 326 male), 88 (30 female, 58 male), and 80 (26 female, 54 male) speakers, respectively.
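Placing sources such that the DOAs are uniformly distributed on the sphere requires some care: sampling azimuth and elevation independently and uniformly would oversample the poles. A minimal sketch of one common way to draw such directions (the function name is ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_uniform_doa(n_samples):
    """Sample DOAs uniformly distributed on the unit sphere.

    Azimuth is drawn uniformly in (-pi, pi]; drawing the *sine* of the
    elevation uniformly in [-1, 1] compensates for the shrinking
    circumference of latitude circles towards the poles.
    """
    azimuth = rng.uniform(-np.pi, np.pi, n_samples)
    elevation = np.arcsin(rng.uniform(-1.0, 1.0, n_samples))
    return azimuth, elevation
```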
The training set corresponds to the one recommended by the authors of the TIMIT database. The test set includes the recommended core test set, and it is ensured that there is at least one female and one male speaker per dialect in the validation and test set, respectively.

Furthermore, we added ambient noise to the speech signals similar to the procedure in [4]. To this end, we generated single-channel babble noise by overlaying 50 sentences of the respective sets. This babble noise was then convolved with a diffuse SRIR, which was generated by averaging the simulated diffuse parts of three SRIRs with a receiver placed in the middle of a random room and a randomly positioned source. This ambient noise was added to the speech signal at a signal-to-noise ratio (SNR) between 0 and 20 dB. Finally, these sentences were cut into one-second sequences, which led to 164 303, 10 285 and 10 394 sequences for the training, validation and testing set, respectively.

B. Real SRIRs
For the analysis of DOA estimation performance in a more realistic scenario, we measured real SRIRs in the Immersive Media Lab (IML) [20] at the Institute of Communications Technology. We measured the SRIRs from each of our 36 KH120 loudspeakers to an em32 Eigenmike® [21] microphone at nine different positions, each with two different heights and eight different orientations of the microphone. In total, the described procedure led to 5184 measured SRIRs in the IML, which were afterwards encoded to fourth-order Ambisonics signals using the EigenUnit-em32 encoder (https://mhacoustics.com/eigenunits). These measured SRIRs were used according to the same procedure as the simulated SRIRs to generate HOA speech signals, which resulted in 13 414 sequences for the testing set based on real SRIRs.

IV. DOA ESTIMATION FRAMEWORK
A. Networks and metrics
Our trained networks follow a basic CRNN structure similar to the ones in [3], [4]. A detailed overview of the network architecture is given in Table I, where the final normalization layer scales the prediction to lie on the unit 2-sphere. We formulated this task as a regression problem with the MSE loss function and the Nadam optimizer [22]. For training the network, we used the TensorFlow platform [23]. Since we use a time-distributed output layer and assume
the sources to be static over the whole duration of the signal, we first average the network outputs for each axis over time. We then compare the predicted DOA (\hat{\theta}, \hat{\phi}) with the reference (\theta, \phi) used to synthesize the dataset, using the angular distance

\delta[(\hat{\theta}, \hat{\phi}), (\theta, \phi)] = \arccos[\sin\hat{\theta}\sin\theta + \cos\hat{\theta}\cos\theta\cos(\hat{\phi} - \phi)].

For additional evaluation, we further define the so-called accuracy as the proportion of samples for which the prediction has an angular distance below a given error tolerance.

TABLE I. Architecture of the CRNNs for DOA estimation.

Layer            | Details      | Output Shape
-----------------|--------------|---------------------
Input            | spectrograms | (50, 512, dim_in)
Conv2D           |              | (50, 512, n_filter)
BatchNorm        |              | (50, 512, n_filter)
Activation       | ELU          | (50, 512, n_filter)
MaxPooling       |              | (50, 64, n_filter)
Dropout          | 0.2          | (50, 64, n_filter)
Conv2D           |              | (50, 64, n_filter)
BatchNorm        |              | (50, 64, n_filter)
Activation       | ELU          | (50, 64, n_filter)
MaxPooling       |              | (50, 8, n_filter)
Dropout          | 0.2          | (50, 8, n_filter)
Conv2D           |              | (50, 8, n_filter)
BatchNorm        |              | (50, 8, n_filter)
Activation       | ELU          | (50, 8, n_filter)
MaxPooling       |              | (50, 2, n_filter)
Dropout          | 0.2          | (50, 2, n_filter)
Reshape          |              | (50, 2·n_filter)
BiLSTM           |              | (50, 2·n_filter)
BiLSTM           |              | (50, 2·n_filter)
Time-Dist. Dense | ELU          | (50, 2·n_filter)
Dropout          | 0.2          | (50, 2·n_filter)
Time-Dist. Dense | linear       | (50, 3)
Normalization    |              | (50, 3)
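The time averaging, the renormalization onto the unit sphere, and the evaluation metrics described above can be sketched as follows (a minimal sketch; the function names are ours):

```python
import numpy as np

def predict_doa(frame_outputs):
    # frame_outputs: (T, 3) Cartesian network outputs. Average over
    # time (static-source assumption), renormalize to the unit sphere,
    # then convert back to elevation/azimuth.
    v = np.mean(np.asarray(frame_outputs, dtype=float), axis=0)
    v = v / np.linalg.norm(v)
    theta = np.arcsin(v[2])        # elevation
    phi = np.arctan2(v[1], v[0])   # azimuth
    return theta, phi

def angular_distance(theta_hat, phi_hat, theta, phi):
    # Great-circle distance between predicted and reference DOA (radians),
    # with clipping to guard against round-off outside [-1, 1].
    cos_delta = (np.sin(theta_hat) * np.sin(theta)
                 + np.cos(theta_hat) * np.cos(theta) * np.cos(phi_hat - phi))
    return np.arccos(np.clip(cos_delta, -1.0, 1.0))

def accuracy(errors, tolerance):
    # Proportion of samples with angular error below the tolerance.
    return float(np.mean(np.asarray(errors) < tolerance))
```

For example, a prediction pointing to the front and a reference pointing to the left differ by an angular distance of pi/2.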
B. Input features
The input features of the networks based on HOA signals are pure magnitude and phase spectrograms. In the following, we call these networks HOA-n-CRNN, with n being the order of the HOA signal. We compare our HOA-n-CRNNs to two other published approaches for FOA DOA estimation with CRNNs. On the one hand, Adavanne et al. [3] used pure FOA magnitude and phase spectrograms (FOA-CRNN). Of course, HOA-1-CRNN and FOA-CRNN are identical and will be referred to as HOA-1-CRNN in the following. On the other hand, Perotin et al. [4] proposed using spectrograms of 6-channel features derived from the FOA sound intensity vector according to (3) as input to the CRNN (Intensity-CRNN). By using these features, they were able to significantly improve the localization performance compared to using magnitude and phase spectrograms.

\frac{1}{C(t, f)} \begin{bmatrix} I_a(t, f) \\ I_r(t, f) \end{bmatrix}   (3)

I_a(t, f) and I_r(t, f) denote the active and reactive intensity vectors as short-time Fourier transform (STFT) expressions of the FOA channels, and C(t, f) is a normalization term. They can be computed according to (4), (5) and (6). For further details on acoustic intensity see [4], [24], [25].

I_a(t, f) = -\begin{bmatrix} \mathrm{Re}\{W(t, f) X^*(t, f)\} \\ \mathrm{Re}\{W(t, f) Y^*(t, f)\} \\ \mathrm{Re}\{W(t, f) Z^*(t, f)\} \end{bmatrix}   (4)

I_r(t, f) = -\begin{bmatrix} \mathrm{Im}\{W(t, f) X^*(t, f)\} \\ \mathrm{Im}\{W(t, f) Y^*(t, f)\} \\ \mathrm{Im}\{W(t, f) Z^*(t, f)\} \end{bmatrix}   (5)

C(t, f) = |W(t, f)|^2 + \frac{1}{3}\left(|X(t, f)|^2 + |Y(t, f)|^2 + |Z(t, f)|^2\right)   (6)

The input shape of all the different networks is (50, 512, dim_in), where 50 is the number of frames, 512 the number of frequency bins, and dim_in the number of input channels, with dim_in = 2(n + 1)^2 for the HOA-n-CRNNs and dim_in = 6 for the Intensity-CRNN.
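A minimal sketch of computing these 6-channel features from the FOA STFT channels following (3)–(6). The function name and the `eps` regularizer are our additions; W, X, Y, Z are assumed to be complex STFT arrays of shape (frames, bins):

```python
import numpy as np

def intensity_features(W, X, Y, Z, eps=1e-8):
    """6-channel intensity features from FOA STFTs, per Eq. (3)-(6)."""
    cross = np.stack([W * np.conj(X), W * np.conj(Y), W * np.conj(Z)])
    I_a = -np.real(cross)   # active intensity vector, Eq. (4)
    I_r = -np.imag(cross)   # reactive intensity vector, Eq. (5)
    # Normalization term, Eq. (6)
    C = (np.abs(W) ** 2
         + (np.abs(X) ** 2 + np.abs(Y) ** 2 + np.abs(Z) ** 2) / 3.0)
    # Stack to 6 channels and normalize per TF bin, Eq. (3);
    # eps avoids division by zero in silent bins (our addition).
    feats = np.concatenate([I_a, I_r]) / (C + eps)
    return np.moveaxis(feats, 0, -1)   # -> (frames, bins, 6)
```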
The STFT for the creation of the spectrograms was performed on 640 samples, zero-padded to 1024 samples, with a hop size of 320 samples. To identify the optimal number of filters (n_filter), different values ranging from 32 to 1024 were tested for each network, and the value which resulted in the lowest error on the validation set was chosen. The best values were 256 for the HOA-1-CRNN and HOA-2-CRNN and 512 for all other networks.

V. RESULTS
As expected, the results for the simulated SRIRs are overall slightly better than those for the real SRIRs. Nevertheless, all models show a good and reliable generalization ability. Altogether, the results presented in Fig. 1 and 2 show that the Intensity-CRNN provides the best localization accuracy on both simulated and real SRIRs. This underlines the statement of Perotin et al. [4] that their intensity features are very well suited for deep-learning-based DOA estimation.

Nevertheless, it can be seen in Fig. 1a and 2a that the HOA-n-CRNNs perform better with increasing order n on the simulated data. Both the median and the IQR of the angular distance become smaller with each additional order. In particular, the additional orders of the SH seem to allow a better fine localization: only about 70 % of the predictions of the HOA-1-CRNN lie within a tight error tolerance,

Fig. 1. Box plot of angular distances (°) for the five different networks using simulated (a) and real (b) SRIRs. The boxes are drawn from the first to the third quartile. The horizontal line shows the median. The whiskers extend from the lowest datum still within 1.5 interquartile ranges (IQR) of the lower quartile to the highest datum within 1.5 IQR of the upper quartile.
Fig. 2. Accuracies of the different networks as a function of the error tolerance for the simulated (a) and real (b) SRIRs.

whereas this is the case for about 90 % of the predictions of the HOA-4-CRNN. The rough direction, however, already seems to be well predictable with the HOA-1-CRNN: all considered networks reach an accuracy of about 99 % for a sufficiently large error tolerance.

However, the results for the real SRIRs in Fig. 1b and 2b show that an improvement of the DOA estimates is only obtained when the order is increased from 1 to 2. The HOA-CRNNs of orders 2 to 4 achieve almost identical results.

In Fig. 3, the localization accuracy is evaluated as a function of the SNR of the respective speech signal. As expected, the localization becomes more accurate for each model with increasing SNR. For both simulated and real SRIRs, a slight trend can be seen that the advantage of the Intensity-CRNN over the HOA-n-CRNNs of orders 3 and especially 4 mainly exists at relatively high SNR. In the case of poor SNR between 0 and 4 dB, the HOA-4-CRNN even performs slightly better than the Intensity-CRNN. Otherwise, the ranking of localization accuracy among the models remains the same.

VI. CONCLUSION AND OUTLOOK
In this paper we investigated the influence of the order of HOA signals on the accuracy of single-speaker DOA estimation of noisy speech with CRNNs. We have shown that there is potential in using the additional spatial information of HOA signals for CRNN-based DOA estimation. However, the evaluation on real data has shown that the advantage of this additional information may be reduced in practice due to effects such as an imperfect simulation, a limited generalization capability of the models, or additional measurement noise. Rather, it became very clear that it is highly useful and advisable to extract the information present in the signals in a preprocessing step, to make it more accessible for the network. Only in low-SNR conditions could a slight improvement of the DOA estimation be achieved by using fourth-order Ambisonics signals compared to the Intensity-CRNN.

Since the HOA models seem to perform comparatively well in acoustically challenging scenarios, we will also investigate the effect of the Ambisonics order on localization accuracy in multi-speaker DOA estimation scenarios in the future.
Fig. 3. Box plots of the angular distances of the different networks for different SNR regions and simulated (a) and real (b) SRIRs.

Also, based on the physical motivation and interpretation of the sound intensity features, it can be suspected that the higher-order models are superior to the Intensity-CRNN there. Furthermore, we want to strengthen our results by additional evaluations of our models on more data generated from real SRIRs and also on real recordings. In addition, we want to use our presented dataset to estimate additional parameters such as room volume, reverberation time and frequency-dependent absorption and scattering coefficients using HOA signals.

REFERENCES

[1] S. Chakrabarty and E. A. P. Habets, "Multi-speaker DOA estimation using deep convolutional networks trained with noise signals," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 8–21, 2019.
[2] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," 2015, pp. 2814–2818, IEEE.
[3] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2019.
[4] L. Perotin, R. Serizel, E. Vincent, and A. Guérin, "CRNN-based multiple DOA estimation using acoustic intensity features for Ambisonics recordings," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 22–33, 2019.
[5] S. Kapka and M. Lewandowski, "Sound source detection, localization and classification using consecutive ensemble of CRNN models," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, NY, USA, 2019, pp. 119–123.
[6] A. Politis, A. Mesaros, S. Adavanne, T. Heittola, and T. Virtanen, "Overview and evaluation of sound event localization and detection in DCASE 2019," accessed Feb. 18, 2021. [Online]. Available: http://arxiv.org/pdf/2009.02792v1, 2020.
[7] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. Plumbley, "Polyphonic sound event detection and localization using a two-stage strategy," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, NY, USA, 2019, pp. 30–34.
[8] L. Perotin, R. Serizel, E. Vincent, and A. Guérin, "CRNN-based joint azimuth and elevation localization with the Ambisonics intensity vector," 2018, pp. 241–245.
[9] L. Perotin, A. Défossez, E. Vincent, R. Serizel, and A. Guérin, "Regression versus classification for neural network based audio source localization," 2019, pp. 343–347.
[10] S. Bertet, J. Daniel, E. Parizet, and O. Warusfel, "Investigation on localisation accuracy for first and higher order Ambisonics reproduced sound sources," Acta Acustica united with Acustica, vol. 99, no. 4, pp. 642–657, 2013.
[11] A. Politis, J. Vilkamo, and V. Pulkki, "Sector-based parametric sound field reproduction in the spherical harmonic domain," IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 852–866, 2015.
[12] V. Pulkki, S. Delikaris-Manias, and A. Politis, "Higher-order directional audio coding," in Parametric Time-Frequency Domain Spatial Audio, pp. 141–159, 2018.
[13] M. Green and D. Murphy, "Sound source localisation in Ambisonic audio using peak clustering," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, 2019, pp. 79–83.
[14] C. Nachbar, F. Zotter, E. Deleflie, and A. Sontacchi, "AmbiX – a suggested Ambisonics format," Ambisonics Symposium 2011, 2011.
[15] B. Rafaely, Fundamentals of Spherical Array Processing, vol. 8, Springer Berlin Heidelberg, Berlin, Heidelberg, 2015.
[16] F. Zotter and M. Frank, Ambisonics, vol. 19, Springer International Publishing, Cham, 2019.
[17] A. Wabnitz, N. Epain, C. Jin, and A. van Schaik, "Room acoustics simulation for multichannel microphone arrays," International Symposium on Room Acoustics (ISRA) 2010, 2010.
[18] D. Ackermann et al., "A ground truth on room acoustical analysis and perception (GRAP)," Technische Universität Berlin, 2018.
[19] J. S. Garofolo, TIMIT: Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, Philadelphia, PA, 1993.
[20] R. Hupke, M. Nophut, S. Li, R. Schlieper, S. Preihs, and J. Peissig, "The Immersive Media Laboratory: Installation of a novel multichannel audio laboratory for immersive media applications," Journal of the Audio Engineering Society, 2018.
[21] J. Meyer and G. Elko, "A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield," 2002, vol. 2, pp. II-1781–II-1784.
[22] T. Dozat, "Incorporating Nesterov momentum into Adam," Technical Report, Stanford University, 2015.
[23] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," accessed Feb. 18, 2021. [Online]. Available: http://download.tensorflow.org/paper/whitepaper2015.pdf, 2015.
[24] V. Pulkki, S. Delikaris-Manias, and A. Politis, Eds., Parametric Time-Frequency Domain Spatial Audio, Wiley, Hoboken, NJ, USA, 2018.
[25] F. Jacobsen, "A note on instantaneous and time-averaged active and reactive sound intensity,"