Relative Transfer Function Inverse Regression from Low Dimensional Manifold
Ziteng Wang, Emmanuel Vincent, Yonghong Yan
Abstract—In room acoustic environments, the Relative Transfer Functions (RTFs) are controlled by a few underlying modes of variability. Accordingly, they are confined to a low-dimensional manifold. In this letter, we investigate an RTF inverse regression problem, the task of which is to generate the high-dimensional responses from their low-dimensional representations. The problem is addressed from a purely data-driven perspective, and a supervised Deep Neural Network (DNN) model is applied to learn a mapping from the source-receiver poses (positions and orientations) to the frequency-domain RTF vectors. The experiments show promising results: the model achieves a lower RTF prediction error than the free-field assumption. However, it fails to compete with the linear interpolation technique at small sampling distances.
Index Terms—relative transfer function, inverse regression, deep neural network, manifold learning
I. INTRODUCTION

The acoustic properties of a room environment can be fully characterized by the collection of the Acoustic Impulse Responses (AIRs). An AIR relates two arbitrary poses inside the room: one for the source and the other for the receiver. In the multichannel setup, where the receiver includes multiple microphones, the Relative Transfer Function (RTF) [1] is more often used. The estimation of the RTF is essential in many applications, such as sound field reproduction [2], dereverberation [3], source localization [4] and source separation [5]. Over the years, RTF estimation has been based merely on the measured signals [6]–[8]. While prior knowledge of the RTF given the source-receiver pose could provide additional performance benefits [9], it remains less studied.

The RTF involves two AIRs, the knowledge of which usually relies on explicit physical models. For instance, the classical image-source method [10] is widely used for simulating small-room acoustics. The method assumes the response to be contributed by multiple virtual image sources, and the simulation requires the room to be rectangular and the wall reflection coefficients to be known. If the room geometry can be estimated in practice, an approximation of the AIR can be obtained accordingly [11], [12]. Considering a predefined pose space only, the AIRs can alternatively be parameterized based on the harmonic solution to the wave equation [13]. Moreover, in stationary rooms, AIR measurements can be collected in advance to help the modeling. With a limited number of measurements, compressed sensing with sparsity has been introduced to interpolate the AIR for the early parts [14] or in the low frequency bins [15]. When dense sampling is achievable, simple data-based linear interpolation turns out to be an effective approach [16], [17].

In some common scenarios, e.g., in conference rooms or cars, the stationary assumption made above is reasonable since the layout does not often change over time. Indeed, the AIRs in this case are confined to a low-dimensional manifold because the source-receiver pose now acts as the only varying degree of freedom. This geometrical property was revealed by the manifold learning paradigm [18], first introduced to parameterize linear systems. It was then adopted for RTF modeling [19], [20] and applied in supervised [21] and semi-supervised source localization tasks [22], [23], which associate the manifold with the source poses. The RTF encodes the Interchannel Level Difference (ILD) in decibels and the Interchannel Phase Difference (IPD) in radians. The low-dimensional structure of the ILD and IPD was shown in a binaural setup [24]. Based on a local linearity assumption on the manifold, the method of Probabilistic Piecewise Affine Mapping (PPAM) was proposed for a 2D sound localization task. Specifically, PPAM learned a bijective mapping between the poses and the ILD and IPD, whereas only the interaural-to-pose regression was discussed for localization. The bijective mapping has also been generalized to the cases of multiple sources [25], co-localization of audio sources in images [26] and partially latent response variables [27].

In this letter, we raise the problem of RTF inverse regression, which is defined as approximating the high-dimensional responses from their low-dimensional representations. The question is: how to acquire prior knowledge of the RTF given the source-receiver pose?
As deriving an explicit physical model is not always possible but representative examples of the acoustic environment can be collected in advance, especially with modern smart devices, the problem is addressed from a purely data-driven perspective. A Deep Neural Network (DNN) model is proposed to directly generate the frequency-domain RTF vector given the source-receiver pose. The DNN model is advantageous in that it learns a globally nonlinear mapping from the data examples while preserving a local linearity, which matches the manifold structure implicitly. In the experiments, the acoustic space is sampled uniformly and the DNN model is tested on unseen poses. The evaluation is based on an absolute prediction error measure, while applying the generated RTFs to specific applications is out of scope here.

The rest of this letter is organized as follows. The relevant definitions are given in Section II. The RTF inverse regression task and the possible solutions, including the supervised DNN model, are explained in Section III. The experimental setups and results are presented in Section IV and conclusions are drawn in Section V.

II. DEFINITIONS
In a reverberant room environment, the RTFs represent the coupling between a pair of microphones in response to the source signal. Denote the source as s and the two AIRs from the source to the microphones as h_1, h_2. The observations at time t are written as

    a_m(t) = h_m(t) * s(t) + v_m(t),   m = 1, 2,    (1)

where * denotes convolution and v_m is the noise component in the m-th microphone. Under the narrowband approximation, the signals in the frequency domain are given by

    A_m(l, f) = H_m(f) S(l, f) + V_m(l, f),    (2)

where l is the frame index, f is the frequency index, A_m, S and V_m are the Short Time Fourier Transform (STFT) coefficients of a_m, s and v_m, respectively, and H_m is the Fourier transform of h_m. The RTF is defined as

    H(f) = H_1(f) / H_2(f).    (3)

The RTFs are known to be governed by these parameters: the size and geometry of the room, the reflection coefficients of the walls, and the source-receiver poses. Assuming the room characteristics to be stationary over time, the RTFs are thus confined to a low-dimensional manifold that can be associated with the source-receiver poses as

    H(f) = g(Θ_s, Θ_r),    (4)

where g : R^L → R^D is defined as the mapping function and Θ_s, Θ_r are the pose parameters in the intrinsic Cartesian coordinate system on the embedded manifold. The pose parameters include the position coordinates (x, y, z) and the direction variables (azimuth angle, elevation angle and rotation angle).

In anechoic conditions, the translation from the source-receiver pose to the RTF is straightforward under the free-field assumption, which is given by the direct sounds with (3) and

    H_m(f) = exp(j 2πf ||Θ_{r_m} − Θ_s|| / c) / (4π ||Θ_{r_m} − Θ_s||),    (5)

where j is the complex unit, ||·|| denotes the Euclidean norm and c denotes the speed of sound [13]. However, the association becomes complex in reverberant conditions.
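To make the free-field translation concrete, a minimal NumPy sketch of (3) and (5) follows. The function name, the 16 kHz sampling rate and the example geometry (receiver at [2, 1, 1.4] m with 0.18 m microphone spacing and the source 3 m in front, as in Section IV) are our own choices, not prescribed by the letter.

```python
import numpy as np

def free_field_rtf(src, mic1, mic2, freqs, c=343.0):
    """Free-field RTF H_d(f) for a source and a microphone pair,
    following (3) and (5): each channel is a delayed, attenuated
    direct path; the RTF is the ratio of the two transfer functions."""
    def h(mic):
        d = np.linalg.norm(mic - src)  # source-to-microphone distance
        return np.exp(1j * 2 * np.pi * freqs * d / c) / (4 * np.pi * d)
    return h(np.asarray(mic1, float)) / h(np.asarray(mic2, float))

# 513 frequency bins of a 1024-point STFT at an assumed 16 kHz rate
freqs = np.fft.rfftfreq(1024, d=1.0 / 16000)
H_d = free_field_rtf(np.array([2.0, 4.0, 1.4]),   # source, 3 m in front
                     np.array([1.91, 1.0, 1.4]),  # microphone 1
                     np.array([2.09, 1.0, 1.4]),  # microphone 2
                     freqs)
```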
III. RTF INVERSE REGRESSION

The task of RTF inverse regression is to learn the mapping function g from a given set of pairwise examples {X : Θ_s, Θ_r; Y : H(f)}, while no physical constraint is involved. Three possible solutions are discussed in the following.

Linear interpolation is the intuitive way. Provided that X and Y have the same geometric structure in their separate spaces, the response of a new pose X̂ can be estimated as

    Ŷ = (1 / Σ_i α_i) Σ_{i=1}^{I} α_i Y_i,    (6)

where Y_i is the response of the neighboring pose X_i and α_i is a weighting parameter correlated with the spatial distance from X_i to X̂.

Although PPAM was not proposed for this task, an instantiation of g was realized based on a piecewise linear approximation [25]. The acoustic space is divided into K local regions and each is characterized by an affine transform:

    Y = Σ_{k=1}^{K} I(k) (A_k X + b_k) + ε,    (7)

where I(k) = 1 if X lies in the k-th local region and 0 otherwise, A_k ∈ R^{D×L} is the weight matrix, b_k is the bias vector and ε is an error term described by a Gaussian distribution.

DNNs feature multiple hidden layers trained through error backpropagation. Theoretically, the expression in (7) can be approximated by a neural network with one hidden layer [28], which can be described as

    Y = W^{(1)} ζ(W^{(0)} X + b^{(0)}) + b^{(1)},    (8)

where W^{(i)}, b^{(i)} are the i-th layer parameters and ζ(·) is a nonlinear activation function. More details about the proposed DNN model are given below.
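A minimal sketch of the neighborhood interpolation (6). The Gaussian distance weighting and the neighbor count are assumptions; the letter only states that α_i correlates with the spatial distance.

```python
import numpy as np

def interpolate_rtf(x_new, poses, responses, k=4, sigma=0.05):
    """Estimate the response at pose x_new as a weighted average of its
    k nearest neighbors, following (6).
    poses: (N, L) pose parameters; responses: (N, D) target vectors."""
    d = np.linalg.norm(poses - x_new, axis=1)  # spatial distances
    idx = np.argsort(d)[:k]                    # the I nearest neighbors
    alpha = np.exp(-(d[idx] / sigma) ** 2)     # assumed distance weighting
    return alpha @ responses[idx] / alpha.sum()
```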
A. Inputs and targets

The inputs of the DNN model are the pose parameters {Θ_s, Θ_r} ∈ R^L, whose effective dimension depends on the degrees of freedom in the system, since fixed input values should make no difference to the performance.

The target RTFs are complex vectors, which are represented by the real-valued ILD and IPD as

    ILD = 20 log10 |H(f)|,    (9)
    IPD = arg(H(f)).    (10)

The sine and cosine of the IPD are computed and concatenated with the ILD to form the final target vector for training, which is of D = 513 × 3 dimensions, as the STFT is performed with 1024 points. The setup follows that in [25]; nevertheless, alternative targets could be the real and imaginary parts of the RTF.

Since data normalization is known to help the training, a normalized target RTF is considered as

    H̃(f) = H(f) / H_d(f),    (11)

where H_d(f) is calculated using (3) and (5). This provides a marginal benefit in the later experiments.
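A sketch of the target construction (9)–(11), assuming H is a 513-bin complex RTF vector; the helper name make_target is ours.

```python
import numpy as np

def make_target(H, H_d=None):
    """Build the real-valued training target from a complex RTF vector:
    ILD (9) concatenated with the sine and cosine of the IPD (10),
    optionally after the free-field normalization (11)."""
    if H_d is not None:
        H = H / H_d                     # normalized target, eq. (11)
    ild = 20 * np.log10(np.abs(H))      # eq. (9)
    ipd = np.angle(H)                   # eq. (10)
    return np.concatenate([ild, np.sin(ipd), np.cos(ipd)])  # D = 513 * 3
```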
B. DNN Architecture and training

The DNN model is a basic feed-forward neural network with all the layers linearly connected. The ReLU activation function is used for the hidden layers and linear activation is used for the output layer. Given the previous target selection, a local normalization is enforced on the IPD output nodes:

    o_{f,·} ← o_{f,·} / √(o_{f,s}² + o_{f,c}²),    (12)

which means that the sum of squares of the sine part o_{f,s} and the cosine part o_{f,c} in the f-th frequency bin should equal one.

In the training stage, the layer weights are initialized with zero-mean Gaussian samples whose standard deviation scales inversely with the square root of the layer size, and the bias vectors are initialized with zeros. The model is optimized under the Mean Squared Error (MSE) loss criterion. The Adam method [29] is used to update the model parameters and the learning rate is adjusted adaptively. Layer normalization [30] is applied to the hidden layers to speed up the model convergence. Other regularization techniques, such as dropout and batch normalization, are not found helpful here.
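A PyTorch sketch of the described architecture with the output renormalization (12). The block ordering (linear, layer normalization, ReLU) and the output layout (ILD block followed by sine and cosine IPD blocks) are our assumptions.

```python
import torch
import torch.nn as nn

class RTFNet(nn.Module):
    """Feed-forward RTF regressor: ReLU hidden layers with layer
    normalization, a linear output layer, and the per-bin IPD
    renormalization (12) applied to the sine/cosine outputs."""
    def __init__(self, in_dim, n_bins=513, hidden=1024, n_layers=3):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.LayerNorm(hidden), nn.ReLU()]
            d = hidden
        self.body = nn.Sequential(*layers)
        self.out = nn.Linear(d, 3 * n_bins)   # ILD, sin IPD, cos IPD
        self.n_bins = n_bins

    def forward(self, x):
        y = self.out(self.body(x))
        ild, s, c = y.split(self.n_bins, dim=-1)
        norm = torch.sqrt(s ** 2 + c ** 2 + 1e-8)  # eq. (12): unit circle
        return torch.cat([ild, s / norm, c / norm], dim=-1)
```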
C. Evaluation metric

As far as we know, there is no established measure to evaluate the approximation error of a high-dimensional vector. A mean absolute error metric is chosen here to directly show the performance in each frequency:

    µ_f = (1/N) Σ_{n=1}^{N} |Ŷ_n(f) − Y_n(f)|,    (13)

where Ŷ_n is the predicted response of the n-th test sample. The 95% Confidence Interval (CI) of µ_f is also calculated to give an idea of the error distribution. The ILD and IPD prediction errors are treated separately.
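A sketch of (13) together with a normal-approximation 95% CI half-width, which is one standard way to obtain the reported bounds; ILD and IPD errors would be passed through this helper separately.

```python
import numpy as np

def mae_with_ci(pred, target):
    """Per-frequency mean absolute error (13) and its 95% confidence
    interval half-width. pred, target: arrays of shape (N, F)."""
    err = np.abs(pred - target)        # |Y_hat_n(f) - Y_n(f)|
    mu = err.mean(axis=0)              # mu_f
    ci = 1.96 * err.std(axis=0, ddof=1) / np.sqrt(err.shape[0])
    return mu, ci
```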
IV. EXPERIMENTS AND RESULTS

A. Experimental setup validation
First, we validate the experimental setup before reporting further results on simulated data. The CAMIL dataset¹ is considered for this task as it includes real-world recordings labeled with true poses. A simulated dataset is first created following its original setup, and the models are then tested on both the real and simulated data. We consider the experimental setup validated if similar trends are observed in the performance. Note that the CAMIL dataset consists of binaural recordings, which involve Head-Related Transfer Functions (HRTFs) and slightly change the definition of RTF inverse regression. Nevertheless, this dataset is to our knowledge the only qualified public dataset², and the previous analysis generalizes to this case.

The simulation setup is illustrated in Fig. 1(a). The receiver is positioned at [2, 1, 1.4] m in a 4 m wide room, with a microphone distance of 0.18 m, and the source is fixed 3 meters away in the front. The receiver pose is defined by an azimuth angle and an elevation angle, and data samples are generated on a uniform angular grid, 9600 poses in total. For each pose, a one-second white noise signal is emitted from the source and the captured microphone signals are then used to calculate the corresponding RTF [25]. This setup is one way to measure the RTF in practice, though it can be directly computed from the two AIRs with (3) in the simulation.

¹ https://team.inria.fr/perception/the-camil-dataset/
² https://dev.qu.tu-berlin.de/projects/measurements
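The letter does not spell out how the RTF is computed from the captured signals; the sketch below uses the standard least-squares estimate (frame-averaged cross-power spectrum over the reference auto-power spectrum) as one plausible realization.

```python
import numpy as np
from scipy.signal import stft

def measure_rtf(a1, a2, fs=16000, nfft=1024):
    """Estimate the RTF H(f) = H_1(f)/H_2(f) from two microphone signals
    excited by white noise, as the frame-averaged cross-PSD over the
    reference auto-PSD. This estimator is an assumption; the letter
    only cites [25] for the measurement procedure."""
    _, _, A1 = stft(a1, fs=fs, nperseg=nfft)   # (freq, frames) STFT
    _, _, A2 = stft(a2, fs=fs, nperseg=nfft)
    return (A1 * A2.conj()).mean(axis=1) / (np.abs(A2) ** 2).mean(axis=1)
```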
Fig. 1. Top view of the simulation setup: (a) simulated CAMIL dataset, (b) uniform acoustic space sampling. One in every two samples (marked as dots) in the sample space is used for training.

For the DNN model, 3 hidden layers each with 1024 nodes are used throughout, based on a heuristic search. One in every two samples in the dataset is used for training; half of the rest is used for development and the other half for testing. Early stopping is applied to avoid overfitting when the development set error no longer decreases after a patience of 5 epochs. The method of PPAM [25] (source code provided therein) is investigated and the parameter K is chosen from {64, 128, 256} to achieve the best performance. It is worth noting again that PPAM was not proposed for the task here. Linear interpolation is also considered, but in a simplified way:

    Ŷ_n(f) = [Y_{n,1}(f) + Y_{n,2}(f)] / 2,    (14)

where Y_{n,1} and Y_{n,2} are the response vectors of the spatially adjacent poses. The interpolated IPD is normalized as in (12).
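A sketch of the simplified interpolation (14) followed by the renormalization (12), assuming the target layout of Section III-A (three 513-bin blocks).

```python
import numpy as np

def midpoint_interpolate(y1, y2):
    """Average the targets of two spatially adjacent poses, eq. (14),
    then renormalize the interpolated sine/cosine IPD as in (12)."""
    y = 0.5 * (y1 + y2)
    ild, s, c = np.split(y, 3)              # ILD, sin IPD, cos IPD blocks
    norm = np.sqrt(s ** 2 + c ** 2) + 1e-12
    return np.concatenate([ild, s / norm, c / norm])
```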
The results on the real and simulated CAMIL datasets are given in Table I. The mean values µ_f are further averaged over the frequencies. Similar trends are observed: DNN achieves lower prediction errors than PPAM in ILD but slightly higher errors in IPD, and both methods fail to compete with linear interpolation. This should validate our simulation setup. The reason why DNN and PPAM perform relatively better on the simulated data could be the simplification of the simulation setup.

TABLE I
ILD/IPD PREDICTION ERRORS ON THE REAL (LEFT PART) AND SIMULATED (RIGHT PART) CAMIL DATASETS (MEAN ± CI BOUND)

                    Real                        Simulated
          ILD          IPD           ILD          IPD
PPAM      1.73 ± .057  0.33 ± .015   1.42 ± .066  0.27 ± .013
DNN       1.42 ± .047  0.34 ± .014   1.23 ± .042  0.29 ± .013
Linear    1.03 ± .035  0.18 ± .011   1.07 ± .034  0.20 ± .011

The results also accord with the finding that the acoustic responses are locally linear at small distances [20]. Meanwhile, linear interpolation has the largest parameter size here, which is N(D + L) with N the number of training examples. To show that the DNN model implicitly learns the RTF manifold structure, the target ILDs and the DNN-generated ones are visualized in the low-dimensional space in Fig. 2. The manifolds resemble each other in the sense that the samples are clearly organized according to the poses.
Fig. 2. 2D visualization of the manifold using local tangent space alignment in the scikit-learn toolbox: (a) target ILDs, (b) DNN-generated ILDs. Samples with the same elevation angle have the same color.
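The Fig. 2 embedding can be reproduced with scikit-learn's local tangent space alignment; the neighbor count and the input file name below are placeholders, not values reported in the letter.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

ilds = np.load("ilds.npy")  # (n_poses, 513) ILD matrix, target or DNN output
ltsa = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="ltsa")
embedding = ltsa.fit_transform(ilds)  # 2D coordinates, one row per pose
```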
B. Performance on simulated data
In the rest of the experiments, a more general setup is considered, e.g., in conference rooms, where the microphone positions are fixed and the source moves around in a limited space. Following the previous setup, the receiver pose is fixed and the source pose has three degrees of freedom, while the HRTF is no longer used. The acoustic space is sampled uniformly as shown in Fig. 1(b). Dense data samples are generated every 1 cm, giving a training grid of 200 × 100 × 100 poses in total, and 10,000 extra poses are randomly chosen for the evaluation.

The performance results are given in Table II. The model is evaluated w.r.t. sample distances of {1, 2, 4, 8} cm in the training set, where a larger sample distance also means less training data. The prediction errors clearly go up with larger sample distances. This is most obvious for linear interpolation (using (14) with interpolation samples along the z coordinate, which gives lower errors than along the x or y coordinates): its mean errors in ILD and IPD more than double from the 1 cm case to the 2 cm case. DNN loses to linear interpolation again at small distances but slightly surpasses it from 4 cm onwards. Note that the local linearity of the RTF manifold under the Euclidean distance measure [20] holds within around 3.5 cm in this case. For reference, approximating the targets with H_d(f), i.e., the prior knowledge of the RTF available from the free-field assumption, gives a mean ILD error of 3.53, higher than all the DNN results in Table II.
TABLE II
ILD/IPD PREDICTION ERRORS OF DNN AND LINEAR INTERPOLATION W.R.T. SAMPLE DISTANCE (MEAN ± CI BOUND)

              1 cm          2 cm          4 cm          8 cm
DNN-ILD       3.01 ± .044   3.03 ± .046   3.10 ± .047   3.23 ± .049
Linear-ILD    0.92 ± .017   2.21 ± .036   3.18 ± .049   3.43 ± .052
DNN-IPD       0.46 ± .008   0.48 ± .009   0.50 ± .009   0.54 ± .009
Linear-IPD    0.17 ± .005   0.38 ± .008   0.53 ± .009   0.57 ± .010

The mean errors at different frequencies are plotted in Fig. 3. The errors in the low frequencies are relatively smaller, especially below the spatial aliasing frequency f_a, as shown by H_d(f) (direct) and the DNN model. In the high frequencies, the target values vary more rapidly on the manifold and the variations become harder to capture by linearity. DNN outperforms linear interpolation (LI) only in the high frequencies.
Fig. 3. The mean errors in ILD and IPD along the frequency axis. f_a marks the spatial aliasing frequency.

Considering the use of white noise source signals in the RTF measurement process, there exist measurement errors: the ILD error floor is 0.93 even without sensor noise. Sensor noise is further added at SNRs of {30, 20, 10} dB. The results for the 2 cm sample distance are given in Table III. The prediction errors differ only slightly in the three cases, which means that both methods are quite robust to noise.

TABLE III
ILD/IPD PREDICTION ERRORS W.R.T. SNR (MEAN ± CI BOUND)

              30 dB         20 dB         10 dB
DNN-ILD       3.03 ± .046   3.03 ± .046   3.05 ± .046
Linear-ILD    2.31 ± .036   2.32 ± .036   2.36 ± .037
DNN-IPD       0.48 ± .009   0.48 ± .009   0.48 ± .009
Linear-IPD    0.40 ± .008   0.40 ± .008   0.41 ± .008
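A sketch of the noise injection behind Table III, mixing white Gaussian noise at a target SNR; the exact procedure is not detailed in the letter, so the standard power-ratio scaling is assumed.

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise to a microphone signal at a given SNR (dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(signal.shape)
    # scale so that 10*log10(P_signal / P_noise) equals snr_db
    scale = np.sqrt(np.mean(signal ** 2)
                    / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    return signal + scale * noise

# the SNR levels evaluated in Table III, applied to a dummy signal
noisy = {snr: add_noise(np.sin(np.arange(16000) / 10.0), snr)
         for snr in (30, 20, 10)}
```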
V. CONCLUSION

In this letter, we raised the RTF inverse regression problem for the first time and addressed it in simplified stationary room environments. The trained DNN model directly generated the high-dimensional acoustic responses given the low-dimensional source poses. It performed better than the free-field model and also captured the RTF manifold structure implicitly. The superior performance of linear interpolation at small distances supports the locality property of the RTFs. Experimental evaluation in a specific application or in a different test environment would be valuable future work.

ACKNOWLEDGMENT
The authors would like to thank Antoine Deleforge and Sharon Gannot for their helpful discussions and comments.

REFERENCES

[1] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Transactions on Signal Processing, vol. 49, no. 8, pp. 1614–1626, 2001.
[2] T. Betlehem and T. D. Abhayapala, "Theory and design of sound field reproduction in reverberant rooms," The Journal of the Acoustical Society of America, vol. 117, no. 4, pp. 2100–2111, 2005.
[3] Y. Lin, J. Chen, Y. Kim, and D. D. Lee, "Blind channel identification for speech dereverberation using l1-norm sparse learning," in NIPS, 2007, pp. 921–928.
[4] X. Li, L. Girin, R. Horaud, and S. Gannot, "Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 320–324.
[5] Z. Koldovský, J. Málek, and S. Gannot, "Spatial source subtraction based on incomplete measurements of relative transfer function," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 8, pp. 1335–1347, 2015.
[6] I. Cohen, "Relative transfer function identification using speech signals," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 451–459, 2004.
[7] R. Talmon, I. Cohen, and S. Gannot, "Relative transfer function identification using convolutive transfer function approximation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 546–555, 2009.
[8] M. Taseska and E. A. P. Habets, "Relative transfer function estimation exploiting instantaneous signals and the signal subspace," in European Signal Processing Conference (EUSIPCO), 2015, pp. 404–408.
[9] R. Talmon and S. Gannot, "Relative transfer function identification on manifolds for supervised GSC beamformers," in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), 2013, pp. 1–5.
[10] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
[11] A. Asaei, M. E. Davies, H. Bourlard, and V. Cevher, "Computational methods for structured sparse component analysis of convolutive speech mixtures," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 2425–2428.
[12] A. Asaei, M. Golbabaee, H. Bourlard, and V. Cevher, "Structured sparsity models for reverberant speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, no. 3, pp. 620–633, 2014.
[13] P. Samarasinghe, T. Abhayapala, M. Poletti, and T. Betlehem, "An efficient parameterization of the room transfer function," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 12, pp. 2217–2227, 2015.
[14] R. Mignot, L. Daudet, and F. Ollivier, "Room reverberation reconstruction: Interpolation of the early part using compressed sensing," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 11, pp. 2301–2312, 2013.
[15] R. Mignot, G. Chardon, and L. Daudet, "Low frequency interpolation of room impulse responses using compressed sensing," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, no. 1, pp. 205–216, 2014.
[16] T. Nishino, S. Kajita, K. Takeda, and F. Itakura, "Interpolating head related transfer functions in the median plane," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1999, pp. 167–170.
[17] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second CHiME speech separation and recognition challenge: Datasets, tasks and baselines," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 126–130.
[18] R. Talmon, D. Kushnir, R. R. Coifman, I. Cohen, and S. Gannot, "Parametrization of linear systems using diffusion kernels," IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1159–1173, 2012.
[19] B. Laufer, R. Talmon, and S. Gannot, "Relative transfer function modeling for supervised source localization," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013, pp. 1–4.
[20] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, "A study on manifolds of acoustic responses," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 203–210.
[21] R. Talmon, I. Cohen, and S. Gannot, "Supervised source localization using diffusion kernels," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011, pp. 245–248.
[22] B. Laufer, R. Talmon, and S. Gannot, "Semi-supervised sound source localization based on manifold regularization," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 24, no. 8, pp. 1393–1407, 2016.
[23] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, "Semi-supervised source localization on multiple-manifolds with distributed microphones," arXiv preprint arXiv:1610.04770, 2016.
[24] A. Deleforge and R. Horaud, "2D sound-source localization on the binaural manifold," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2012, pp. 1–6.
[25] A. Deleforge, F. Forbes, and R. Horaud, "Acoustic space learning for sound-source separation and localization on binaural manifolds," International Journal of Neural Systems, vol. 25, no. 01, 2015.
[26] A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin, "Co-localization of audio sources in images using binaural features and locally-linear regression," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 4, pp. 718–731, 2015.
[27] A. Deleforge, F. Forbes, and R. Horaud, "High-dimensional regression with Gaussian mixtures and partially-latent response variables," Statistics and Computing, vol. 25, no. 5, pp. 893–911, 2015.
[28] A. Pinkus, "Approximation theory of the MLP model in neural networks," Acta Numerica, vol. 8, pp. 143–195, 1999.
[29] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[30] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.