C-SL: Contrastive Sound Localization with Inertial-Acoustic Sensors
Majid Mirbagheri, University of Washington ([email protected])
Bardia Doosti, Indiana University Bloomington ([email protected])

Abstract
The human brain employs perceptual information about head and eye movements to update the spatial relationship between the individual and the surrounding environment. Based on this cognitive process, known as spatial updating, we introduce contrastive sound localization (C-SL) with mobile inertial-acoustic sensor arrays of arbitrary geometry. C-SL uses unlabeled multi-channel audio recordings and inertial measurement unit (IMU) readings collected during free rotational movements of the array to learn mappings from acoustic measurements to an array-centered direction-of-arrival (DOA) in a self-supervised manner. Contrary to conventional DOA estimation methods that require knowledge of either the array geometry or source locations in the calibration stage, C-SL is agnostic to both and can be trained on data collected in minimally constrained settings. To achieve this capability, our proposed method utilizes a customized contrastive loss measuring the spatial contrast between source locations predicted for disjoint segments of the input to jointly update the estimated DOAs and the acoustic-spatial mapping in linear time. We provide quantitative and qualitative evaluations of C-SL, comparing its performance with baseline DOA estimation methods in a wide range of conditions. We believe the relaxed calibration process offered by C-SL paves the way toward truly personalized augmented hearing applications.
Introduction

Humans localize sounds by comparing inputs across the two ears, resulting in a head-centered representation of sound-source location [1]. When the head moves, the brain combines inertial information about head movement with the head-centered estimate to correctly update the world-centered sound-source location in a cognitive process known as auditory spatial updating [2, 3]. Existing methods for sound localization with microphone arrays differ from the human auditory system in two major aspects: (i) unlike humans, who adapt to changes in auditory localization cues without supervision [4], these algorithms rely on either specific array geometries or access to sample sounds with known spatial information in order to operate; (ii) these methods do not account for array movements, as they are mostly designed for static applications. With the advent of augmented reality (AR) technologies embodied in mobile devices such as smart glasses and headphones, addressing these gaps can extend the versatility of these algorithms to more applications in this domain.

Calibration of the arrays in conventional source localization methods involves measuring array responses to signals coming from known directions when these responses cannot be analytically determined as a function of the array geometry. Once the array is calibrated, these methods use the stored responses as some form of lookup table. Popular methods in this category include those based on steered response power (SRP) [5] and subspace approaches such as multiple signal classification (MUSIC) [6]. The grid search involved in these methods, however, usually incurs considerable computational cost, while the performance is restricted by the grid resolution.

In an effort to overcome these issues, supervised learning algorithms using deep neural networks (DNN) have more recently gained significant attention in the field [7, 8, 9, 10, 11, 12]. Given acoustic measurements with known spatial labels, in the form of a single direction or spherical intensity field representations, DNN-based methods solve a nonlinear regression problem to predict labels from measurements via an iterative gradient-based optimization algorithm.
A common problem of learning-based methods is their sensitivity to mismatches between the distributions of the data used for training and testing. This issue can be especially severe for mobile arrays with microphones that are fit in the ear or installed on head-mounted or hand-held devices. The directional pattern of such arrays depends not only on the relative positioning of the microphones but also on the unique anatomical geometry of the user's head and ears, the device fit, or how the device is handled by the user. On one hand, augmenting training sets with all such variabilities is in general an infeasible task, which eliminates the possibility of calibrating the array prior to deployment. On the other hand, collection of acoustic data with clean spatial labels cannot currently take place on a per-user basis, as it requires elaborate lab setups or computationally expensive simulations.

Contrastive learning is an emerging paradigm proposed to overcome the data limitations of supervised methods through self-supervision, namely the automatic labeling of data by comparing different views of it across time, space, or sensor modalities [13, 14, 15]. This paradigm has been successfully used for visual object detection [16, 17] and audio-visual source localization [18, 19, 20, 21]. Studies in neuroscience suggest that the human brain utilizes predictive coding, a special form of self-supervision, to encode sound attributes [22]. The spatial updating process in the brain also, by its nature, uses a contrastive measure based on the spatial displacement of the head to update the head-centered sound-source location as the individual moves [2, 3]. The inertial information involved in the calculation of head attitude and motion is provided by the vestibular organs in the inner ear. While this process has mostly been investigated in the context of localization inference, a contrastive learning framework for sound localization based on spatial updating that imitates plasticity in spatial auditory processing has yet to be developed. Such a framework would bridge the gap between traditional DOA estimation methods and human spatial auditory processing. In applications, inertial information has previously been utilized to increase the robustness of visual odometry [23, 24] and simultaneous localization and mapping (SLAM) systems [25].
Contributions
In this paper, we propose, to the best of our knowledge, the first contrastive learning framework for sound localization with inertial-acoustic sensors based on the cognitive process of auditory spatial updating. Our algorithm, named C-SL, is able to localize both narrowband and wideband sources and, in contrast to existing DNN- and grid-search-based methods, is agnostic to the array geometry and to knowledge of source locations in the calibration process. The only assumption we make is that during training there is only one far-field source present, and that the location of this source is approximately piece-wise constant in a reference coordinate frame, which we refer to as the world frame in the rest of the paper. To train our model, we use a customized loss that leverages this assumption and minimizes the spatial contrast between predictions for consecutive segments of the input in the calibration stage. In the next section, we describe the data model, followed by how the contrastive loss is computed, and the model architecture.
Data Model: Assuming a single far-field sound wave impinging on a microphone array, the output of the microphone with index $m \in \{1, \ldots, M\}$ is given in the short-time Fourier transform (STFT) domain by
$$Y^m_{k,n} = H^m_k(r^s_n)\, S_{k,n} + V^m_{k,n} \qquad (1)$$
where $S_{k,n}$ is the source signal, $H^m_k(r^s_n)$ is the acoustic transfer function (ATF) of the source at location $r^s_n$ with respect to the $m$-th microphone, $V^m_{k,n}$ models noise and reverberation, and $k$ and $n$ are the frequency and time-frame indices, respectively. Since we are interested in far-field localization, we denote the source location as a 3-D vector on the unit sphere, $r^s \in \mathbb{S}^2$. With this definition, the location is the same for all microphones and is hence referred to as the sensor-frame direction. Throughout the paper, bold symbols represent $M$-dimensional vectorized versions of quantities related to the microphone array, $\langle \cdot, \cdot \rangle$ is the inner product, $\|x\|$ denotes the $\ell_2$ norm of a vector, and $\hat{x} = x / \|x\|$ for all vectors $x \neq 0$.

During training data collection, the array is rotated in all directions to densely sample acoustic measurements along arbitrary trajectories on $\mathbb{S}^2$. With a 9-DOF inertial measurement unit (IMU) attached to the array, orientations of the array with respect to the earth (world) frame, represented by quaternions or Euler angles, can be calculated from raw IMU readings [26]. Given the correspondence between orientations and rotation matrices in 3-D space [27], we assume that for any given time frame we know the corresponding rotation matrix $R_n \in SO(3)$ with which we can transform any direction in the sensor coordinate frame to the world coordinate frame by
$$r^w_n = R_n\, r^s_n \qquad (2)$$
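As a concrete illustration of Eq. (2), the short NumPy sketch below converts an IMU orientation quaternion into a rotation matrix and maps a sensor-frame direction into the world frame. The (w, x, y, z) quaternion convention and the function names are our own illustrative choices, not part of the original implementation; in practice the orientation would come from a filter such as [26].

```python
import numpy as np

def quat_to_rotation_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix R in SO(3)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def sensor_to_world(r_s, q_n):
    """Apply Eq. (2): r^w_n = R_n r^s_n, with R_n derived from the IMU orientation q_n."""
    R_n = quat_to_rotation_matrix(q_n)
    return R_n @ r_s

# Example: a sensor-frame direction rotated 90 degrees about the z-axis.
r_s = np.array([1.0, 0.0, 0.0])                                # unit vector in the sensor frame
q_n = np.array([np.cos(np.pi/4), 0.0, 0.0, np.sin(np.pi/4)])   # 90-degree rotation about z
r_w = sensor_to_world(r_s, q_n)                                # approximately [0, 1, 0] in the world frame
```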
Spatial Constancy:
Considering that $r^w$ changes at a slow rate (in contrast to $r^s$), we assume it to be (approximately) constant over time intervals, denoted by $\{I_i\}_{i=1}^{N_i}$, with $I_i = \{n \,|\, n_i \le n < n_{i+1}\}$.

The approach described above avoids this situation by considering only subsets with likely the most different array orientations and making sure their world-frame centroids match. Randomizing break points across training epochs improves stochasticity and consequently the generalizability of the model.

Similar to the contrastive loss, the sub-contrastive loss is invariant to reflection of the predicted sensor-frame directions with respect to the origin. This sign ambiguity in the predictions can be easily resolved in many cases via a postprocessing stage in which $f_\theta$ is negated based on additional criteria. For instance, when simple knowledge about the relative position of microphones, such as "mic A's coordinate in the sensor frame has a larger value on the x-axis than mic B's," is available, a comparison of intensities or delays of sounds received at the two microphones can determine whether the predicted directions should be reflected or not. Alternatively, the orientation of the array at the beginning of data collection can be set with respect to the source in a way that a general condition such as $\langle r^s_{n=0}, \hat{i} \rangle > 0$ is enforced and later used to disambiguate the mapping.

The centroids computed in (5) can be interpreted as denoised approximations of the predicted world-frame directions. At training time, we need two versions of this estimate for each interval to make contrastive learning possible. At inference time, however, there is no such need, and centroids can be computed directly over all time-frequency bins within each interval, provided that they all belong to the same source. In multi-source conditions, world-frame predictions for different bins appear in clusters representing the different sources to which they belong. As we will see in the multi-source experiment below, in such situations we can run a clustering algorithm on these predictions in the same vein as [28] and use the associated cluster centers as denoised approximations of the world-frame directions for each bin. Regardless of the number of sources, denoised sensor-frame directions are computed by transforming the world-frame centroids back into the sensor frame for each time-frequency bin.

In contrast to the quadratic time of the contrastive loss, computation of the sub-contrastive loss takes only linear time with respect to the number of time-frequency bins, resulting in very efficient training of the model by C-SL.

Model Architecture: We design $f_\theta$ as a multi-layer perceptron (MLP), agnostic to the underlying spatial configuration of the array (as opposed to a convnet, for example). The MLP we use consists of three hidden layers with 1024, 512, and 256 units. Each hidden layer is followed by a (parameter-free) weight normalization layer [29] and a standard ReLU non-linearity. The third hidden layer provides the input to a linear prediction layer of size three.

It should be noted that while $f_\theta$ could be optimized separately for each frequency, we opt for a single mapping conditioned on frequency, in view of the fact that the array spectral profile is inherently low dimensional.
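A minimal PyTorch sketch of one plausible realization of $f_\theta$ is given below. The exact form of the parameter-free weight normalization in [29] is not spelled out in the text, so the unit-norm reparameterization used here, as well as the class and argument names, are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitNormLinear(nn.Module):
    """Linear layer whose weight rows are renormalized to unit norm at every call.

    One reading of the "parameter-free" weight normalization of [29] (no learnable gain);
    the paper does not spell out the exact variant used.
    """
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        w = F.normalize(self.linear.weight, dim=1)  # unit-norm rows, no extra parameters
        return F.linear(x, w, self.linear.bias)

class DOANet(nn.Module):
    """MLP f_theta mapping per-bin acoustic features to an unnormalized sensor-frame direction.

    in_dim is the per-bin feature size (real and imaginary STFT parts of the M channels plus
    the normalized-frequency feature); the norm of the 3-D output doubles as a confidence weight.
    """
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            UnitNormLinear(in_dim, 1024), nn.ReLU(),
            UnitNormLinear(1024, 512), nn.ReLU(),
            UnitNormLinear(512, 256), nn.ReLU(),
            nn.Linear(256, 3),  # linear prediction layer of size three
        )

    def forward(self, features):
        return self.net(features)
```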
Dataset: In order to evaluate the proposed C-SL, we synthesize a dataset consisting of hybrid acoustic-inertial data, in the same vein as most DNN-based methods, which need large amounts of data [7, 8, 9, 11]. Our dataset consists of multiple recording sessions, each of which simulates acoustic-inertial data from one interval as described above. We simulate measurements in a rectangular room with point sources randomly placed away from the room center. Source locations remain consistent within a particular session but vary from session to session. Without loss of generality, we use an array with a cubical configuration: 8 omni-directional microphones positioned at the corners of a cube whose center of mass is always at the room center. For the array motion, we consider rotations at a constant angular velocity of magnitude $\pi$ rad/s about a fixed but random axis for each session. The orientation of the array at the beginning of each session is set to a unit quaternion randomly chosen from $\mathbb{S}^3$. Translational motion was not considered in our experiments in order to avoid violating the far-field assumption. We calculated room impulse responses in five different reverberant room conditions, four with fixed reverberation times and one with mixed reverberation times. In all conditions, we used a GPU-based geometrical acoustics simulator, gpuRIR [30], to model sound propagation and reverberation based on the rectangular-room image-source model [31]. In the first four conditions, we set the broadband reverberation time of the room, $T_{60}$, to 0 (anechoic), 250, 500, and 750 ms, respectively. For the last condition, we randomly sampled $T_{60}$ values from a range extending up to 750 ms for each session. We refer to these five conditions as $C_{\text{anechoic}}$, $C_{250}$, $C_{500}$, $C_{750}$, and $C_{\text{mixed}}$. To generate one session with a single source in each of these conditions, one dry speech recording from the TIMIT corpus [32], with an average length of approximately 3 s, was convolved with the simulated room impulse responses. We split the 6300 utterances in the corpus into three subsets of size 5012, 638, and 638 for training, test, and validation, respectively. All utterances in the three splits were then spatialized, each resulting in one session in the corresponding split. All recordings, sampled at 16 kHz, were transformed into the STFT domain using frames of length 25 ms, a hop length of 10 ms, and a Hann window.

Algorithm 1: C-SL Training
  θ ← initialize model parameters
  while not converged do
    B ⊂ {1, …, N_i} ← random mini-batch of interval indices
    Y_{k,n}, R_n ← data at intervals {I_i}, i ∈ B
    {(I_{i,1}, I_{i,2})} ← sub-intervals with random break points
    r̃^s_{k,n} ← f_θ(Y_{k,n}, k̃)
    r̃^w_{k,n} ← R_n r̃^s_{k,n}
    r̄^w_{i,l} ← normalized Σ_{n ∈ I_{i,l}, k} r̃^w_{k,n},   i ∈ B, l = 1, 2
    L_sub-cont ← Σ_{i ∈ B} ‖ r̄^w_{i,1} − r̄^w_{i,2} ‖
    θ ← Adam(∇_θ L_sub-cont, θ)
  end while
  if reflection condition satisfied then f_θ ← −f_θ end if

Figure 1: Overview of the proposed framework for Contrastive Sound Localization (C-SL). Sensor-frame predictions are transformed into the world frame (S → W) with the help of inertial information provided by the IMU. During training, the sub-contrastive loss measures the distance between world-frame predictions aggregated over time-frequency bins within sub-intervals of the data.
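Below is a minimal PyTorch sketch of the sub-contrastive loss at the heart of Algorithm 1. The function name, tensor layout, and the use of boolean masks to select the two random sub-intervals of each interval are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sub_contrastive_loss(r_w, sub_interval_masks):
    """Sub-contrastive loss of Algorithm 1 (sketch).

    r_w:                (B, 3) world-frame predictions R_n @ f_theta(Y_kn, k) for all
                        time-frequency bins of the mini-batch, flattened.
    sub_interval_masks: list of (mask_1, mask_2) boolean tensors of shape (B,), one pair
                        per interval, selecting the bins of its two random sub-intervals.
    """
    loss = r_w.new_zeros(())
    for mask_1, mask_2 in sub_interval_masks:
        # Centroids: normalized sums of world-frame predictions over each sub-interval.
        c1 = F.normalize(r_w[mask_1].sum(dim=0), dim=0)
        c2 = F.normalize(r_w[mask_2].sum(dim=0), dim=0)
        # Spatial contrast between the two centroids of the same interval.
        loss = loss + torch.linalg.vector_norm(c1 - c2)
    return loss
```

In a training step, `r_w` would be obtained by applying the MLP sketched earlier to the per-bin features and rotating its outputs with the frame-wise matrices $R_n$ before calling this function.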
Rotation matrices associated with the time frames were calculated based on the angular velocity and initial orientation of the array chosen for the session and the timestamps of the frames.

The input features to the network are (2M+1)-dimensional vectors computed for each time-frequency bin, consisting of the real and imaginary parts of the array STFT coefficients, concatenated and normalized to a unit-norm 2M-dimensional vector, and an additional feature representing the normalized frequency of the bin. The normalization of the acoustic features instructs the model to disregard content- and distance-related variations of intensity across time and frequency.

In order to save memory and also satisfy the far-field assumption, during training we only picked, from each session, bins in a frequency range starting at 340 Hz whose original STFT magnitudes were greater than some ratio (arbitrarily set to -40 dB) of the maximum magnitude over all bins with the same frequency.

We optimized the parameters of the model iteratively on selected mini-batches of simulated sessions from the $C_{\text{mixed}}$ dataset. We used the Adam optimizer [33] with a batch size of 8. All models were trained with 2 GPUs for 300 epochs. The procedural training details are summarized in Algorithm 1.

We used the angle (in degrees) between the estimated sensor-frame directions and their ground-truth values used in the simulation, formulated as $\sigma(\bar{r}^s, r^s) = 180/\pi \cdot \cos^{-1}(\langle \bar{r}^s, r^s \rangle)$, as the DOA estimation error metric in our evaluations. In particular, for C-SL, $\bar{r}^s$ refers to the final denoised sensor-frame estimates.
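For completeness, this metric can be computed directly from unit direction vectors, as in the small NumPy helper below (the clipping guard and the function name are our additions).

```python
import numpy as np

def doa_error_degrees(r_est, r_true):
    """Angular DOA error sigma(r_est, r_true) = 180/pi * arccos(<r_est, r_true>) in degrees."""
    cos_angle = np.clip(np.dot(r_est, r_true), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cos_angle))
```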
Results: The distinguishing characteristic of C-SL is its self-supervised nature and the new applications made possible because of it, most notably when source locations are not available for array calibration. In order to demonstrate this capability, we ran C-SL under a wide range of conditions and compared its performance with two baseline methods that leverage knowledge of the array transfer functions: the well-studied SRP-PHAT algorithm [5], and another approach, LSDD with soft time-frequency masks [34], recently proposed for highly reverberant environments. (Existing DNN-based methods could not be trained on our dataset, since they require both the source and the array to be stationary for at least several seconds.) In summary, SRP-PHAT estimates the sensor-frame direction by the maximum of the normalized cross-power spectral density (CPSD), steered in all possible directions $\{r_j\}_{j=1}^{J}$, i.e.,
$$\bar{r}^s_{n,\text{SRP}} = \arg\max_{r_j} \sum_k \left| \sum_{m=1}^{M} A^{m*}_k(r_j)\, \frac{\Phi^{m,1}_{k,n}}{|\Phi^{m,1}_{k,n}|} \right|$$
where $A^m_k(r_j) = H^m_k(r_j)/H^1_k(r_j)$, and $\Phi^{m,1}_{k,n} = E(Y^m_k Y^{1*}_k)$ is the cross-power spectral density between the $m$-th and first microphone signals estimated for a window $I_n$ centered at time index $n$. The LSDD method directly uses the similarity between the array outputs and precomputed ATFs, weighted by a mask measuring direct-path dominance, to estimate DOAs. In particular, it first computes a spatial spectrum for each bin,
$$\phi_{k,n}(r_j) = \arccos \frac{|\langle \mathbf{H}^H_k(r_j), \mathbf{Y}_{k,n} \rangle|}{\|\mathbf{H}_k(r_j)\|\, \|\mathbf{Y}_{k,n}\|}$$
where $(\cdot)^H$ is the Hermitian transpose. Soft masks are then calculated as $w_{k,n} = \min_j \phi_{k,n}(r_j)$. Finally, it finds the sensor-frame direction for each interval through a grid search, $\bar{r}^s_{n,\text{LSDD}} = \arg\min_{r_j} \sum_{k,\, n' \in I_n} w^{-\alpha}_{k,n'}\, \phi_{k,n'}(r_j)$, in which $\alpha > 0$ is a selectivity factor. We used a uniform grid consisting of 13744 directions for both LSDD and SRP-PHAT. Both of these methods estimate DOAs for wideband sources, i.e., they use moving windows to estimate one direction for the center time frame. We found that SRP-PHAT performed best when frames were restricted to a frequency range starting at 340 Hz. For LSDD, we chose the frequency range and selectivity factor $\alpha = 3$ prescribed in [34] for a similar cubical array.

Table 1: Comparison of DOA estimation errors (in degrees) in the single-source condition, evaluated for different reverberation times and window lengths $L_{\text{win}}$ (methods: SRP-PHAT [5], LSDD [34], and the proposed C-SL; conditions: $C_{\text{anechoic}}$, $C_{250}$, $C_{500}$, $C_{750}$; window lengths from 0.05 s up to the full sentence, ~3 s).

An important factor for the performance of DOA estimation methods is the duration of the windows over which they apply the pooling. While shorter windows are desired for moving sources, longer ones can improve accuracy as they provide more observations. To investigate this trade-off, we evaluated all three methods with moving windows extracted from the datasets with fixed $T_{60}$ and five different durations, ranging from $L_{\text{win}} = 0.05$ s (with intermediate values including 0.2 and 0.5 s) up to "Full", referring to the case when the full sentence (~3 s) was used for the pooling. As shown in Table 1, all three methods perform better with longer window lengths as the reverberation in the environment increases. However, with an increase in window length, the performance of LSDD and SRP-PHAT eventually drops, while C-SL consistently improves as more observations become available, in all four conditions. This can be explained by the fact that C-SL applies the pooling in the world frame, whereas the other two do so in the sensor frame. When the motion of the array and that of the source are independent (e.g., stationary sources), directions in the world frame vary more slowly, and thus C-SL benefits more from longer windows.

In the second experiment, we performed an assessment of the confidence weights predicted by C-SL. To find these weights, we calculated the $\ell_2$ norm of the MLP output for all time-frequency bins extracted from the test dataset in the $C_{\text{mixed}}$ condition. Figure 2(a) illustrates a scatter plot of the bins with weights in the 95th percentile for a highly reverberated session in this dataset ($T_{60}$ = 750 ms), overlaid on the session spectrogram. As expected, the majority of these bins are concentrated around signal onsets at frequencies carrying higher energy. We also examined how these confidence scores are related to the sensor-frame estimation errors, calculated as $\sigma(r^s, \hat{\tilde{r}}^s)$. Figure 2(b) shows the error curve vs. confidence weights, found using a quantile-based binning of the time-frequency bins. The monotonic decrease in average errors indicates that the model has successfully learned to predict the uncertainty in its estimations.

Figure 2: Evaluation of confidence weights predicted by C-SL: (a) Scatter plot of time-frequency bins (red pixels) with estimated confidence in the 95th percentile, overlaid on a sample spectrogram. (b) Sensor-frame direction errors $\sigma(r^s, \hat{\tilde{r}}^s)$ vs. predicted confidence $\|\tilde{r}^s\|$ estimated by C-SL, averaged over percentile groups of time-frequency bins in the $C_{\text{mixed}}$ dataset.

In our last experiment, we investigated the application of C-SL at inference time in a multi-source environment. In such conditions, it can be assumed that each time-frequency bin is dominated by one source; therefore, finding the locations of the sources can be cast as a clustering task in which time-frequency bins are assigned to different clusters based on their predicted world-frame direction. While many approaches could be used for the clustering, we opted for a non-parametric kernel density estimation (KDE) based technique to detect dominant directions. In this method, given estimated world-frame directions for an ensemble of bins within a window, we first approximate the weighted density of directions on a uniform grid (the same as the one used in the first experiment) by
$$\psi_n(r_j) = \sum_{k,\, n' \in I_n} \left\| \tilde{r}^w_{k,n'} \right\| e^{-\sigma(r_j,\, \hat{\tilde{r}}^w_{k,n'})/\alpha}$$
where we set $\alpha = 1^\circ$ as the bandwidth of the kernel. Given the maximum number of sources $N_{\text{src}}$, we then find local maxima of the function $\psi$ on the grid and pick the $N_{\text{src}}$ peaks with the highest density as the estimates of the source directions for the window.
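The following NumPy sketch illustrates this KDE-based peak picking on a precomputed direction grid. The helper names, the input layout, and the simple "no higher-density grid point within a small angular neighborhood" notion of a local maximum are our own illustrative assumptions.

```python
import numpy as np

def angular_distance_deg(a, b):
    """Pairwise angles (degrees) between unit vectors in a (P, 3) and b (Q, 3)."""
    cos = np.clip(a @ b.T, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def kde_peak_directions(r_w, grid, n_src, alpha=1.0, neighborhood=5.0):
    """Pick n_src dominant world-frame directions from per-bin predictions r_w (N, 3).

    Density on the grid (Q, 3): psi(r_j) = sum_i ||r_w_i|| * exp(-sigma(r_j, r_w_i_hat) / alpha),
    i.e. predictions are weighted by their norm (confidence). A grid point is kept as a local
    maximum if no other grid point within `neighborhood` degrees has a higher density.
    """
    norms = np.linalg.norm(r_w, axis=1)
    r_hat = r_w / np.maximum(norms, 1e-12)[:, None]
    dist = angular_distance_deg(grid, r_hat)              # (Q, N) angles between grid and bins
    psi = (np.exp(-dist / alpha) * norms[None, :]).sum(axis=1)

    grid_dist = angular_distance_deg(grid, grid)          # (Q, Q) angles between grid points
    is_peak = np.array([
        psi[j] >= psi[grid_dist[j] < neighborhood].max() for j in range(len(grid))
    ])
    peak_idx = np.where(is_peak)[0]
    top = peak_idx[np.argsort(psi[peak_idx])[::-1][:n_src]]
    return grid[top], psi[top]
```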
Figure 3: Log-scale kernel density values estimated by C-SL for a sample time window of length 200 ms. (·) and (×) depict predicted and ground-truth source locations on the grid.

Table 2: C-SL estimation errors (in degrees) in the two-speaker scenario, computed for different window lengths $L_{\text{win}}$ in the anechoic condition and reported as the weighted Chamfer distance $d_{\text{w-chamfer}}$.

For this experiment, we created a new test dataset in the anechoic condition consisting of 600 sessions, each with two sources. To generate each session, two sentences were randomly selected from the test split generated before and spatialized according to random, independent locations selected for each of the sources, with array motions similar to the single-source condition. The spatialized sounds from the two sources were then added together (after padding the shorter one with zeros at the end) to generate the recordings at each microphone. Finally, time windows extracted from these recordings with at least one source present were used to evaluate the performance of C-SL in a two-speaker condition. Figure 3 shows log-scale grid densities calculated for a sample input window and the corresponding predicted and ground-truth pairs of directions. As can be seen, the identified peaks are very sharp and lie close to the ground-truth locations of the sources, a trend that we found to be generally true when using C-SL.

In order to quantify the error between the predicted set of sensor-frame directions and their ground-truth values, denoted by $\bar{\mathcal{R}}^s$ and $\mathcal{R}^s$, we used a weighted version of the Chamfer distance to match directions in the two sets and measure the deviation between them as follows:
$$d_{\text{w-chamfer}}(\bar{\mathcal{R}}^s, \mathcal{R}^s) = \frac{1}{|\mathcal{R}^s|} \sum_{r \in \mathcal{R}^s} \min_{r' \in \bar{\mathcal{R}}^s} \sigma(r, r') \;+\; \frac{1}{\sum_{r' \in \bar{\mathcal{R}}^s} \psi(r')} \sum_{r' \in \bar{\mathcal{R}}^s} \psi(r') \min_{r \in \mathcal{R}^s} \sigma(r, r') \qquad (8)$$
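A small NumPy sketch of this weighted Chamfer distance is shown below, with the density weights $\psi$ passed in per predicted direction; the function name and argument layout are illustrative assumptions.

```python
import numpy as np

def weighted_chamfer_deg(pred_dirs, pred_weights, true_dirs):
    """Weighted Chamfer distance of Eq. (8), in degrees.

    pred_dirs:    (P, 3) predicted unit directions (peaks of the KDE).
    pred_weights: (P,)   densities psi(r') associated with each predicted direction.
    true_dirs:    (T, 3) ground-truth unit directions.
    """
    cos = np.clip(true_dirs @ pred_dirs.T, -1.0, 1.0)
    sigma = np.degrees(np.arccos(cos))                  # (T, P) pairwise angular errors

    term_true = sigma.min(axis=1).mean()                # each true direction to its nearest prediction
    term_pred = np.average(sigma.min(axis=0), weights=pred_weights)  # predictions, density-weighted
    return term_true + term_pred
```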
The weighting can be thought of as a mechanism to filter out spurious peaks based on their density without having to choose thresholds. We calculated values of this metric for different window durations, ranging upward from 50 ms. The results, shown in Table 2, demonstrate that, in conjunction with an appropriate clustering scheme, C-SL can also be utilized in multi-source environments.

Conclusion

In this paper, we presented Contrastive Sound Localization (C-SL), a framework for learning acoustic-spatial mappings from unlabeled data collected by microphone arrays of arbitrary geometry. C-SL combines contrastive learning and acoustic-inertial sensor fusion to simultaneously calibrate the array and estimate DOAs in a self-supervised manner. Our evaluations demonstrate that, by leveraging array movements, C-SL can localize sounds in a wide range of conditions with no additional information about the array or the sources. The relaxed data collection, the simplicity and low computational requirements of training the model, and the encouraging results in challenging conditions are advancements offered by C-SL that pave the way toward personalized hearing applications.

References

[1] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization. MIT Press, 1997.
[2] H. Wallach, "The role of head movements and vestibular and visual cues in sound localization," Journal of Experimental Psychology, vol. 27, no. 4, p. 339, 1940.
[3] D. Genzel, U. Firzlaff, L. Wiegrebe, and P. R. MacNeilage, "Dependence of auditory spatial updating on vestibular, proprioceptive, and efference copy signals," Journal of Neurophysiology, vol. 116, no. 2, pp. 765–775, 2016.
[4] B. G. Shinn-Cunningham, N. I. Durlach, and R. M. Held, "Adapting to supernormal auditory localization cues. I. Bias and resolution," The Journal of the Acoustical Society of America, vol. 103, no. 6, pp. 3656–3666, 1998.
[5] J. H. DiBiase, A High-Accuracy, Low-Latency Technique for Talker Localization in Reverberant Environments Using Microphone Arrays. Brown University, Providence, RI, 2000.
[6] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986.
[7] R. Roden, N. Moritz, S. Gerlach, S. Weinzierl, and S. Goetze, "On sound source localization of speech signals using deep neural networks," 2015.
[8] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," pp. 2814–2818, IEEE, 2015.
[9] S. Chakrabarty and E. A. Habets, "Broadband DOA estimation using convolutional neural networks trained with noise signals," pp. 136–140, IEEE, 2017.
[10] Z.-Q. Wang, X. Zhang, and D. Wang, "Robust speaker localization guided by deep learning-based time-frequency masking," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 178–188, 2018.
[11] S. Adavanne, A. Politis, and T. Virtanen, "Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network," pp. 1462–1466, IEEE, 2018.
[12] D. Comminiello, M. Lella, S. Scardapane, and A. Uncini, "Quaternion convolutional neural networks for detection and localization of 3D sound events," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8533–8537, IEEE, 2019.
[13] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[14] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," arXiv preprint arXiv:2002.05709, 2020.
[15] Y. Tian, D. Krishnan, and P. Isola, "Contrastive representation distillation," arXiv preprint arXiv:1910.10699, 2019.
[16] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain, "Time-contrastive networks: Self-supervised learning from video," pp. 1134–1141, IEEE, 2018.
[17] S. Pirk, M. Khansari, Y. Bai, C. Lynch, and P. Sermanet, "Online object representations with contrastive learning," arXiv preprint arXiv:1906.04312, 2019.
[18] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, "The sound of pixels," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586, 2018.
[19] A. Owens and A. A. Efros, "Audio-visual scene analysis with self-supervised multisensory features," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648, 2018.
[20] C. Gan, H. Zhao, P. Chen, D. Cox, and A. Torralba, "Self-supervised moving vehicle tracking with stereo sound," in Proceedings of the IEEE International Conference on Computer Vision, pp. 7053–7062, 2019.
[21] H. Liu, Z. Zhang, Y. Zhu, and S.-C. Zhu, "Self-supervised incremental learning for sound source localization in complex indoor environment," pp. 2599–2605, IEEE, 2019.
[22] S. Kumar, W. Sedley, K. V. Nourski, H. Kawasaki, H. Oya, R. D. Patterson, M. A. Howard III, K. J. Friston, and T. D. Griffiths, "Predictive coding and pitch processing in the auditory cortex," Journal of Cognitive Neuroscience, vol. 23, no. 10, pp. 3084–3094, 2011.
[23] Y. Almalioglu, M. Turan, A. E. Sari, M. R. U. Saputra, P. P. de Gusmão, A. Markham, and N. Trigoni, "SelfVIO: Self-supervised deep monocular visual-inertial odometry and depth estimation," arXiv preprint arXiv:1911.09968, 2019.
[24] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[25] R. Mur-Artal and J. D. Tardós, "Visual-inertial monocular SLAM with map reuse," IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 796–803, 2017.
[26] S. Madgwick, "An efficient orientation filter for inertial and inertial/magnetic sensor arrays," Report x-io and University of Bristol (UK), vol. 25, pp. 113–118, 2010.
[27] J. B. Kuipers et al., Quaternions and Rotation Sequences, vol. 66. Princeton University Press, Princeton, 1999.
[28] K. Wu, V. G. Reju, and A. W. Khong, "Multisource DOA estimation in a reverberant environment using a single acoustic vector sensor," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1848–1859, 2018.
[29] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, pp. 901–909, 2016.
[30] D. Diaz-Guerra, A. Miguel, and J. R. Beltran, "gpuRIR: A Python library for room impulse response simulation with GPU acceleration," arXiv preprint arXiv:1810.11359, 2018.
[31] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
[32] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," 1993.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[34] V. Tourbabin, J. Donley, B. Rafaely, and R. Mehra, "Direction of arrival estimation in highly reverberant environments using soft time-frequency mask," pp. 383–387, IEEE, 2019.

Appendix

Theorem. Given bijective functions $g, f : A \to \mathbb{S}^2$ defined on the non-empty set $A$ and the constraint
$$C: \forall x, y \in A,\ \forall R_x, R_y \in SO(3):\ R_x f(x) = R_y f(y) \iff R_x g(x) = R_y g(y),$$
$C$ holds if and only if $f = \pm g$.

Proof. It is trivial to show that $C$ holds when $f = \pm g$. Now let $\theta(R)$ denote the rotation angle corresponding to the rotation matrix $R$. It can be shown that
$$\forall u, v \in \mathbb{S}^2:\ \langle u, v \rangle = \max_{R \in SO(3):\, Ru = v} \cos(\theta(R)). \qquad (1)$$
Using this, we can show that if $C$ holds,
$$\forall x, y \in A:\ \langle f(x), f(y) \rangle = \langle g(x), g(y) \rangle. \qquad (2)$$
By setting the value of $y$ in (2) to $a_1 = g^{-1}(\hat{i})$, $a_2 = g^{-1}(\hat{j})$, and $a_3 = g^{-1}(\hat{k})$, we get
$$g = P f \qquad (3)$$
where
$$P = [\,f(a_1)\ \ f(a_2)\ \ f(a_3)\,]^T. \qquad (4)$$
Furthermore, since (3) also holds for $a_1$, $a_2$, and $a_3$, it can be shown that
$$P^T P = P P^T = I. \qquad (5)$$
By plugging (3) into $C$ and setting $R = R_y^{-1} R_x$, we get
$$\forall x, y \in A,\ \forall R \in SO(3):\ R f(x) = f(y) \iff R P f(x) = P f(y), \qquad (6)$$
which is equivalent to
$$\forall R \in SO(3):\ R P = P R. \qquad (7)$$
(5) and (7) imply that $P = \pm I$. Therefore $f = \pm g$. ∎