3D Localization of a Sound Source Using Mobile Microphone Arrays Referenced by SLAM
Simon Michaud, Samuel Faucher, François Grondin, Jean-Samuel Lauzon, Mathieu Labbé, Dominic Létourneau, François Ferland, François Michaud
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de recherche du Québec - Nature et technologies (FRQNT) and ACELP-3IT Funds, Université de Sherbrooke. S. Michaud, S. Faucher, J.-S. Lauzon, M. Labbé, F. Grondin, F. Ferland, D. Létourneau and F. Michaud are with the Department of Electrical Engineering and Computer Engineering, Interdisciplinary Institute for Technological Innovation (3IT), 3000 boul. de l'Université, Université de Sherbrooke, Québec (Canada) J1K 0A5, {Simon.Michaud, Samuel.Faucher2, Francois.Grondin2, Jean-Samuel.Lauzon, Mathieu.M.Labbe, Dominic.Letourneau, Francois.Ferland, Francois.Michaud}@USherbrooke.ca

Abstract — A microphone array can provide a mobile robot with the capability of localizing, tracking and separating distant sound sources in 2D, i.e., estimating their relative elevation and azimuth. To combine acoustic data with visual information in real world settings, spatial correlation must be established. The approach explored in this paper consists of having two robots, each equipped with a microphone array, localize themselves in a shared reference map using SLAM. Based on their locations, data from the microphone arrays are used to triangulate in 3D the location of a sound source in relation to the same map. This strategy results in a novel cooperative sound mapping approach using mobile microphone arrays. Trials are conducted using two mobile robots localizing a static or a moving sound source to examine in which conditions this is possible. Results suggest that errors under 0.3 m are observed when the relative angle between the two robots is above 30° for a static sound source, while errors under 0.3 m are observed for angles between 40° and 140° with a moving sound source.

I. INTRODUCTION

Over the last 20 years, there has been a growing interest in developing real-time on-board artificial audition capabilities on robots, with libraries like FlowDesigner [1], HARK [2], ManyEars [3] and ODAS [4]. In the recent five years, products equipped with microphone arrays (MAs) (e.g., Amazon Echo, Apple HomePod, Google Home) opened a booming market, and development kits are now commonly available (e.g., ReSpeaker 6-MA, XMOS xCORE 7-MA, 8SoundsUSB 8-MA [3] and 16SoundsUSB 16-MA). Artificial audition technology aims at providing more natural interaction with connected devices such as mobile robots. Artificial audition on a mobile robot can enrich visual perception of the environment by helping to discover interesting elements in real world settings. For instance, a person talking or an object making a sound can be used to draw the robot's attention to something worth looking at more closely, associating the perceived sound with the image at that same location. This information can then be used to make interesting multimodal associations [5]: face recognition can be used to identify the voice signature of a person, an image tagged to a ring may be designated as a telephone, etc.
Doing so requires associating visual data and audio events in relation to the same reference frame. Visual SLAM (Simultaneous Localization and Mapping) can be used to generate a map of the environment and provide such a reference frame for robots equipped with a MA doing sound source localization (SSL). However, assuming that the sound sources are far from the robots compared to the MA aperture (a condition known as the far field effect), SSL only provides the elevation and azimuth of sound sources [6]. Triangulating data from two or more MAs can be used to evaluate the 3D location of a sound source, as demonstrated in [7] when the locations of static MAs are known. Using mobile MAs would make it possible, considering that the MAs' positions are derived using SLAM and a shared reference map, to evaluate the 3D location of a sound source without having MAs placed in fixed positions.
Fig. 1: Localization of a sound source using mobile MAs

As shown in Fig. 1, this paper presents an approach addressing this research question using two mobile robots and one sound source. Each mobile robot is equipped with a lidar, a RGB-D camera and a MA. RTAB-Map (Real-Time Appearance-Based Mapping) [8], a visual and lidar SLAM library, is used by the robots to localize themselves in a reference map m of the environment. ODAS (Open embeddeD Audition System) [4], a sound source localization, tracking and separation library, is used to provide unit vectors λ_1 ∈ S² and λ_2 ∈ S² pointing in the direction of the sound source for each robot, where S² = {v ∈ R³ : ‖v‖₂ = 1} and ‖·‖₂ stands for the ℓ2-norm. These open source libraries were chosen for convenience and to facilitate reproducibility of the results. The closest intersection point of λ_1 and λ_2 is used to estimate the 3D location l_S^m of the sound source. The objective is to identify the minimal conditions under which 3D triangulation using mobile MAs is possible. The paper is organized as follows. Section II provides an overview of related work to situate our approach in relation to combining SSL and SLAM. Section III presents the approach implemented. Section IV describes the experimental setup, followed by Section V with the observed results.

II. RELATED WORK

Rao-Blackwellized particle filters with Kalman filtering are commonly used for tracking sound sources. For instance, Lin et al. [9] estimate the relative poses of a team of mobile robots, each robot equipped with a pair of microphones and emitting a specially-designed sound to simultaneously provide robot identification and the relative distances and bearing angles in 2D. This acoustic data is combined with odometry, and filtering is used to resolve the heading angle and the back-front ambiguities, implementing what is referred to as cooperative acoustic robot localization [10]. Teams of micro air vehicles (MAVs) equipped with 4-MAs use a similar concept with Extended Kalman Filtering (EKF) to position themselves in relation to a beacon MAV circling around a reference point in space while emitting continuous predefined acoustic chirps [11].

We identify three categories of approaches combining SSL with SLAM. Acoustic SLAM (aSLAM) makes it possible to localize the trajectory of a MA on a mobile robot whilst estimating the acoustic map of surrounding sound sources [12], [13], [14]. aSLAM basically exploits the movement of a MA to constructively triangulate over time the 3D cartesian location of sound sources from bearing-only 2D Direction-of-Arrival (DoA) measurements, estimating the robot trajectory from the apparent displacement of sound sources observed from multiple positions. aSLAM performance is therefore affected by the trajectory followed by the MA in relation to the sound sources. Only validated in simulation, this approach is limited to a single robot with mapping referenced to the robot's initial position, and requires at least two sound sources to work. Similar limitations apply to [15], [16], which adopt a similar approach using Kalman filtering. In the same category, Sasaki et al. [17], [18] derive 2D positions of multiple sound sources using a 32-MA and sound observations over the last 2 sec, and sound source categorization is used to remove undesirable cross points. In more recent work, Sasaki et al.
[19] designed a hand-held unit equipped with a 3D lidar and an Inertial Measurement Unit (IMU) for SLAM, and using HARK with a 32-MA for SSL. Particle filtering from data taken over time (from 7.5 to 15 sec) with the hand-held unit moving (rotation, displacement) provides 3D positions of two sound sources.

An extension to aSLAM is audio-visual SLAM (AV-SLAM), exploiting acoustic and visual features [20] for human tracking. In [20], validation is done using one robot equipped with an 8-MA running HARK and a RGB camera moving in a straight line over 2.5 m in front of a stationary human sound source, providing only 2D localization. Bayram and Ince [21] present audio-visual multi-person 2D tracking by doing sensor fusion of a SSL module with a visual face recognition module. Results are presented using two Kinect cameras and a 7-MA running HARK.

Finally, the concept of audio-based SLAM [6] involves considering the SSL problem as a SLAM problem. For instance, the FastSLAM [22] algorithm is used to estimate the time offset and position of robots equipped with MAs and the position of sound sources [23]. Sekiguchi et al. [6] use FastSLAM with static MA-equipped robots to consider the MAs as one big array. Results using HARK and three static robots with 8-MAs in an anechoic chamber and two moving talkers are provided. Audio-based SLAM has also been used for online calibration of asynchronous MAs [24], [25] and for optimizing the relative positions of multiple mobile robots with MAs for cooperative sound source separation [26].

III. SOUND SOURCE MAPPING USING MOBILE MAS
Our approach aims at using two mobile MAs to provide an instantaneous 3D location of a sound source. In relation to Section II, our approach differs by having mobile MAs, localized using SLAM based on m, triangulate sound source locations in 3D also in relation to m. The concept can be designated as cooperative sound mapping, illustrated by the architecture diagram presented in Fig. 2. Each mobile robot i is equipped with a lidar, a RGB-D camera and onboard odometry to localize its pose l_i^m (a 3D position and a quaternion for rotation) in relation to a reference map m using RTAB-Map. Each robot i is also equipped with a 16-MA and uses ODAS to do sound source localization, tracking and separation. ODAS provides a 3D unit vector λ_i ∈ S² pointing in the direction of the sound source with respect to the robot. Using data from the two mobile robots (i = {1, 2}), the Cooperative Sound Mapping module triangulates the position of the sound source in relation to the reference map.

A. RTAB-Map
RTAB-Map (Real-Time Appearance-Based Mapping) [27] (http://introlab.github.io/rtabmap/) is an open source library implementing graph-based SPLAM (Simultaneous Planning, Localization And Mapping) [28], i.e., the ability to simultaneously map an environment, localize itself in it and plan paths. RTAB-Map provides the robots' positions and orientations, denoted as l_1^m and l_2^m, respectively. RTAB-Map uses a combination of odometry, lidar and camera to robustly create a map and to localize in it. The lidar is used to create the 2D occupancy grid map for obstacle avoidance and path planning. Appearance-based loop closure detection and localization are done with visual features extracted from the RGB image of the RGB-D camera using a bag-of-words approach. By estimating the 3D positions of visual features using the depth image, a position and orientation against the map can be computed. The localization is then refined using the lidar to improve accuracy when the environment lacks visual features but has a lot of geometry (which is often the case indoors).
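To make this concrete, here is a minimal rospy sketch (not part of RTAB-Map or the authors' code) of how the poses l_1^m and l_2^m could be collected on a monitoring computer. The namespaced topic names are assumptions made for illustration; rtabmap_ros typically exposes the localization pose as a geometry_msgs/PoseWithCovarianceStamped message.

import rospy
from geometry_msgs.msg import PoseWithCovarianceStamped

poses = {}  # robot name -> (position, orientation quaternion) in the reference map frame m

def make_handler(robot):
    def handler(msg):
        p = msg.pose.pose.position
        q = msg.pose.pose.orientation
        poses[robot] = ((p.x, p.y, p.z), (q.x, q.y, q.z, q.w))  # l_i^m
    return handler

rospy.init_node("cooperative_sound_mapping")
for robot in ("robot1", "robot2"):   # hypothetical namespaces for the two robots
    rospy.Subscriber("/%s/rtabmap/localization_pose" % robot,
                     PoseWithCovarianceStamped, make_handler(robot))
rospy.spin()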
Fig. 2: Architecture diagram of the cooperative sound mapping approach
B. ODAS
ODAS [4] (http://odas.io) is an open source library performing sound source localization, tracking and separation. ODAS generates the DoA for each of the two robots (i = {1, 2}), denoted as λ_1 and λ_2. This library relies on a localization method called Steered Response Power with Phase Transform based on Hierarchical Search with Directivity model and Automatic calibration (SRP-PHAT-HSDA). The proposed approach decomposes the search space in coarse and fine grids, which speeds up the search in 3D for the DoA of one or many sound sources. Localization generates noisy potential sources, which are then filtered with a tracking method based on a modified 3D Kalman filter (M3K) that generates one or many tracked sources. Sound sources are then filtered and separated using directive geometric source separation (DGSS) to focus the robot's attention only on the target sound source and ignore ambient noise. This library also models microphones as sensors with a directive polar pattern, which improves sound source localization, tracking and separation when the direct path between the microphones and the sound sources is obstructed by the robot's body. In this work, ODAS is configured to return the loudest sound source DoA per robot, denoted as λ_1 and λ_2 for robots i = {1, 2}. Time synchronization between λ_1 and λ_2 is facilitated by using the output of ODAS' tracking module, because the DoAs are smoothed over time.
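As an illustration of this configuration, the short Python sketch below selects the loudest tracked DoA from one ODAS tracking frame. It assumes ODAS streams its tracked sources as JSON objects containing unit-vector components x, y, z and an activity value per source; the exact field names, activity threshold (0.5 here) and message framing depend on the ODAS version and configuration, so this is a sketch rather than the authors' implementation.

import json
import numpy as np

def loudest_doa(frame):
    # frame: dict decoded from one ODAS tracked-source JSON message.
    # Keep only sources considered actively tracked (assumed 'activity' field).
    sources = [s for s in frame.get("src", []) if s.get("activity", 0.0) > 0.5]
    if not sources:
        return None
    best = max(sources, key=lambda s: s["activity"])
    lam = np.array([best["x"], best["y"], best["z"]], dtype=float)
    return lam / np.linalg.norm(lam)  # unit DoA vector lambda_i in the robot's frame

# Example with a hypothetical frame:
frame = json.loads('{"src": [{"id": 1, "x": 0.1, "y": 0.7, "z": 0.7, "activity": 0.9}]}')
print(loudest_doa(frame))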
C. Cooperative Sound Mapping

The Cooperative Sound Mapping module combines the information provided by the two robots to determine the location of the sound source l_S^m ∈ R³. It first rotates the DoA of each robot according to its orientation to derive the vectors λ_1^m and λ_2^m expressed in the reference map frame. In 3D space, λ_1^m and λ_2^m rarely intersect each other. The estimation of l_S^m is therefore derived by finding the point of smallest distance between λ_1^m and λ_2^m, as represented by the dotted line in Fig. 1.
Using the Ray to Ray algorithm [29] as in [7], the sound source position is estimated using (1):

l_S^m = (1/2) (l_1^m + G_1 λ_1^m + l_2^m + G_2 λ_2^m)   (1)

where the expressions G_1 and G_2 are given as follows:

G_1 = [(λ_1^m · λ_2^m)(λ_2^m · (l_1^m − l_2^m)) − λ_1^m · (l_1^m − l_2^m)] / [1 − (λ_1^m · λ_2^m)²]   (2)

G_2 = [(λ_1^m · λ_2^m)(λ_1^m · (l_2^m − l_1^m)) − λ_2^m · (l_2^m − l_1^m)] / [1 − (λ_1^m · λ_2^m)²]   (3)

and (·) stands for the dot product.
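For clarity, here is a minimal numpy sketch of this computation under the notation above (an illustrative sketch, not the authors' implementation): it rotates each DoA into the map frame using the robot's orientation quaternion, then applies (1)-(3), returning None when the rays are nearly parallel since the denominator of (2) and (3) approaches zero.

import numpy as np

def quat_rotate(q, v):
    # Rotate vector v by unit quaternion q = (x, y, z, w).
    x, y, z, w = q
    u = np.array([x, y, z], dtype=float)
    return 2.0 * np.dot(u, v) * u + (w * w - np.dot(u, u)) * v + 2.0 * w * np.cross(u, v)

def triangulate(l1, q1, lam1, l2, q2, lam2, eps=1e-6):
    # l_i: robot positions in the map frame; q_i: orientation quaternions;
    # lam_i: unit DoA vectors expressed in each robot's frame.
    lam1_m = quat_rotate(q1, np.asarray(lam1, dtype=float))
    lam2_m = quat_rotate(q2, np.asarray(lam2, dtype=float))
    l1, l2 = np.asarray(l1, dtype=float), np.asarray(l2, dtype=float)
    d = l1 - l2
    b = np.dot(lam1_m, lam2_m)
    denom = 1.0 - b * b
    if denom < eps:          # rays nearly parallel (small theta): estimate unreliable
        return None
    g1 = (b * np.dot(lam2_m, d) - np.dot(lam1_m, d)) / denom      # Eq. (2)
    g2 = (b * np.dot(lam1_m, -d) - np.dot(lam2_m, -d)) / denom    # Eq. (3)
    return 0.5 * (l1 + g1 * lam1_m + l2 + g2 * lam2_m)            # Eq. (1), l_S^m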
IV. EXPERIMENTAL METHODOLOGY

Two SecurBot mobile robots (https://github.com/introlab/securbot) are used: one equipped with a Jetson Nano core mounted on a TurtleBot2 base, and the other with a Jetson TX2 installed on a modified Pioneer2-DX base. Both robots are equipped with a 16SoundsUSB MA (https://github.com/introlab/16SoundsUSB), an Intel RealSense D435 camera and a RP Lidar. Each MA is located 0.48 m above the ground and provides synchronous acquisition of the microphone signals through USB to the robot's computer. Figure 3 and Table I present the MA configuration of each robot.

TABLE I: Position of the microphones on the MA (cm)
Dimensions    Δx1  Δx2  Δy1  Δy2  Δz1  Δz2
Pioneer2-DX    .    .    .    .    .    .
TurtleBot2     .    .    .    .    .    .

The experiments are conducted in a 150 m² room filled with different objects to provide visual features for SLAM. The reference map is created using RTAB-Map with one of the robots. The room has a reverberation level of RT60 = 600 msec and no background noise. ODAS is configured with similar parameters as those in [4], except for the covariance component σ_R of the observation noise matrix in the Kalman filter of ODAS' tracking module, which is increased for more sensitivity to sound source acceleration, which influences λ_i.
Fig. 3: MA configurations (3D view, top view and side view)

Trials are conducted with the sound source either kept at a static location at 1.12 m of height or manually moved horizontally and vertically, with its location monitored using a Vicon motion system. The sound source is a loudspeaker generating white noise, with a perceived amplitude ranging from 25 to 30 dB over 3 m. The robots move around the sound source by following preset trajectories defined in relation to the reference map. Figure 4 shows the trajectories followed by the robots using ROS's navigation stack [30]. These trajectories have the robots move from 0 to 2 m/sec and are set to avoid collisions between the two robots and to cover a variety of DoA configurations in relation to the sound source. A remote laptop computer, also running ROS, monitors the trials and records the localization and audio data. RViz is used to display the position of the robots, the map, the DoAs, and the estimated and known locations of the sound source. RViz also displays the Root Square Error (RSE) in meters between l_S^m and the actual sound source location with colored dots ranging from green to black from 0 to 0.5 m, and to red as it increases.

Fig. 4: Experimental conditions with the two SecurBot robots: (a) static sound source; (b) moving sound source
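The preset trajectories are executed through the ROS navigation stack mentioned above. As a hedged illustration (the node name and waypoint values below are placeholders, not the experimental trajectories), a single goal expressed in the reference map frame could be sent as follows:

import actionlib
import rospy
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

rospy.init_node("preset_trajectory")
client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
client.wait_for_server()

goal = MoveBaseGoal()
goal.target_pose.header.frame_id = "map"          # the shared reference map m
goal.target_pose.header.stamp = rospy.Time.now()
goal.target_pose.pose.position.x = 1.0            # placeholder waypoint (m)
goal.target_pose.pose.position.y = 2.0
goal.target_pose.pose.orientation.w = 1.0         # facing along x
client.send_goal(goal)
client.wait_for_result()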
V. RESULTS
RSE is examined in relation to d_1^m and d_2^m, the distances between the robots and the sound source, and the angle θ as defined by Fig. 5. θ is the angle between the lines from the theoretical location of the sound source to l_1^m and l_2^m in 3D, as shown by Fig. 1.
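As a minimal sketch (these helper functions are illustrative, not taken from the paper), the two evaluation quantities can be computed from logged positions as follows:

import numpy as np

def rse(l_est, l_true):
    # Root Square Error: Euclidean distance (m) between the estimated and
    # ground-truth sound source locations.
    diff = np.asarray(l_est, dtype=float) - np.asarray(l_true, dtype=float)
    return float(np.linalg.norm(diff))

def theta_deg(l_source, l_robot1, l_robot2):
    # Angle (degrees) at the sound source between the lines going to the two robots.
    u = np.asarray(l_robot1, dtype=float) - np.asarray(l_source, dtype=float)
    v = np.asarray(l_robot2, dtype=float) - np.asarray(l_source, dtype=float)
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))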
Fig. 5: Definition of d_1^m, d_2^m and θ

Fig. 6: Trial with a static sound source: (a) d_1^m, d_2^m and RSE over time; (b) θ and RSE over time; (c) RSE in relation to θ

Fig. 7: Trial with a moving sound source: (a) d_1^m, d_2^m and RSE over time; (b) θ and RSE over time; (c) RSE in relation to θ

Figure 6 and Fig. 7 summarize the observations made during a trial with a static sound source and a moving sound source, respectively. For the static sound source, data is recorded at 10 Hz. Fig. 6a to Fig. 6c illustrate that RSE remains lower than approximately 0.3 m with negligible variance for θ higher than approximately 30°, regardless of the distances between the robots and the sound source, as illustrated by RViz in Fig. 8. These results are confirmed by Fig. 6c, which represents, over intervals of 15°, the average RSE and its standard deviation. For smaller θ (which occurs during the first and the last ~
18 sec of the static sound source trial), λ_1^m and λ_2^m become nearly parallel, and small changes in θ lead to larger errors. When they are parallel, the denominator of (2) and (3) results in a division by zero. So when θ is small, the closest intersection point found between λ_1^m and λ_2^m changes quickly as the denominator of (2) and (3) comes closer to zero. Figure 9 illustrates this situation. Alternatives to limit such occurrences would be to use a third robot, to actively reposition the robots to keep θ higher than 30°, or to use robots with MAs at different heights.

Fig. 9: Illustration in RViz of a case with θ small

The static sound source trial limits the possible range for θ because the sound source is located higher than the MAs. The moving sound source trial makes it possible to have θ change from 0 to 180°. Fig. 7a to Fig. 7b present the corresponding data (recorded at 100 Hz to synchronize with the Vicon system). Between 0 and 30 sec, the robots are immobile and near each other, and the sound source is moving. Because θ is small (9°), large errors are observed, as explained for the static condition. From 30 to 61 sec, the robots are moving away from each other, with one coming closer to the sound source, which is static: as θ increases, RSE decreases. The peak at 55 sec is caused by σ_R, overshooting l_S^m under the influence of the sound source acceleration as the robots are near each other and the sound source starts moving. From 61 to 80 sec, large peaks occur as θ is small (i.e., λ_1^m and λ_2^m are almost parallel) and the robots' motion makes θ change quickly. From 80 to 155 sec, θ is sufficient to have small RSE, except when θ is near 180° (111 sec) and at 120 sec when there was an obstacle between one robot and the sound source. For the remaining time of the trial, θ rapidly decreases toward 0, creating RSE peaks and larger errors. Figure 7c suggests that for θ between ~40° and ~140°, the RSE is lower than 0.3 m.

Fig. 8: Illustration in RViz of a case with θ > 30°

VI. CONCLUSION

This paper validates the concept of cooperative sound mapping by demonstrating, using RTAB-Map and ODAS, that it is possible to derive the 3D location of a sound source using mobile MAs. Results show that the capability of approximating the location of the sound source from the closest intersection point found between λ_1^m and λ_2^m is influenced by θ and by the sensitivity of sound source tracking. This could be filtered out using a Kalman filter, as we observed in our preliminary trials. As the experiments presented in the paper involve the ideal case of only having one constant sound source, the next steps in our work involve extending the approach to use two and more MAs for online simultaneous localization of multiple intermittent sound sources in noisy and reverberant conditions, and coordinating the positioning of the mobile robots to provide reliable 3D location measurements according to their positions in relation to the sound sources. We believe that directly using the output of sound source tracking instead of SSL will simplify the overall complexity for cooperative sound mapping, targeting onboard centralized or distributed processing.
REFERENCES

[1] D. Létourneau, J.-M. Valin, C. Côté, and F. Michaud, "FlowDesigner: The free data-flow oriented development environment," Software 2.0, vol. 3, 2005.
[2] K. Nakadai, T. Takahashi, H. Okuno, H. Nakajima, Y. Hasegawa, and H. Tsujino, "Design and implementation of robot audition system 'HARK' – Open source software for listening to three simultaneous speakers," Advanced Robotics, vol. 24, no. 5-6, pp. 739–761, 2010.
[3] F. Grondin, D. Létourneau, F. Ferland, V. Rousseau, and F. Michaud, "The ManyEars open framework," Autonomous Robots, vol. 34, no. 3, pp. 217–232, 2013.
[4] F. Grondin and F. Michaud, "Lightweight and optimized sound source localization and tracking methods for opened and closed microphone array configurations," Robotics & Autonomous Systems, vol. 113, pp. 63–80, 2019.
[5] J. Sinapov, C. Schenck, and A. Stoytchev, "Learning relational object categories using behavioral exploration and multimodal perception," in IEEE Int. Conf. Robotics and Automation, May 2014, pp. 5691–5698.
[6] K. Sekiguchi, Y. Bando, K. Nakamura, K. Nakadai, K. Itoyama, and K. Yoshii, "Online simultaneous localization and mapping of multiple sound sources and asynchronous microphone arrays," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2016, pp. 1973–1979.
[7] J.-S. Lauzon, F. Grondin, D. Létourneau, A. L. Desbiens, and F. Michaud, "Localization of RW-UAVs using particle filtering over distributed microphone arrays," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2017, pp. 2479–2484.
[8] M. Labbé and F. Michaud, "RTAB-Map as an open-source lidar and visual SLAM library for large-scale and long-term online operation," Journal of Field Robotics, vol. 36, no. 2, pp. 416–446, 2018.
[9] Y. Lin, P. Vernaza, J. Ham, and D. D. Lee, "Cooperative relative robot localization with audible acoustic sensing," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2005, pp. 3764–3769.
[10] C. Drioli, G. Giordano, D. Salvati, F. Blanchini, and G. Foresti, "Acoustic target tracking through a cluster of mobile agents," IEEE Trans. Cybernetics, pp. 1–14, 2019.
[11] M. Basiri, F. Schill, D. Floreano, and P. U. Lima, "Audio-based localization for swarms of micro air vehicles," in IEEE Int. Conf. Robotics and Automation, May 2014, pp. 4729–4734.
[12] C. Evers and P. A. Naylor, "Acoustic SLAM," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1484–1498, Sep. 2018.
[13] C. Evers, Y. Dorfan, S. Gannot, and P. A. Naylor, "Source tracking using moving microphone arrays for robot audition," in IEEE Int. Conf. Acoustics, Speech and Signal Processing, March 2017, pp. 6145–6149.
[14] C. Evers, A. Moore, and P. Naylor, "Localization of moving microphone arrays from moving sound sources for robot audition," in European Signal Processing Conf., 2016, pp. 1008–1012.
[15] C. Schymura and D. Kolossa, "Potential-field-based active exploration for acoustic simultaneous localization and mapping," in IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2018, pp. 76–80.
[16] Q. V. Nguyen, F. Colas, E. Vincent, and F. Charpillet, "Localizing an intermittent and moving sound source using a mobile robot," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2016, pp. 1986–1991.
[17] Y. Sasaki, S. Thompson, M. Kaneyoshi, and S. Kagami, "Map-generation and identification of multiple sound sources from robot in motion," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Oct 2010, pp. 437–443.
[18] Y. Sasaki, S. Kagami, and H. Mizoguchi, "Multiple sound source mapping for a mobile robot by self-motion triangulation," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Oct 2006, pp. 380–385.
[19] Y. Sasaki, R. Tanabe, and H. Takemura, "Probabilistic 3D sound source mapping using moving microphone array," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Oct 2016, pp. 1293–1298.
[20] A. Chau, K. Sekiguchi, A. Nugraha, K. Yoshii, and K. Funakoshi, "Audio-visual SLAM towards human tracking and human-robot interaction in indoor environments," in IEEE Int. Conf. Robot and Human Interactive Communication, 2019, pp. 1–8.
[21] B. Bayram and G. Ince, "Audio-visual multi-person tracking for active robot perception," in IEEE/SICE Int. Symp. System Integration, Dec 2015, pp. 575–580.
[22] S. Thrun, M. Montemerlo, D. Koller, B. Wegbreit, J. Nieto, and E. Nebot, "FastSLAM: An efficient solution to the simultaneous localization and mapping problem with unknown data association," J. Machine Learning Research, vol. 4, no. 3, pp. 380–407, 2004.
[23] J. Hu, C. Chan, C. Wang, and C. Wang, "Simultaneous localization of mobile robot and multiple sound sources using microphone array," in IEEE Int. Conf. Robotics and Automation, May 2009, pp. 29–34.
[24] H. Miura, T. Yoshida, K. Nakamura, and K. Nakadai, "SLAM-based online calibration of asynchronous microphone array for robot audition," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Sep. 2011, pp. 524–529.
[25] D. Su, T. Vidal-Calleja, and J. V. Miro, "Simultaneous asynchronous microphone array calibration and sound source localisation," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2015, pp. 5561–5567.
[26] K. Sekiguchi, Y. Bando, K. Itoyama, and K. Yoshii, "Optimizing the layout of multiple mobile robots for cooperative sound source separation," in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Sep. 2015, pp. 5548–5554.
[27] M. Labbé and F. Michaud, "Long-term online multi-session graph-based SPLAM with memory management," Autonomous Robots, pp. 1–18, 2017.
[28] C. Stachniss, Robotic Mapping and Exploration. Springer Science & Business Media, 2009, vol. 55.
[29] P. Schneider and D. H. Eberly, Geometric Tools for Computer Graphics. Morgan Kaufmann, 2002.
[30] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: An open-source Robot Operating System," in