Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain
Julio Wissing*, Benedikt Boenninghoff*, Dorothea Kolossa*, Tsubasa Ochiai†, Marc Delcroix†, Keisuke Kinoshita†, Tomohiro Nakatani†, Shoko Araki†, Christopher Schymura*

* Institute of Communication Acoustics, Ruhr University Bochum, Bochum, Germany
† NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
ABSTRACT
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the acoustic and the visual modality may be corrupted in specific spatial regions, for instance due to poor lighting conditions or to the presence of background noise. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions in the localization space. This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
Index Terms — audiovisual speaker localization, data fusion, dynamic stream weights
1. INTRODUCTION
Data fusion is a process aimed at combining several data sources to accomplish a task more reliably than would be possible with just one individual sensor. This has several useful applications, ranging from medical diagnostics [1] to robotics [2] and smart buildings [3]. Tracking one or more speakers by means of acoustic and visual sensors also relies on an appropriate fusion of sensory inputs. This process is performed unconsciously in human listening, since humans naturally combine visual and acoustic stimuli in estimating the position of a person speaking or a sound source emitting noise [4].

Recent works in the domain of audiovisual speaker localization showed promising results utilizing different data fusion strategies. For example, [5, 6] introduce a variational Bayesian approximation to optimally merge acoustic and visual data for combined localization and tracking. A related approach introduced in [7] uses an expectation maximization (EM) algorithm for weighted clustering in the audiovisual observation space. Other probabilistic techniques for audiovisual speaker tracking based on particle filters were proposed in [8, 9]. Additionally, the work in [10] combined audiovisual speaker localization with speech separation on a robotic platform in a real-time processing system. A similarly designed audiovisual fusion system for smart rooms [11] allows for joint localization and identification of persons present in a room.

A classical approach to combining different sensors is the Kalman filter (KF) [13], where inference in a linear dynamical system (LDS) is carried out to estimate all latent variables of interest, such as positions, velocities or accelerations, based on noisy measurement data. Extending this concept to dynamic stream weights (DSWs) allows us to weight the contributions of each input based on their instantaneous reliability. This concept was originally proposed in the context of audiovisual automatic speech recognition (ASR) [14], and has proven valuable for speaker identification [15], but was also recently adopted for speaker localization and tracking [16, 17].

All of these approaches specify DSWs as time-dependent weighting factors for the acoustic and visual modalities. In this study, we introduce a fundamental update: since the reliability of the acoustic and visual streams may vary over both time and location, we extend the idea of DSWs to the spatial dimension, as depicted in Fig. 1. To this end, we introduce a spatial weighting matrix to replace the scalar DSWs used in previous works. Compared to previously proposed approaches, spatial DSWs can effectively capture specific position-dependent sensor reliability properties, e.g., bright lighting conditions coming from a certain direction or directional acoustic noise sources. Whereas scalar DSWs can only provide a global estimate of sensor reliability, the model proposed here provides a more fine-grained, location-dependent framework for data fusion.

Fig. 1. Frame from the AVDIAR dataset [12], overlayed with speaker positions predicted by the proposed fusion model. Note that the model utilizes a discrete grid to represent the localization space, where each grid cell is associated with a speaker presence probability, as described in Sec. 2.3.
2. SYSTEM DESCRIPTION
The proposed data fusion network consists of two building blocks: the individual unimodal tracking networks and the DSW-based data fusion network, as depicted in Fig. 2.

Fig. 2. High-level overview of the proposed data fusion framework. The acoustic and visual inputs $\mathbf{X}_{\mathrm{A},t}$ and $\mathbf{X}_{\mathrm{V},t}$ are first processed via unimodal trackers and then combined in a dedicated fusion network, producing a grid of speaker presence probabilities $\mathbf{Y}_t$ as output.

2.1. Audio tracking

We adopt the unimodal audio tracker from the DOAnet architecture [18]. Here, a multi-channel short-time Fourier transform (STFT) is performed on the individual microphone signals. Time-frequency analysis is conducted on the 6-channel acoustic inputs using a window length of 1920 samples with no overlap and a fast Fourier transform (FFT) size of 2048. This results in a frame length of 40 ms, matching the length of one video frame in the utilized dataset. A context of one frame in both temporal directions around the $t$-th time step is used, which results in a 3-dimensional acoustic input tensor $\mathbf{X}_{\mathrm{A},t}$ of dimensions 3 × 1024 × 12, representing the time, frequency and channel dimensions. The third dimension is spanned by the stacked amplitude and phase of all 6 audio channels.

Similar to the DOAnet architecture, the acoustic input tensor is first fed through 4 convolutional stages. Each stage is composed of a 2-dimensional convolution, batch normalization, a rectified linear unit (ReLU) nonlinearity, max-pooling, and a dropout layer. Subsequently, the tensor is flattened and input to a bidirectional gated recurrent unit (GRU) with two layers. The output of these recurrent layers is then fed to a classification network composed of three fully connected layers with ReLU activations. Lastly, the output layer is constructed using another fully connected layer with sigmoid activations. The outputs are reshaped to a grid with 20 × 24 cells, as depicted in Fig. 1. The grid size was chosen to roughly represent the size of a face in the utilized dataset. Using a binary indicator variable $\zeta_{i,j}$ corresponding to the $i$-th row and $j$-th column of the grid, the probability that a speaker is located at this particular grid point, given only the current acoustic observation, can be expressed as $z_{\mathrm{A},t}^{(i,j)} = p(\zeta_{i,j} \mid \mathbf{X}_{\mathrm{A},t})$.

2.2. Video tracking

Video tracking is conducted using the you only look once (YOLO) framework proposed in [19], which achieves a high detection rate in combination with real-time tracking capabilities. The YOLOFace implementation is used, which is based on YOLOv3 [20]. Before an input video frame $\mathbf{X}_{\mathrm{V},t}$ is processed by the video tracker, it is first resized to meet the input size requirements of YOLOv3. We adapt the bounding box output to the 20 × 24 grid size. The grid size can be set to arbitrary values if one is willing to retrain the network. We evaluated the localization performance using several grid sizes but did not find any significant differences, so we set the size of the individual grid cells to roughly match the size of a human face in the dataset. The video tracker's output probability is denoted as $z_{\mathrm{V},t}^{(i,j)} = p(\zeta_{i,j} \mid \mathbf{X}_{\mathrm{V},t})$.
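To make both front ends concrete, the following sketch illustrates how the acoustic input tensor could be assembled and how YOLO detections could be mapped onto the 20 × 24 grid. It is our own illustration, not taken from the paper's published implementation: all function and variable names are ours, and the box-to-cell assignment rule (marking the cell containing the box center) is a simplifying assumption.

```python
import numpy as np

def acoustic_input_tensor(audio, t, n_win=1920, n_fft=2048, n_bins=1024):
    """Assemble the 3 x 1024 x 12 tensor X_A,t for time step t.

    audio: array of shape (n_samples, 6), the 6-channel signal at 48 kHz.
    Non-overlapping 1920-sample frames give 40 ms per frame, matching one
    video frame at 25 fps. Assumes 1 <= t <= n_frames - 2 so that one
    frame of context exists in both temporal directions.
    """
    frames = []
    for k in (t - 1, t, t + 1):  # one frame of temporal context on each side
        segment = audio[k * n_win:(k + 1) * n_win]             # (1920, 6)
        spec = np.fft.rfft(segment, n=n_fft, axis=0)[:n_bins]  # (1024, 6)
        # Stack amplitude and phase of all 6 channels -> 12 feature maps.
        frames.append(np.concatenate([np.abs(spec), np.angle(spec)], axis=1))
    return np.stack(frames, axis=0)  # (3, 1024, 12): time, frequency, channel

def boxes_to_grid(boxes, scores, img_w, img_h, rows=20, cols=24):
    """Map YOLO face boxes to a (rows, cols) speaker presence grid.

    boxes:  list of (x0, y0, x1, y1) pixel coordinates.
    scores: detection confidences (the paper describes a binary output,
            so these may simply be 1.0 for every detection).
    """
    grid = np.zeros((rows, cols))
    for (x0, y0, x1, y1), c in zip(boxes, scores):
        i = min(int((y0 + y1) / 2 / img_h * rows), rows - 1)
        j = min(int((x0 + x1) / 2 / img_w * cols), cols - 1)
        grid[i, j] = max(grid[i, j], c)  # cell containing the box center
    return grid
```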
Fig. 3. Architecture of the acoustic and visual DSW estimation networks, taking as input the observations $\mathbf{Z}_{\mathrm{A},t}$ and $\mathbf{Z}_{\mathrm{V},t}$ obtained from the acoustic and visual trackers and producing as output the corresponding DSWs $\mathbf{\Lambda}_{\mathrm{A},t}$ and $\mathbf{\Lambda}_{\mathrm{V},t}$ covering the spatial grid. The depicted stack comprises two convolutions with 64 filters (3 × 3, ReLU, same padding), two convolutions with 32 filters (3 × 3, ReLU, same padding), and two further convolutional layers.
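A rough PyTorch rendering of this stack is given below. It is a sketch under assumptions: the parameters of the final two convolutions are not specified above, so the output layer below (a single-channel convolution with a sigmoid that keeps the weights in [0, 1], as required by Eq. (2) further down) is our guess, as is the single-channel input.

```python
import torch.nn as nn

class DSWEstimator(nn.Module):
    """Sketch of the Fig. 3 DSW estimation network: a small fully
    convolutional stack mapping a tracker output Z_m,t (1 x 20 x 24)
    to a spatial weight grid Lambda_m,t of the same size."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            # Output layer is an assumption: sigmoid keeps weights in [0, 1].
            nn.Conv2d(32, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):      # z: (batch, 1, 20, 24) tracker probabilities
        return self.net(z)     # Lambda: (batch, 1, 20, 24) stream weights
```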
2.3. Audiovisual data fusion

At each time $t$, a conventional data fusion strategy would handle all acoustic and visual tracker outputs equally. In the log-domain, this flat fusion strategy can be expressed as

$$y_t^{(i,j)} = \log z_{\mathrm{A},t}^{(i,j)} + \log z_{\mathrm{V},t}^{(i,j)}, \qquad (1)$$

where $y_t^{(i,j)}$ represents the log-likelihood of a speaker being present in grid cell $(i, j)$ at time step $t$, taking into account both acoustic and visual observations. Note that the derivation of Eq. (1) is based on assuming statistical independence between acoustic and visual observations, as well as imposing an uninformative prior on $\zeta_{i,j}\,\forall\, i, j$, cf. [16]. However, when working with audiovisual input data, modalities may not be equally informative in all spatial regions due to, e.g., challenging lighting conditions or a directional noise source. In these cases, it should be possible to weight each source individually to lessen or increase the impact of an input. This process must be dynamic, in the sense that the network should be able to adapt over space and time. A weighted fusion approach

$$y_t^{(i,j)} = \lambda_{\mathrm{A},t}^{(i,j)} \log z_{\mathrm{A},t}^{(i,j)} + \lambda_{\mathrm{V},t}^{(i,j)} \log z_{\mathrm{V},t}^{(i,j)} \qquad (2)$$

is proposed here, where $\lambda_{\mathrm{A},t}^{(i,j)}, \lambda_{\mathrm{V},t}^{(i,j)} \in [0, 1]$ denote the location- and time-dependent DSWs for the acoustic and visual modalities, respectively. In this manner, we are proposing an extension of the DSW idea, originally introduced to handle only time-variant sensor reliability, to the spatial domain, allowing us to learn how different locations are covered more or less reliably by the available modalities. All relevant variables can be expressed in matrix notation as

$$\mathbf{Y}_t = \left[ y_t^{(i,j)} \right]_{i,j}, \quad \mathbf{Z}_{m,t} = \left[ z_{m,t}^{(i,j)} \right]_{i,j}, \quad \mathbf{\Lambda}_{m,t} = \left[ \lambda_{m,t}^{(i,j)} \right]_{i,j},$$

with $m \in \{\mathrm{A}, \mathrm{V}\}$, each covering the entire 20 × 24 grid. The actual data fusion is performed conjointly with estimating the DSWs in Eq. (2) and the subsequent estimation of the speaker positions.

The individual acoustic and visual tracking networks discussed in Secs. 2.1 and 2.2 serve as inputs to the fusion neural network. The architecture of the corresponding DSW estimation networks is depicted in Fig. 3. Their output is subsequently combined with the outputs from both unimodal trackers according to Eq. (2). To further refine the resulting speaker presence probability grid, it is fed into a dedicated refinement network based on the image restoration architecture proposed in [21]. The parameterization of the model [21] is adapted to the specific requirements of this work. In particular, the input and output size is set to 20 × 24, matching the size of the probability grid, and each convolutional layer consists of 128 filters with padding. The refinement operation is valuable to remove outliers and noisy estimates from the output grid, which should only contain peaks in grid cells close to the true speaker positions.
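The fusion rules themselves reduce to a few lines. The following minimal numpy sketch (function names ours; the small epsilon floor before the logarithm is our numerical safeguard and is not specified in the text) implements Eq. (1), Eq. (2), and the spatially invariant scalar special case that reappears in Sec. 4.2:

```python
import numpy as np

EPS = 1e-12  # numerical floor before the log; our implementation detail

def flat_fusion(z_a, z_v):
    """Eq. (1): unweighted log-domain fusion of two (20, 24) probability grids."""
    return np.log(z_a + EPS) + np.log(z_v + EPS)

def weighted_fusion(z_a, z_v, lam_a, lam_v):
    """Eq. (2): spatial DSW fusion; lam_a, lam_v in [0, 1], shape (20, 24)."""
    return lam_a * np.log(z_a + EPS) + lam_v * np.log(z_v + EPS)

def invariant_fusion(z_a, z_v, lam_t):
    """Spatially invariant special case of Sec. 4.2: a single scalar lambda_t."""
    return lam_t * np.log(z_a + EPS) + (1.0 - lam_t) * np.log(z_v + EPS)
```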
3. EXPERIMENTAL SETUP
In this section, we define the experimental settings and metrics.

3.1. Dataset

The AVDIAR [12] dataset is employed to evaluate the tracking capabilities of the proposed data fusion strategy. It contains 27 audiovisual sequences, which were recorded in three different rooms, with durations between 10 seconds and 3 minutes. The overall duration of the dataset is about 27 minutes. The recordings were conducted using 6 microphones attached to a dummy head and two cameras located right below the head. The microphone signals were acquired using a sampling rate of 48 kHz, and the cameras record at 25 frames per second. In this work, only the video signal from the left camera is used. The dataset was partitioned according to a training, validation and test split. The splitting procedure ensured that all subsets included audiovisual sequences with one, two, three and four simultaneously visible speakers.

3.2. Training

All trainable weights were initialized using the Glorot initialization scheme [22]. In our experiments, the Adam optimizer [23] with a batch size of 64, a learning rate of 0.001 and early stopping yielded the best results. The fusion networks were trained for at most 20 epochs, with early stopping triggering in the first five epochs at a patience of 4. The audio tracking networks were trained for at most 100 epochs, with early stopping triggering after 30 epochs at a patience of 20.
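A minimal sketch of this training regime follows, assuming a binary cross-entropy loss on the sigmoid output grid and validation loss as the early-stopping criterion (both are our assumptions; the text does not name the loss or the monitored quantity):

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, validate, max_epochs=20, patience=4):
    """Adam with the stated learning rate plus early stopping.

    The batch size of 64 is set in train_loader; use max_epochs=100
    and patience=20 for the audio tracking networks.
    """
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        for x, y in train_loader:
            opt.zero_grad()
            F.binary_cross_entropy(model(x), y).backward()  # assumed loss
            opt.step()
        val_loss = validate(model)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping triggered
    return model
```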
3.3. Metrics

Localization error (LE) and frame recall (FR), as proposed in [18, 24], were utilized as evaluation metrics in this work. The LE metric was slightly modified to cope with the discrete grid structure used here: the Euclidean distance between a ground-truth speaker position and the center point of a grid cell where a speaker was detected (both represented as pixel coordinates) was utilized. This distance was computed for all possible combinations of detected speakers and ground-truth speaker positions at each time frame. If fewer speakers were predicted than are actually present, the prediction vector was filled with the next-highest activation, to have at least one prediction per target. To calculate the localization error for an estimated speaker position given the corresponding ground truth, a cost matrix was optimized using the Hungarian algorithm [25]. The resulting distances were summed up and divided by the total number of speakers present at the particular time frame, yielding the average distance per speaker. This process is repeated and averaged over the complete test set to obtain the final LE metric.

The FR represents the ability of the network to predict the correct number of speakers in a specific frame. If the estimated number of speakers matches the ground truth in that frame, the metric is set to one, otherwise it is zero. In the proposed framework, the number of speakers is set to the number of posterior grid probabilities surpassing the specified detection threshold, which has to be set empirically or via, e.g., a grid-search approach. Potentially, it might be included in an end-to-end optimization pipeline. Therefore, in Sec. 4.1, different threshold values are evaluated for this purpose.

The FR cannot differentiate between missed and excessive numbers of speakers. Therefore, we additionally count the average number of false positives (FPs) per frame. This metric can be used to better understand and analyze causes of any detection issues. A low FR in combination with a high FP count suggests that the network is not performing well due to the threshold being too low. A low FR in combination with a low FP count, on the other hand, suggests that either the threshold is too high, pruning too many activations, or that the detector is not working properly.

A Python implementation of all algorithms described in this paper is publicly available at https://github.com/rub-ksv/spatial-stream-weights

Table 1. Results of the baseline trackers. For audio-only tracking, the results for a threshold of 0.75 are given. For audiovisual tracking, we show the results of spatially flat fusion at a threshold of 0.75.

                          LE [px]   FR     FP
Audio-only (Sec. 2.1)     11.93     0.24   1.07
Video-only (Sec. 2.2)     19.95     0.83   0.006
Flat fusion (Eq. 1)        7.65     0.82   0.003
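These per-frame computations can be made concrete with scipy's Hungarian solver. The sketch below is our own (the rule of padding missing predictions with the next-highest activation is omitted for brevity); it computes the LE for a single frame as well as the FR and FP bookkeeping:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def localization_error(pred_px, gt_px):
    """Average Euclidean distance (pixels) per speaker in one frame.

    pred_px: (P, 2) grid-cell centers of detected speakers.
    gt_px:   (G, 2) ground-truth speaker positions.
    """
    cost = np.linalg.norm(pred_px[:, None, :] - gt_px[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm [25]
    return cost[rows, cols].sum() / len(gt_px)

def frame_recall_and_fp(probs, n_true, threshold):
    """FR is 1 iff the number of grid cells above the detection threshold
    matches the ground truth; FP counts the surplus detections."""
    n_pred = int((probs > threshold).sum())
    return int(n_pred == n_true), max(n_pred - n_true, 0)
```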
4. RESULTS AND DISCUSSION
In the following, we present and analyze the results of our baseline systems and the proposed fusion network architecture.
4.1. Baseline systems

The results of our two single-modality baselines, i.e., audio-only and video-only tracking systems, are provided in the first two rows of Tab. 1. As expected, the video-only baseline is generally superior in good visual conditions, while the audio-only system outperforms video localization when faces are occluded or outside the field of view.

With the single-modality audio tracker and a threshold of 0.75, an LE of 11.93 pixels was achieved, while the FR was quite low at 23%. On average, a frame contained 1.07 false positives. Fig. 4 (a) shows the performance of the audio tracker as a function of the threshold. The LE shows that the audio tracker predicts speakers with an error of roughly half a grid point. However, the low FR indicates that many false positives are included in the tracking result. Even with a threshold of 0.75, on average, one false positive is included in every frame.

In comparison to the audio-only system, the video tracker in the second row of Tab. 1 unsurprisingly achieves a significantly higher FR of 83.45%, due to its access to a highly informative modality, in line with its good benchmark results in other works [20, 26]. Note that no threshold is needed for the video tracker, as its output is binary. Additionally, the number of false positives per frame is lower, at 0.006, and the LE is still below the size of one grid point, with 19.95 pixels distance on average. The face tracking capabilities lead to a good result, with YOLO detecting the right number of faces in most cases. This is also reflected by the number of false positives, which is close to zero.

Finally, the third row shows the results of our naive audiovisual baseline system with flat fusion, implemented according to Eq. (1). As can be seen, the performance already improves, giving a lower localization error and false positive rate than the best unimodal system, at a virtually identical frame recall to the better-performing video setup.

Fig. 4. Benchmark results plotted with respect to the threshold: (a) audio baseline, (b) spatially invariant DSWs, (c) weighted fusion. The dashed lines represent the best unimodal baseline results.
4.2. Spatially invariant dynamic stream weights

Fig. 4 (b) shows the results for the spatially invariant dynamic stream weighting approach. In this case, the DSWs in Eq. (2) are simplified to a time-variant but spatially invariant scalar, i.e.,

$$y_t^{(i,j)} = \lambda_t \log z_{\mathrm{A},t}^{(i,j)} + (1 - \lambda_t) \log z_{\mathrm{V},t}^{(i,j)}.$$

The dashed lines in Fig. 4 (b) represent the best reported single-modality baseline results for the LE and FR, respectively. The spatially invariant fusion achieves a similar, but slightly worse FR of 0.82. Taking the threshold into consideration, the FR rises from 0.1 to above 0.7 as the threshold increases from 0.5 to 0.6. The reason is that the low activations around 0.5 of the audio-only tracker are carried over to the spatially invariant fusion, and to the flat fusion as well. Beyond a threshold of 0.6, the FR increases slowly but steadily.

Comparing the LEs in Fig. 4 (b) with the values reported in Tab. 1, we can observe that incorporating additional information into the data fusion layer further improves the model's ability to recover the ground-truth locations of the speakers: with temporal information, the LE improves from 11.9 to 10.1 pixels, while the flat fusion strategy reported in Tab. 1 yields an even lower LE of 7.7 pixels.

Through multimodal fusion, the number of false positives also improves w.r.t. the respective best-performing unimodal system, which is similarly observable for the flat fusion strategy. Note that lower thresholds generally return worse results, and choosing a higher threshold helps to mask out outliers. Increasing the threshold beyond 0.55 does not lead to any further changes in the LEs and FPs, indicating that outliers are sufficiently suppressed at this threshold.

4.3. Spatially weighted fusion

Fig. 4 (c) shows the performance of the weighted fusion system. At a threshold of 0.75, the network achieves a FR of 0.92, clearly outperforming all the other strategies. Again, the low performance at a threshold of 0.5 mentioned in Sec. 4.2 is still visible. In terms of the LE, the best result was achieved with 4.10 pixels distance at a threshold of 0.75. For all thresholds above 0.5, nearly constant results were measured. The FPs per frame were relatively low, at an average of 0.028 over-predictions per frame.

Overall, the spatially weighted fusion strategy with a location-dependent matrix achieved the best performance. The target speakers were located with high precision, showing the lowest LE measured across all experiments. Even though the number of false positives increased slightly, the overall capabilities of the network are significantly better compared to the other systems.

Finally, comparing Fig. 4 (b) and (c), the question arises whether the improved performance may be caused by the ability to focus on different modalities in different parts of the room, and whether this ability may be helpful in dealing with multiple speakers. To investigate these questions, we compared the performance metrics for audio frames containing only a single speaker with audio frames containing multiple speakers. As hypothesized, we achieved the same results (LE = 1.31, FR = 1 and FP = 0) for both fusion strategies when only a single speaker is involved.
In contrast, when 4 speakers are present, the performances differ significantly. While the performance of the spatially weighted fusion strategy remains stable (LE = 0.53, FR = 0.99, FP = 0), the LE of the spatially invariant fusion strategy suffers (LE = 10.4, FR = 0.81, FP = 0). While based on comparatively little data, this analysis appears to indicate that spatial weighting is better equipped for dealing with multi-speaker scenarios.
5. CONCLUSION
In this paper, we introduced a novel deep-learning-based data fusion strategy by extending the idea of dynamic temporal stream weights to the spatial domain. Experiments show that the unimodal baseline video tracker based on the YOLO framework achieved accurate results as a stand-alone system, while, unsurprisingly, using only the unimodal audio tracker led to a significantly higher false positive rate. However, by fusing acoustic and visual information, it was possible to achieve an improved tracking result, with a 10% increase in frame recall and far more accurate localization, showing an error of 3.83 pixels on average. The results show that the audiovisual tracking system benefits from the proposed data fusion strategy, and that adding spatial dependence to the stream weights is highly beneficial compared to a purely temporal dependence, as in standard dynamic stream weighting.

6. REFERENCES

[1] J. J. A. Mendes, M. Vieira, M. B. Pires, and S. L. Stevan, "Sensor fusion and smart sensor in sports and biomedical applications," Sensors (Basel, Switzerland), vol. 16, no. 10, September 2016.
[2] M. Kam, X. Zhu, and P. Kalata, "Sensor fusion for mobile robot navigation," Proceedings of the IEEE, vol. 85, no. 1, pp. 108–119, 1997.
[3] W. Li, Z. Wang, G. Wei, L. Ma, J. Hu, and D. Ding, "A survey on multisensor fusion and consensus filtering for sensor networks," Discrete Dynamics in Nature and Society, vol. 2015, pp. 683701, Oct 2015.
[4] J. R. Hershey and J. R. Movellan, "Audio Vision: Using audio-visual synchrony to locate sounds," in Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. Müller, Eds., pp. 813–819. MIT Press, 2000.
[5] Y. Ban, X. Alameda-Pineda, L. Girin, and R. Horaud, "Variational Bayesian inference for audio-visual tracking of multiple speakers," CoRR, vol. abs/1809.10961, 2018.
[6] X. Alameda-Pineda, S. Arias, Y. Ban, G. Delorme, L. Girin, R. Horaud, X. Li, B. Morgue, and G. Sarrazin, "Audio-visual variational fusion for multi-person tracking with robots," in Proceedings of the 27th ACM International Conference on Multimedia (MM '19), New York, NY, USA, 2019, pp. 1059–1061, Association for Computing Machinery.
[7] I. D. Gebru, X. Alameda-Pineda, R. Horaud, and F. Forbes, "Audio-visual speaker localization via weighted clustering," Sep. 2014, pp. 1–6.
[8] D. Gatica-Perez, G. Lathoud, I. McCowan, J.-M. Odobez, and D. Moore, "Audio-visual speaker tracking with importance particle filters," in Proceedings 2003 International Conference on Image Processing, 2003, vol. 3, pp. III–25.
[9] D. Gatica-Perez, G. Lathoud, J. Odobez, and I. McCowan, "Audiovisual probabilistic tracking of multiple speakers in meetings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 601–616, Feb 2007.
[10] K. Nakadai, K. Hidai, H. G. Okuno, and H. Kitano, "Real-time speaker localization and speech separation by audio-visual integration," in Proceedings 2002 IEEE International Conference on Robotics and Automation, 2002, vol. 1, pp. 1043–1049.
[11] C. Busso, S. Hernanz, C.-W. Chu, S.-I. Kwon, S. Lee, P. G. Georgiou, I. Cohen, and S. Narayanan, "Smart room: participant and speaker localization and identification," in Proceedings (ICASSP '05), IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, vol. 2, pp. ii/1117–ii/1120.
[12] I. D. Gebru, S. Ba, X. Li, and R. Horaud, "Audio-visual speaker diarization based on spatiotemporal Bayesian fusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 5, 2018.
[13] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[14] H. Meutzner, N. Ma, R. Nickel, C. Schymura, and D. Kolossa, "Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 5320–5324.
[15] L. Schönherr, D. Orth, M. Heckmann, and D. Kolossa, "Environmentally robust audio-visual speaker identification," Dec 2016.
[16] C. Schymura and D. Kolossa, "Audiovisual speaker tracking using nonlinear dynamical systems with dynamic stream weights," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1065–1078, 2020.
[17] C. Schymura, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, and D. Kolossa, "A Dynamic Stream Weight backprop Kalman filter for audiovisual speaker tracking," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 581–585.
[18] S. Adavanne, A. Politis, and T. Virtanen, "Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network," in 2018 26th European Signal Processing Conference (EUSIPCO), Sep. 2018, pp. 1462–1466.
[19] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CoRR, vol. abs/1506.02640, 2015.
[20] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," CoRR, vol. abs/1804.02767, 2018.
[21] X. Mao, C. Shen, and Y. Yang, "Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., pp. 2802–2810. Curran Associates, Inc., 2016.
[22] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington, Eds., Chia Laguna Resort, Sardinia, Italy, May 2010, vol. 9 of Proceedings of Machine Learning Research, pp. 249–256, PMLR.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations (ICLR), Y. Bengio and Y. LeCun, Eds., 2015.
[24] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2019.
[25] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[26] W. Chen, H. Huang, S. Peng, C. Changsheng, and C. Zhang, "YOLO-Face: a real-time face detector."