Noise-Robust Adaptation Control for Supervised Acoustic System Identification Exploiting A Noise Dictionary
Thomas Haubner, Student Member, IEEE, Andreas Brendel, Student Member, IEEE, Mohamed Elminshawi, and Walter Kellermann, Fellow, IEEE
Abstract—We present a noise-robust adaptation control strategy for block-online supervised acoustic system identification by exploiting a noise dictionary. The proposed algorithm takes advantage of the pronounced spectral structure which characterizes many types of interfering noise signals. We model the noisy observations by a linear Gaussian Discrete Fourier Transform (DFT)-domain state-space model whose parameters are estimated by an online generalized Expectation-Maximization algorithm. Unlike all other state-of-the-art approaches, we suggest modeling the covariance matrix of the observation probability density function by a dictionary model. We propose to learn the noise dictionary from training data, which can be gathered either offline or online whenever the system is not excited, while the activations are inferred continuously. The proposed algorithm represents a novel machine-learning-based approach to noise-robust adaptation control for challenging online supervised acoustic system identification applications characterized by high-level and non-stationary interfering noise signals.
Index Terms—System Identification, Adaptation Control, Nonnegative Matrix Factorization, Acoustic Echo Cancellation
I. INTRODUCTION
Manuscript received July 3, 2020. This work was supported by the DFG under contract no. Ke890/10-2 within the Research Unit FOR2457 "Acoustic Sensor Networks". T. Haubner, A. Brendel, M. Elminshawi, and W. Kellermann are with the Chair of Multimedia Communications and Signal Processing (LMS), University of Erlangen-Nuremberg, 91058 Erlangen, Germany (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Online Supervised Acoustic System Identification (OSASI) is required for many modern hands-free acoustic human-machine interfaces, e.g., for the purpose of echo cancellation [1]. During recent decades, a multitude of OSASI algorithms has been developed, originating from the plain time-domain Least Mean Square algorithm [2] and evolving to sophisticated implementations operating in the block-frequency domain [3], [4], [5]. In this context, robust OSASI typically has to overcome interfering signals which are either undesired, e.g., traffic noise, or desired, e.g., near-end talkers. Both types of interference are termed noise in the following.

Noisy observations are usually addressed by adaptation control mechanisms which are motivated by the non-stationarity of many acoustic excitation and noise signals. A popular approach for iterative OSASI algorithms is Variable Step Size (VSS) control, which leads to either binary or continuous-valued step sizes. As binary VSS control, i.e., halting the filter adaptation during periods of high-level interfering noise [6], [7], [8], does not allow for permanent filter optimization, it is less suited for time-varying acoustic environments and applications with persistently high-level interfering noise signals. Thus, we focus in this paper on continuous VSS control, which stipulates a permanent filter adaptation. Many VSSs have been developed, ranging from scalar time-domain step sizes [9], [10], [11] to frequency-dependent step sizes which take into account the temporal correlation of the input and noise signals [12], [13], [14] and thus often result in faster convergence. In particular, the inference of the adaptive filter coefficients by a Kalman filter [11], [13], [14] has proven to be a powerful approach to VSS control. The robustness of these algorithms against interfering signals crucially depends on precise estimates of the power of the process noise and the observation noise, respectively. In [14] it is proposed to estimate both jointly with the adaptive filter coefficients by optimizing a single Maximum-Likelihood (ML) objective function. However, due to the high-dimensional linear Gaussian Discrete Fourier Transform (DFT)-domain state-space model [13], [14], it is difficult to obtain a precise noise power estimate if the adaptive filter has not converged or the noise signals are highly non-stationary. In this paper we address this problem by introducing a nonnegative noise dictionary model, which we term State-Space Frequency-Domain Adaptive Filter with a Nonnegative Matrix Factorization Noise Model (SSFDAF-NMF).
The proposed model captures the pronounced spectral structure which characterizes many types of interfering noise signals, e.g., wind noise [15], music [16], speech [17] or robotic ego-noise [18]. We suggest to estimate the noise dictionary from training data and to infer its activations continuously by a generalized Expectation-Maximization (EM) algorithm [19]. The optimization of the model shows a close relation to Itakura-Saito (IS) divergence-based NMF. The proposed algorithm represents a computationally efficient block-online supervised adaptive filter with inherent noise-robust VSS control.

We use bold lowercase letters for vectors and bold uppercase letters for matrices, with underlined symbols indicating time-domain quantities. The D-dimensional identity matrix is denoted by I_D, the D-dimensional DFT matrix by F_D, and the D × D zero matrix by 0_{D×D}. The superscripts (·)^T and (·)^H represent transposition and Hermitian transposition, respectively. We denote the determinant by det(·), the trace by Tr(·), and introduce the diag(·) operator which creates a diagonal matrix from its vector-valued argument. Finally, we use =_c to denote equality up to a constant and ◦ to indicate element-wise operations.

II. ONLINE MAXIMUM-LIKELIHOOD LEARNING
As the proposed algorithm is based on the noisy linear frequency-domain echo path model described in [13] and related to the online inference of its parameters introduced in [14], we give a short summary of both in the following section.
A. Linear Gaussian DFT-Domain State-Space Model
We model the noisy R-dimensional time-domain observation vector

y̲_τ = Q_1^T F_M^{−1} X_τ F_M Q_2 w̲_τ + s̲_τ ∈ ℝ^R   (1)

at block index τ by an overlap-save convolution of the input signal

x̲_τ = (x_{τR−M+1}, x_{τR−M+2}, …, x_{τR})^T ∈ ℝ^M   (2)

of even length M and block shift R with the Finite Impulse Response (FIR) filter w̲_τ ∈ ℝ^{M−R}, which is superimposed by the noise vector

s̲_τ = (s_{τR−R+1}, s_{τR−R+2}, …, s_{τR})^T ∈ ℝ^R.   (3)

We used here the diagonal matrix X_τ = diag(F_M x̲_τ) ∈ ℂ^{M×M}, which represents the DFT-transformed input signal block x̲_τ, the linear convolution constraint matrix Q_1^T = (0_{R×(M−R)}  I_R) and the zero-padding matrix Q_2^T = (I_{M−R}  0_{(M−R)×R}). By transforming the zero-padded time-domain signals in Eq. (1) to the DFT domain, i.e.,

y_τ = F_M Q_1 y̲_τ ∈ ℂ^M   and   s_τ = F_M Q_1 s̲_τ ∈ ℂ^M,   (4)

we obtain the frequency-domain observation equation

y_τ = C_τ w_τ + s_τ   (5)

with the DFT-transformed FIR filter w_τ = F_M Q_2 w̲_τ ∈ ℂ^M and the overlap-save constrained input signal matrix C_τ = F_M Q_1 Q_1^T F_M^{−1} X_τ. Here, we model the frequency-domain observation noise vector s_τ as a non-stationary, block-wise and spectrally uncorrelated zero-mean complex Gaussian random process which is distributed according to the probability density function (pdf)

p(s_τ) = p(s_τ | S_{τ−1}) = 𝒩_c(s_τ | 0_{M×1}, Ψ^S_τ)   (6)

with S_{τ−1} = (s_1, …, s_{τ−1}). The diagonal entries of the noise covariance matrix, [Ψ^S_τ]_{mm} = E[s_{mτ} s*_{mτ}], with E[·] denoting the expectation operator, approximate the noise Power Spectral Density (PSD) at block τ.

In [13], it is proposed to model the temporal evolution of the DFT-transformed FIR filter w_τ in terms of a random-walk Markov model with a stationary and diagonal process noise covariance matrix Ψ^∆_τ.
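As a concrete illustration, the noise-free part of the overlap-save model (1) can be sketched in a few lines of NumPy; the function name and variable layout are ours, not part of the paper.

```python
import numpy as np

def overlap_save_block(x_block, w, R):
    """Noise-free part of Eq. (1): one overlap-save output block.

    x_block : the last M input samples (length M, M even)
    w       : time-domain FIR filter of length M - R
    Returns the R newest output samples, i.e., Q_1^T F_M^{-1} X_tau F_M Q_2 w_tau.
    """
    M = len(x_block)
    X = np.fft.fft(x_block)                      # diagonal of X_tau
    w_pad = np.concatenate([w, np.zeros(R)])     # Q_2 w_tau (zero-padding)
    y_circ = np.fft.ifft(X * np.fft.fft(w_pad))  # circular convolution of length M
    return np.real(y_circ[-R:])                  # Q_1^T keeps the last R samples


# The last R samples of the circular convolution equal the linear convolution:
M, R = 16, 4
x = np.random.randn(M)
w = np.random.randn(M - R)
assert np.allclose(overlap_save_block(x, w, R), np.convolve(x, w)[M - R:M])
```

The assertion holds because, for a filter of length M − R, only the first M − R − 1 samples of the circular convolution suffer from time-domain aliasing.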
This random-walk model allows describing the OSASI problem by the linear Gaussian DFT-domain state-space model

w_τ = A w_{τ−1} + ∆w_τ   with   ∆w_τ ∼ 𝒩_c(∆w_τ | 0_{M×1}, Ψ^∆_τ)   (7)
y_τ = C_τ w_τ + s_τ   with   s_τ ∼ 𝒩_c(s_τ | 0_{M×1}, Ψ^S_τ),   (8)

with the state transition coefficient 0 < A < 1.

B. Online Inference
In [14] it is proposed to infer the state-space model parameters in Θ̃_τ = {Ψ^S_τ, Ψ^∆_τ} by optimizing the ML objective function

C̃_ML(Θ̃_τ) = log p(y_τ | Y_{τ−1}, Θ̃_τ)   (9)

with Y_{τ−1} = (y_1, …, y_{τ−1}) by an EM algorithm which we term State-Space Frequency-Domain Adaptive Filter (SSFDAF). The adaptive filter coefficient vector w_τ is treated as a latent random vector of the log-likelihood (9), which leads via Jensen's inequality to the lower bound [14]

C̃_ML(Θ̃_τ) = log ∫ p(y_τ, w_τ | Y_{τ−1}, Θ̃_τ) dw_τ   (10)
           ≥ ∫ q(w_τ) log [ p(y_τ, w_τ | Y_{τ−1}, Θ̃_τ) / q(w_τ) ] dw_τ = Q̃(q, Θ̃_τ)

for the log-likelihood function, with q(w_τ) being some pdf. The l-th E-step of the algorithm, i.e., the variational optimization w.r.t. q(w_τ), is addressed by the Kalman filter update

ŵ^+_{τ−1,(l)} = A ŵ_{τ−1,(L)}
P^+_{τ−1,(l)} = A² P_{τ−1,(L)} + Ψ^∆_{τ,(l)}
Λ_{τ,(l)} = P^+_{τ−1,(l)} ( X_τ P^+_{τ−1,(l)} X^H_τ + (M/R) Ψ^S_{τ,(l)} )^{−1}
ŵ_{τ,(l)} = ŵ^+_{τ−1,(l)} + Λ_{τ,(l)} X^H_τ ( y_τ − C_τ ŵ^+_{τ−1,(l)} )
P_{τ,(l)} = [ I_M − (R/M) Λ_{τ,(l)} X^H_τ X_τ ] P^+_{τ−1,(l)}   (11)

with ŵ_{τ,(l)} being the posterior mean and P_{τ,(l)} the corresponding diagonal state uncertainty [13]. Note that the frequency-dependent step sizes, contained in the diagonal matrix Λ_{τ,(l)}, depend on the predicted state uncertainty P^+_{τ−1,(l)}, the DFT-domain input signal X_τ and the estimated noise covariance matrix Ψ^S_{τ,(l)}. By inserting the optimum function q_{(l)}, i.e., the posterior pdf resulting from the Kalman filter Eqs. (11), into (10) we obtain the lower bound [14]

Q̃(q_{(l)}, Θ̃_τ) =_c Q̃^∆(q_{(l)}, Ψ^∆_τ) + Q̃^S(q_{(l)}, Ψ^S_τ)   (12)

with

Q̃^∆(q_{(l)}, Ψ^∆_τ) =_c − log det(P^+_{τ−1}) − Tr( (P^+_{τ−1})^{−1} E_{q_{(l)}}[ (w_τ − A ŵ_{τ−1})(w_τ − A ŵ_{τ−1})^H ] )   (13)

Q̃^S(q_{(l)}, Ψ^S_τ) =_c − log det(Ψ^S_τ) − Tr( (Ψ^S_τ)^{−1} E_{q_{(l)}}[ (y_τ − C_τ w_τ)(y_τ − C_τ w_τ)^H ] ).   (14)

Finally, in the M-step the optimum parameters in Θ̃_τ = {Ψ^S_τ, Ψ^∆_τ} are obtained by [14]

Ψ^∆_{τ,(l)} = (1 − A²) ( ŵ_{τ,(l)} ŵ^H_{τ,(l)} + P_{τ,(l)} )   (15)
Ψ^S_{τ,(l)} = e_{τ,(l)} e^H_{τ,(l)} + (R/M) X_τ P_{τ,(l)} X^H_τ   (16)

with the a-posteriori error e_{τ,(l)} = y_τ − C_τ ŵ_{τ,(l)}. By assuming Ψ^∆_{τ,(l)} and Ψ^S_{τ,(l)} to be diagonal, the off-diagonal terms of the outer products in Eqs. (15) and (16) are neglected.
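For reference, one diagonalized E/M-step of Eqs. (11), (15) and (16) can be sketched per frequency bin as follows. This is our own NumPy illustration under the usual diagonal approximation, with C_τ approximated by X_τ; it is not the authors' code.

```python
import numpy as np

def ssfdaf_em_step(w, P, X, y, psi_delta, psi_s, A, M, R):
    """One diagonalized SSFDAF EM step (sketch of Eqs. (11), (15), (16)).

    All array arguments are per-bin vectors of length M; A, M, R are scalars.
    """
    # E-step: Kalman filter update, Eqs. (11)
    w_plus = A * w                                  # predicted mean
    P_plus = A**2 * P + psi_delta                   # predicted uncertainty
    step = P_plus / (np.abs(X)**2 * P_plus + (M / R) * psi_s)  # Lambda (diag.)
    e_prior = y - X * w_plus                        # prior error (C_tau ~ X_tau)
    w_new = w_plus + step * np.conj(X) * e_prior
    P_new = (1.0 - (R / M) * step * np.abs(X)**2) * P_plus
    # M-step: diagonal covariance updates, Eqs. (15) and (16)
    e_post = y - X * w_new                          # a-posteriori error
    psi_delta_new = (1.0 - A**2) * (np.abs(w_new)**2 + P_new)
    psi_s_new = np.abs(e_post)**2 + (R / M) * np.abs(X)**2 * P_new
    return w_new, P_new, psi_delta_new, psi_s_new
```

Since (R/M) · step · |X|² < 1 in every bin, the posterior uncertainty stays positive, which keeps the step sizes in (11) well defined.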
III. PROPOSED SSFDAF-NMF ALGORITHM
In this section we describe the proposed VSS control by exploiting a nonnegative dictionary noise model.
A. Probabilistic Nonnegative Dictionary Noise Model
Here, in contrast to [14] and all other state-of-the-art VSS control strategies, we model the observation noise covariance matrix by a nonnegative dictionary model

Ψ^S_τ = diag(T v_τ).   (17)

Hence, the observation model (8) is parametrized by the nonnegative dictionary matrix T ∈ ℝ^{M×K}_{≥0}, which contains K noise atoms, and the respective activation vector v_τ ∈ ℝ^K_{≥0}. We suggest to employ only the activation vector v_τ of the noise model and the state transition noise covariance matrix Ψ^∆_τ as state-space model parameters Θ_τ = {v_τ, Ψ^∆_τ}, which need to be inferred continuously online from the noisy observations. The dictionary matrix T is estimated from a time-domain training data vector s̲^T_tr = (s̲^T_1, …, s̲^T_J) ∈ ℝ^{JR}, which can be obtained either offline, if prior knowledge about the expected noise type is available, or online otherwise. In the latter case we append the observed signal y̲_τ to s̲_tr, i.e., s̲^T_tr ← (s̲^T_tr, y̲^T_τ), whenever the input signal is not active, i.e., x̲_τ ≈ 0_{M×1} and thus y̲_τ ≈ s̲_τ (cf. Eq. (1)). The time-domain training data vector s̲_tr is straightforwardly transformed to the corresponding Short-Time Fourier Transform (STFT) training data matrix S_tr = (s_{tr,1}, …, s_{tr,N}) ∈ ℂ^{M×N} by decomposing s̲_tr into N blocks s̲_{tr,τ} ∈ ℝ^M of length M with block shift R_tr, followed by windowing and DFT transformation. Note that, in comparison to s_τ (cf. Eq. (4)), the time-domain training data blocks s̲_{tr,τ} are not zero-padded, which increases the spectral resolution of the corresponding non-stationary noise PSD samples |s_{tr,τ}|^{◦2}, where τ = 1, …, N.
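The construction of the training data matrix S_tr described above can be sketched as follows; the Hamming window mirrors the choice in the experimental section, and the function name is ours.

```python
import numpy as np

def training_spectra(s_tr, M, R_tr):
    """Frame s_tr into N blocks of length M with shift R_tr, window, and DFT.

    Returns the non-redundant spectra (M // 2 + 1 bins) as columns of S_tr.
    """
    N = 1 + (len(s_tr) - M) // R_tr                  # number of full blocks
    win = np.hamming(M)
    frames = np.stack([win * s_tr[i * R_tr : i * R_tr + M] for i in range(N)])
    # Conjugate symmetry of real signals: rfft keeps only bins 0 .. M/2
    return np.fft.rfft(frames, axis=1).T             # shape (M // 2 + 1, N)
```

The squared magnitudes of these columns are the PSD samples |s_{tr,τ}|^{◦2} used as NMF training targets.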
We propose to estimate the dictionary matrix T by maximizing the log-likelihood log p(S_tr) = log ∏_{τ=1}^{N} p(s_{tr,τ}), which is equivalent to maximizing the IS-NMF objective function [16]

C_IS(T, V_tr) =_c − Σ_{m=1}^{M} Σ_{τ=1}^{N} ( log Σ_{k=1}^{K} t_{mk} v_{tr,kτ} + |s_{tr,mτ}|² / Σ_{k=1}^{K} t_{mk} v_{tr,kτ} )   (18)

with s_{tr,mτ} = [S_tr]_{mτ}, t_{mk} = [T]_{mk} and v_{tr,kτ} = [V_tr]_{kτ}. This allows optimizing the dictionary matrix T by the multiplicative IS-NMF update rules [20]

V_tr ← V_tr ◦ [ T^T ( (T V_tr)^{◦−2} ◦ |S_tr|^{◦2} ) / ( T^T (T V_tr)^{◦−1} ) ]^{◦1/2}   (19)
T ← T ◦ [ ( (T V_tr)^{◦−2} ◦ |S_tr|^{◦2} ) V_tr^T / ( (T V_tr)^{◦−1} V_tr^T ) ]^{◦1/2}   (20)

which represent an instance of the Minorize-Maximization (MM) algorithm [21]. Note that the training data activation matrix V_tr is only needed for estimating the dictionary and that, due to the conjugate symmetry of s_τ and s_{tr,τ}, it is sufficient to compute the non-redundant part of the dictionary, i.e., T̃ = ( I_{M/2+1}  0_{(M/2+1)×(M/2−1)} ) T.
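A minimal NumPy sketch of the multiplicative updates (19) and (20), assuming the element-wise exponent 1/2 of the MM derivation in [20]; variable names and the small regularizer eps are ours.

```python
import numpy as np

def is_nmf_step(T, V, S_pow, eps=1e-12):
    """One multiplicative IS-NMF update of V (Eq. (19)) and T (Eq. (20)).

    S_pow holds the observed power spectra |S_tr|^{.2} (nonnegative, M x N).
    """
    TV = T @ V + eps
    V = V * (T.T @ (TV**-2 * S_pow) / (T.T @ TV**-1 + eps))**0.5
    TV = T @ V + eps
    T = T * ((TV**-2 * S_pow) @ V.T / (TV**-1 @ V.T + eps))**0.5
    return T, V


def is_divergence(S_pow, TV):
    """Itakura-Saito divergence between S_pow and the model TV."""
    ratio = S_pow / TV
    return np.sum(ratio - np.log(ratio) - 1.0)
```

With the exponent 1/2, each call does not increase the IS divergence between |S_tr|^{◦2} and T V_tr, which is what makes the dictionary training a proper MM algorithm.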
B. Online Inference

We propose to estimate the parameters in Θ_τ by maximizing the log-likelihood function

C_ML(Θ_τ) = log p(y_τ | Y_{τ−1}, Θ_τ).   (21)

Due to the linear model, we obtain a tight lower bound w.r.t. the activation vector v_τ by inserting the dictionary model (17) into (14) and evaluating the expectation

Q^S(q_{(l)}, v_τ) =_c − Σ_{m=1}^{M} ( log Σ_{k=1}^{K} t_{mk} v_{kτ} + ψ^e_{mτ,(l)} / Σ_{k=1}^{K} t_{mk} v_{kτ} )   (22)

with ψ^e_{mτ,(l)} = [ e_{τ,(l)} e^H_{τ,(l)} + C_τ P_{τ,(l)} C^H_τ ]_{mm} representing the expected posterior error power. By comparing Eq. (22) to Eq. (18), we observe that the lower bound w.r.t. the activation vector v_τ is an IS-NMF objective function with the expected posterior error power ψ^e_{mτ,(l)} as target variable. In comparison to Q̃^S(q_{(l)}, Ψ^S_τ) (cf. Eq. (14)), the lower bound Q^S(q_{(l)}, v_τ) does not allow for a closed-form solution as given by Eq. (16). Thus, we suggest to optimize it iteratively by applying the multiplicative update (19) P times, which yields a series of non-decreasing values of the objective function [20]. Hence, the proposed SSFDAF-NMF represents a generalized EM algorithm [19].

For updating the process noise covariance matrix Ψ^∆_τ, we first observe that the optimization w.r.t. Ψ^∆_τ is independent of v_τ, which results from the additive structure of the lower bound (12). Hence, we can use Eq. (15).

C. Algorithmic Description
Alg. 1 summarizes the proposed SSFDAF-NMF algorithm. For each block of data indexed by τ, L EM steps are carried out. The E-step updates the posterior mean ŵ_{τ,(l)} and the respective state uncertainty matrix P_{τ,(l)} of the latent adaptive filter coefficients by the Kalman filter Eqs. (11). Subsequently, the expected posterior error power ψ^e_{mτ,(l)} is computed, which can efficiently be approximated by ψ^e_{mτ,(l)} ≈ [Ψ^S_{τ,(l)}]_{mm} [13]. In the M-step, the diagonal process noise covariance matrix Ψ^∆_τ and the dictionary activation v_τ are updated. We propose to initialize the iterative optimization procedure for estimating v_{τ,(l)} with the optimum activation vector v_{τ−1,(L)} of the previous time frame. Finally, the observation noise covariance matrix Ψ^S_{τ,(l)} is updated by Eq. (17).

Algorithm 1 OSASI by SSFDAF-NMF.
for τ = 1, …, T do
    for l = 1, …, L do
        E-step: Update ŵ_{τ,(l)} and P_{τ,(l)} by Kalman filter (11)
        M-step:
            Update state cov. matrix Ψ^∆_{τ,(l)} by Eq. (15)
            Compute exp. post. error power by Eq. (16)
            Optimize dict. act. v_{τ,(l)} by applying Eq. (19) P times
            Update noise cov. matrix Ψ^S_{τ,(l)} ← diag(T v_{τ,(l)})
    end for
end for
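One block iteration of Alg. 1 can be sketched end-to-end in NumPy as follows. This is our own self-contained illustration under the diagonal approximation with C_τ ≈ X_τ and hypothetical dimensions, not the reference implementation.

```python
import numpy as np

def ssfdaf_nmf_block(w, P, v, X, y, T_dict, A, M, R, L=2, P_iters=3, eps=1e-12):
    """One block (outer tau-iteration) of Alg. 1 for diagonalized quantities."""
    for _ in range(L):
        psi_s = T_dict @ v                           # Eq. (17): noise PSD model
        psi_delta = (1 - A**2) * (np.abs(w)**2 + P)  # Eq. (15)
        # E-step: Kalman filter, Eqs. (11)
        w_plus, P_plus = A * w, A**2 * P + psi_delta
        step = P_plus / (np.abs(X)**2 * P_plus + (M / R) * psi_s)
        w = w_plus + step * np.conj(X) * (y - X * w_plus)
        P = (1.0 - (R / M) * step * np.abs(X)**2) * P_plus
        # Expected posterior error power (diagonal of Eq. (16))
        psi_e = np.abs(y - X * w)**2 + (R / M) * np.abs(X)**2 * P
        # M-step for the activations: P_iters multiplicative updates, Eq. (19)
        for _ in range(P_iters):
            Tv = T_dict @ v + eps
            v = v * (T_dict.T @ (Tv**-2 * psi_e) / (T_dict.T @ Tv**-1 + eps))**0.5
    return w, P, v
```

Warm-starting v with the result of the previous block, as the text proposes, amounts to simply passing the returned v back into the next call.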
IV. EXPERIMENTS
In this section the proposed SSFDAF-NMF algorithm is evaluated for an acoustic echo cancellation scenario which is exposed to various types of recorded noise signals. The respective echoes were simulated by convolving speech source signals, i.e., randomly-selected talkers of [22], with measured room impulse responses (RIRs) h̲ ∈ ℝ^D of length D = 48000 at a sampling frequency f_s = 16 kHz. The RIRs were taken from the AIR database [23] and describe a hands-free interaction of a human with a mock-up phone in meeting room, office and lecture environments with T_60 = { , , } ms, respectively. The clean echo signals were distorted by adding recorded noise signals which were scaled according to the desired Signal-to-Noise Ratio (SNR). As noise signals we considered near-end speech (SP), chosen from additional talkers of [22], traffic noise (TR) from the CHiME 3 challenge [24], and a mixture of both (SP+TR). The speech signal of the speech-traffic mixture was scaled to a fixed multiple of the power of the traffic noise signal before the sum signal was scaled to generate the desired SNR.

As performance measure we used the block-dependent logarithmic system mismatch [1]

Υ_τ = 10 log₁₀( ||h̲ − ŵ̲_τ||² / ||h̲||² )   (23)

with ŵ̲_τ = Q_2^T F_M^{−1} ŵ_τ and ||·|| denoting the Euclidean norm. Note that we only use the first M − R taps of the true RIR h̲ in Eq. (23) to obtain an estimate of the attainable system mismatch. The echo caused by the remaining D − (M − R) taps is considered as noise in the observation model (8). As baseline approach we use the state-of-the-art SSFDAF algorithm [14]. For both algorithms, i.e., SSFDAF and the proposed SSFDAF-NMF, we chose the state transition coefficient A = 0. , the block length M = 1536 and the block shift R = 512, corresponding to a filter length M − R = 1024, and L = 2 EM iterations. For training the dictionaries we chose a set of s-long signals which are temporally complementary to the evaluation data.
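The mismatch measure of Eq. (23) is straightforward to compute; the helper below uses our own naming.

```python
import numpy as np

def system_mismatch_db(h, w_hat):
    """Normalized logarithmic system mismatch, Eq. (23), in dB.

    h and w_hat must cover the same first M - R taps of the RIR.
    """
    return 10.0 * np.log10(np.linalg.norm(h - w_hat)**2 / np.linalg.norm(h)**2)


h = np.random.randn(1024)
print(system_mismatch_db(h, np.zeros_like(h)))   # 0.0 dB for the all-zero estimate
```

Halving the coefficient error reduces the mismatch by about 6 dB, which is a convenient sanity check when reading convergence plots.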
The near-end speech (SP) training data included neither the input nor the local speech signal used for testing the algorithms. For computing the training data matrix S_tr we used a Hamming window and a frame shift R_tr = 512. While for the traffic noise scenario one basis vector and one MM iteration proved to be sufficient, the spectrally more diverse speech noise applications required multiple basis vectors and optimization steps, see Tab. I.

TABLE I: Dictionary parameters and runtime performance of both algorithms for all considered scenarios.

Algorithm    | Noise Type       | K  | P | t_block in ms | t_dic in s
SSFDAF       | {TR, SP, TR+SP}  | —  | — |  .            | —
SSFDAF-NMF   | {TR}             |    |   |  .72          | 0.
SSFDAF-NMF   | {SP, TR+SP}      | 10 | 3 | 1.94          | 3.

Fig. 1: Performance evaluation of SSFDAF and the proposed SSFDAF-NMF for different noise types (Traffic, Speech, Traffic+Speech) and SNR levels, showing Ῡ_τ ± ε_τ over time in s.

Fig. 1 shows the block-dependent average logarithmic system mismatch Ῡ_τ and the respective standard deviation ε_τ for different noise signal types and SNR levels. The results have been computed by evaluating Monte Carlo experiments, with each experiment being defined by randomly drawing an RIR, an input and interfering noise signal, and a training noise signal. As can be concluded from Fig. 1, the proposed SSFDAF-NMF outperforms the baseline significantly in terms of both convergence rate, which is approximately doubled, and steady-state performance in all considered scenarios. Its superior robustness with respect to varying input and noise signals becomes obvious from the persistently small standard deviation. Furthermore, the average runtime to process one input signal block, t_block, on an Intel(R) Xeon(R) CPU E3-1275 v6 @ 3.80 GHz, is summarized in Tab. I. The SSFDAF-NMF requires only slightly more runtime in comparison to the SSFDAF. Finally, Tab. I shows the training runtime t_dic. Note that due to the speaker-independent training, the speech dictionary estimation can be computed offline.

V. CONCLUSION
We have introduced a novel, computationally efficient adaptation control for block-OSASI by exploiting a nonnegative noise dictionary. The proposed algorithm exhibits significantly faster convergence and improved steady-state performance for different types of recorded noise signals in comparison to a high-performance state-of-the-art algorithm of similar complexity.

REFERENCES
[1] G. Enzner, H. Buchner, A. Favrot, and F. Kuech, "Acoustic Echo Control," in Academic Press Library in Signal Processing, vol. 4, pp. 807–877. Elsevier, FL, USA, 2014.
[2] B. Widrow and M. E. Hoff, "Adaptive Switching Circuits," in Proc. WESCON Conv. Rec., Los Angeles, USA, Aug. 1960, pp. 96–104.
[3] E. Ferrara, "Fast implementations of LMS adaptive filters," IEEE Trans. Acoust., vol. 28, no. 4, pp. 474–475, Aug. 1980.
[4] D. Mansour and A. Gray, "Unconstrained frequency-domain adaptive filter," IEEE Trans. Acoust., vol. 30, no. 5, pp. 726–734, Oct. 1982.
[5] S. Haykin, Adaptive Filter Theory, Prentice Hall, NJ, USA, 2002.
[6] J. Benesty, D. Morgan, and J. Cho, "A new class of doubletalk detectors based on cross-correlation," IEEE Trans. Speech Audio Process., vol. 8, no. 2, pp. 168–172, Mar. 2000.
[7] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach, Wiley-Interscience, NJ, USA, 2004.
[8] N. Cahill and R. Lawlor, "An Approach to Doubletalk Detection Based on Non-Negative Matrix Factorization," in Proc. IEEE Int. Symp. Signal Proc. Inf. Tech., Sarajevo, Bosnia and Herzegovina, Dec. 2008, pp. 497–501.
[9] J. Benesty, H. Rey, L. R. Vega, and S. Tressens, "A Nonparametric VSS NLMS Algorithm," IEEE Signal Process. Lett., vol. 13, no. 10, pp. 581–584, 2006.
[10] H. Huang and J. Lee, "A new variable step-size NLMS algorithm and its performance analysis," IEEE Trans. Signal Process., vol. 60, no. 4, pp. 2055–2060, 2012.
[11] C. Huemmer, R. Maas, and W. Kellermann, "The NLMS algorithm with time-variant optimum stepsize derived from a Bayesian network perspective," IEEE Signal Process. Lett., vol. 22, no. 11, pp. 1874–1878, 2015.
[12] B. H. Nitsch, "A frequency-selective stepfactor control for an adaptive filter algorithm working in the frequency domain," Signal Process., vol. 80, no. 9, pp. 1733–1745, 2000.
[13] G. Enzner and P. Vary, "Frequency-domain adaptive Kalman filter for acoustic echo control in hands-free telephones," Signal Process., vol. 86, no. 6, pp. 1140–1156, June 2006.
[14] S. Malik and G. Enzner, "Online maximum-likelihood learning of time-varying dynamical models in block-frequency-domain," in Proc. Int. Conf. Acoust., Speech, Signal Process., Dallas, USA, Mar. 2010, pp. 3822–3825.
[15] M. N. Schmidt, J. Larsen, and F.-T. Hsiao, "Wind Noise Reduction using Non-Negative Sparse Coding," in IEEE Int. Workshop Mach. Learn. Signal Process., Thessaloniki, Greece, Aug. 2007, pp. 431–436.
[16] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis," Neural Comput., vol. 21, no. 3, pp. 793–830, Mar. 2009.
[17] P. Smaragdis, "Convolutive Speech Bases and Their Application to Supervised Speech Separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 1–12, Jan. 2007.
[18] T. Haubner, A. Schmidt, and W. Kellermann, "Multichannel Nonnegative Matrix Factorization for Ego-Noise Suppression," in ITG-Symp. Speech Commun., Oldenburg, Germany, Oct. 2018.
[19] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data Via the EM Algorithm," J. R. Stat. Soc. Series B Stat. Methodol., vol. 39, no. 1, pp. 1–22, Sept. 1977.
[20] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Comput., vol. 23, no. 9, pp. 2421–2456, 2011.
[21] D. R. Hunter and K. Lange, "A tutorial on MM algorithms," Am. Stat., vol. 58, no. 1, pp. 30–37, 2004.
[22] L. M. Panfili, J. Haywood, D. R. McCloy, P. E. Souza, and R. A. Wright, "The UW/NU corpus, version 2.0," https://depts.washington.edu/phonlab/projects/uwnu.php, 2017.
[23] M. Jeub, M. Schäfer, H. Krüger, C. Nelke, C. Beaugeant, and P. Vary, "Do We Need Dereverberation for Hand-Held Telephony?," in Int. Congr. Acoust., Sydney, Australia, 2010.
[24] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The Third 'CHiME' Speech Separation and Recognition Challenge: Analysis and Outcomes,"