Bayesian Non-Parametric Multi-Source Modelling Based Determined Blind Source Separation
Chaitanya Narisetty, Tatsuya Komatsu, and Reishi Kondo
Data Science Research Laboratories, NEC Corporation, Japan
ABSTRACT
This paper proposes a determined blind source separation method using Bayesian non-parametric modelling of sources. Conventionally, source signals are separated from a given set of mixture signals by modelling them using non-negative matrix factorization (NMF). In NMF, however, a latent variable signifying model complexity must be appropriately specified to avoid over-fitting or under-fitting. As real-world sources can be of varying and unknown complexities, we propose a Bayesian non-parametric framework which is invariant to such latent variables. We show that our proposed method adapts to different source complexities, while conventional methods require parameter tuning for optimal separation.
Index Terms — Blind Source Separation, Non-Negative Matrix Factorization, Bayesian Non-Parametrics, Inference
1. INTRODUCTION
The ubiquity of smart devices in our present day provides a diverse range of audio-based applications, like detecting threat-indicating sounds, transcribing speech, and building AI assistants, to facilitate our day-to-day activities [1]. Depending on the application, the desired information to be extracted from the audio signals changes. Real-world audio signals are often obscured by undesired information, which necessitates the development of source separation methods [2]. As most smart devices are equipped with two or more microphones, the recorded multi-channel data can be leveraged to improve separation performance [3]. However, the arrangement of microphones varies across devices, as do the source characteristics and locations. Separating sources without such device or source information is termed blind source separation (BSS) [4, 5].

Fundamental BSS techniques include matrix factorization methods like independent component analysis (ICA) [6] and non-negative matrix factorization (NMF) [7]. These techniques formulate the audio signals captured by microphones as linear mixtures of two or more source signals. In other words, the spectra of mixture signals are treated as linear combinations of the spectra of source signals, and are solved by assuming independence among the distributions of source spectra. This assumption of independence is often not true, because the spectrum of a typical audio source is highly correlated with itself across different time intervals. Such correlations, or fundamental patterns in each source, are efficiently extracted using NMF [8]; they are referred to as bases. The weights with which these bases are linearly combined to reconstruct the spectra are referred to as activations, and are extracted simultaneously by NMF. BSS methods are therefore extended by modelling such correlations using multi-channel NMF [7, 9]. It is intuitive that source separation becomes more reliable as the number of available microphones increases.
When there are as many microphones as there are sources, the situation is said to be determined. Ono et al. [10] proposed independent vector analysis (IVA) for determined BSS, using an auxiliary function technique to derive stable and fast update rules for the demixing parameters. Kitamura et al. [11] proposed an independent low-rank matrix analysis method (ILRMA) by unifying IVA and multi-source NMF modelling.

In the NMF-based BSS methods discussed above, it is required to provide an appropriate value for NMF's model complexity, i.e. the number of bases extracted for each source model. When this latent variable is too low or too high, it leads to under-fitting or over-fitting respectively. Its value is often chosen depending on source characteristics or by parameter tuning. However, real-world source characteristics are unknown, and their complexities range from simple keyboard clicking, to moderately complex sounds of drums, to complex music pieces. Although non-parametric methods exist for estimating the number of sources, to the best of our knowledge there are no NMF-based BSS techniques which can adaptively model sources having different complexities.

We propose a non-parametric framework of multi-source modelling unified with IVA for determined BSS, and overcome the problem of tuning NMF's model complexity parameter. Our proposed method utilizes the concepts of variational Bayesian inference to statistically estimate each source's complexity, thereby optimally separating the sources. The proposed method therefore serves as a generalization of ILRMA that adapts to varying source complexities.

The remainder of this paper is organized as follows. Section 2 details the conventional method, ILRMA. Section 3 formulates our proposed method. Section 4 details the experimental results and discusses a few potential implications of the proposed method. Finally, Section 5 concludes our work.
2. CONVENTIONAL METHOD

Most BSS methods model the given mixture spectra as linear combinations of source spectra and are formulated as
\[
\mathbf{x}_{ij} = \mathbf{A}_i \mathbf{s}_{ij}, \tag{1}
\]
where $\mathbf{x}_{ij} = (x_{ij,1}, \ldots, x_{ij,M})^T$ and $\mathbf{s}_{ij} = (s_{ij,1}, \ldots, s_{ij,N})^T$ denote the mixture and source spectra respectively for each index $i \in \{1, \ldots, I\}$, $j \in \{1, \ldots, J\}$. $(\cdot)^T$ denotes a transpose, and $I, J, M, N$ denote the number of frequency bins, time frames, microphones and sources respectively. $\mathbf{A}_i$ is an $M \times N$ mixing matrix comprising $N$ steering vectors for the $N$ respective sources. In a determined case, $M = N$ and the square matrix $\mathbf{A}_i$ has a valid inverse. Therefore, [10] proposed IVA by defining a demixing matrix $\mathbf{W}_i = \mathbf{A}_i^{-1} = (\mathbf{w}_{i,1}, \ldots, \mathbf{w}_{i,M})^H$ and reformulated Eq. (1) as
\[
\mathbf{y}_{ij} = \mathbf{W}_i \mathbf{x}_{ij}, \tag{2}
\]
where $\mathbf{y}_{ij} = (y_{ij,1}, \ldots, y_{ij,M})^T$ denotes the estimated source spectra and $(\cdot)^H$ denotes a Hermitian transpose.

ILRMA extends IVA by independently modelling each source with an isotropic complex Gaussian distribution. For each source index $m \in \{1, \ldots, M\}$, the distribution variance denoted as $r_{ij,m}$ is non-negative and modelled using NMF as
\[
r_{ij,m} = \sum_{k=1}^{K_m} t_{ik,m} v_{kj,m}, \tag{3}
\]
where $t_{ik,m}$ and $v_{kj,m}$ are the elements of basis and activation vectors respectively, and $K_m$ is NMF's model complexity parameter, signifying the number of basis vectors for the $m$th source. The cost function $Q$ of ILRMA is derived in [11] as
\[
Q = -2J \sum_i \log|\det \mathbf{W}_i| + \sum_{i,j,m} \left[ \log r_{ij,m} + \frac{|y_{ij,m}|^2}{r_{ij,m}} \right]. \tag{4}
\]
ILRMA estimates the demixing parameters by minimizing the cost function $Q$ using the multi-source NMF modelling in Eq. (3).

In the formulation of ILRMA, note that a set of complexity parameters $\{K_m, 1 \le m \le M\}$ is to be specified by the user.
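To make the notation concrete, the demixing model of Eq. (2), the NMF variance model of Eq. (3), and the cost function of Eq. (4) can be sketched in NumPy as follows. This is a minimal illustration under our own array shapes and names, not the authors' implementation:

```python
import numpy as np

def demix(W, X):
    """Eq. (2): y_ij = W_i x_ij for every frequency bin i.

    W: (I, M, M) complex demixing matrices, X: (I, J, M) mixture spectra.
    Returns Y: (I, J, M) estimated source spectra.
    """
    # contract over the channel dimension of x for each bin i, frame j
    return np.einsum('imn,ijn->ijm', W, X)

def nmf_variance(T, V):
    """Eq. (3): r_ij,m = sum_k t_ik,m v_kj,m.

    T: (I, K, M) bases, V: (K, J, M) activations. Returns R: (I, J, M).
    """
    return np.einsum('ikm,kjm->ijm', T, V)

def ilrma_cost(W, X, R):
    """Eq. (4): the ILRMA cost Q (up to an additive constant)."""
    Y = demix(W, X)
    I, J, M = X.shape
    logdet = np.log(np.abs(np.linalg.det(W)))  # log|det W_i| per bin
    return float(-2.0 * J * logdet.sum()
                 + (np.log(R) + np.abs(Y) ** 2 / R).sum())
```

With `W` set to identity matrices and `R` set to the mixture power `|X|^2`, each bracketed term in Eq. (4) reduces to `log R + 1`, which gives a quick sanity check on the implementation.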
Failing to provide a reasonable estimate of each source's complexity will lead to over-fitting or under-fitting. This is discussed in [11] by comparing the case of separating speech signals with that of music signals: the former requires only a small number of bases for optimal modelling, while the latter requires considerably more. The model complexity parameter can therefore affect separation performance, depending on the types of sources being separated. For simplicity, ILRMA assumes an equal number of bases for each source, i.e. $\{K_1 = \cdots = K_M = K\}$. This further limits ILRMA's ability to optimally separate a low-complexity source and a high-complexity source from their mixture signals.
3. PROPOSED MULTI-SOURCE MODELLING
We overcome the limitation of ILRMA by proposing a probabilistic modelling of the variance of the source distributions. In such techniques, it is common to introduce hidden variables to capture the structure of the given observed data, and to utilize inference algorithms to estimate the posterior distribution. Accordingly, our proposed method flexibly models each source using a large number of basis vectors $K$ and incorporates a reliability value for each basis vector. We denote each reliability value as $z_{k,m}$, which can be interpreted as a quantified contribution of the $k$th basis towards the $m$th source.

In contrast to the NMF-based source modelling in Eq. (3), we propose a probabilistic model for the source variance $r_{ij,m}$ as
\[
r_{ij,m} = \sum_{k=1}^{K} z_{k,m} t_{ik,m} v_{kj,m}, \tag{5}
\]
where the prior distributions for each of $t_{ik,m}$, $v_{kj,m}$ and $z_{k,m}$ are drawn as
\[
p(t_{ik,m}) \sim \mathrm{Gamma}(a, a), \quad
p(v_{kj,m}) \sim \mathrm{Gamma}(b, b), \quad
p(z_{k,m}) \sim \mathrm{Gamma}(c, c_m), \tag{6}
\]
where $a, b, c$ are positive constants and $\mathrm{Gamma}(\cdot, \cdot)$ is a gamma distribution defined over a shape parameter and a rate (inverse-scale) parameter. When $c \ll 1$, a sparse prior is set over the reliability values, which therefore adapts to different model complexities depending on each source's characteristics [12]. As each source's expected variance should correspond to the expectation of its power, the choice of prior parameters requires that $\mathbb{E}_p[|y_{ij,m}|^2] = \mathbb{E}_p[r_{ij,m}]$:
\[
\mathbb{E}_p[|y_{ij,m}|^2] = \sum_k \mathbb{E}_p[z_{k,m}]\, \mathbb{E}_p[t_{ik,m} v_{kj,m}] = \sum_k (c/c_m),
\]
\[
\Rightarrow \quad c_m = cK \Big[ \sum_i \sum_j |y_{ij,m}|^2 / (IJ) \Big]^{-1}. \tag{7}
\]
Minimizing the cost function in Eq. (4), with the variance model updated by Eq. (5), is the central focus of our approach.

The isotropic Gaussian distribution of the sources is not conjugate to the gamma distributions of the source parameters $t, v, z$. This non-conjugacy precludes the use of Markov chain Monte Carlo (MCMC) methods such as Metropolis-Hastings and Gibbs sampling for inferring our desired posterior [13, 14].
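As a small numerical check of Eq. (7), the per-source rate $c_m$ can be computed so that the prior expectation of $r_{ij,m}$ matches the mean source power. The helper below is our own naming (the constant $c$ is written `c0`), not code from the paper:

```python
import numpy as np

def scale_hyperparameter(Y, c0, K):
    """c_m of Eq. (7): c_m = c0 * K / (mean power of source m).

    With Gamma(shape, rate) priors of Eq. (6), E[t] = E[v] = 1 and
    E[z] = c0 / c_m, so E[r_ij,m] = K * c0 / c_m equals the mean power
    of source m by construction.

    Y: (I, J, M) current source-spectra estimates. Returns c_m: (M,).
    """
    I, J, _ = Y.shape
    mean_power = (np.abs(Y) ** 2).sum(axis=(0, 1)) / (I * J)
    return c0 * K / mean_power
```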
However, variational approaches have provided alternatives for non-conjugate models that are also faster for large amounts of data. We adopt a fully factorized mean-field variational inference technique, as it assumes conditional independence among the hidden variables [15] and approximates them from a family of conditional distributions over variational parameters [16]. Generalized inverse Gaussian (GIG) distributions are chosen as our variational family and are expressed as
\[
\mathrm{GIG}(\theta \mid \gamma, \rho, \tau) = \frac{\exp\{(\gamma - 1)\log\theta - \rho\theta - \tau/\theta\}}{2\,(\tau/\rho)^{\gamma/2}\,\mathcal{K}_\gamma(2\sqrt{\rho\tau})}, \tag{8}
\]
where $\mathcal{K}_\gamma(\cdot)$ is a modified Bessel function of the second kind and $\gamma, \rho, \tau$ are the variational hyper-parameters. GIG's sufficient statistics include $(1/\theta)$, which eases the optimization of our cost function's $(1/r_{ij,m})$ term and motivates our choice of GIG [17]. We define the conditional distributions as
\[
q(t_{ik,m} \mid \Theta \backslash t_{ik,m}) \sim \mathrm{GIG}(a, \rho^{(t)}_{ik,m}, \tau^{(t)}_{ik,m}),
\]
\[
q(v_{kj,m} \mid \Theta \backslash v_{kj,m}) \sim \mathrm{GIG}(b, \rho^{(v)}_{kj,m}, \tau^{(v)}_{kj,m}),
\]
\[
q(z_{k,m} \mid \Theta \backslash z_{k,m}) \sim \mathrm{GIG}(c, \rho^{(z)}_{k,m}, \tau^{(z)}_{k,m}). \tag{9}
\]
We now derive update equations from the cost function in Eq. (4) using a first-order Taylor expansion and Jensen's inequality [18], by introducing respective auxiliary positive constants $\alpha_{ij,m}$ and $\beta_{ijk,m}$:
\[
Q + 2J \sum_i \log|\det \mathbf{W}_i| = \sum_{i,j,m} \left[ \log r_{ij,m} + \frac{|y_{ij,m}|^2}{r_{ij,m}} \right]
\]
\[
\le \sum_{i,j,m} \mathbb{E}_q\!\left[ \log r_{ij,m} + \frac{|y_{ij,m}|^2}{r_{ij,m}} \right]
+ \mathbb{E}_q\!\left[ \log \frac{q(t \mid \Theta\backslash t)}{p(t \mid a)} \right]
+ \mathbb{E}_q\!\left[ \log \frac{q(v \mid \Theta\backslash v)}{p(v \mid b)} \right]
+ \mathbb{E}_q\!\left[ \log \frac{q(z \mid \Theta\backslash z)}{p(z \mid c, c_m)} \right]
\]
\[
\le \sum_{i,j} \left[ \sum_{k,m} |y_{ij,m}|^2 \beta^2_{ijk,m}\, \mathbb{E}_q\!\big[ z^{-1}_{k,m} t^{-1}_{ik,m} v^{-1}_{kj,m} \big]
+ \log \alpha_{ij,m} - 1 + \frac{1}{\alpha_{ij,m}} \sum_k \mathbb{E}_q[z_{k,m} t_{ik,m} v_{kj,m}] \right]
\]
\[
+ \mathbb{E}_q\!\big[ -\rho^{(t)}_{ik,m} t_{ik,m} - \tau^{(t)}_{ik,m}/t_{ik,m} + a\, t_{ik,m} \big]
+ \mathbb{E}_q\!\big[ -\rho^{(v)}_{kj,m} v_{kj,m} - \tau^{(v)}_{kj,m}/v_{kj,m} + b\, v_{kj,m} \big]
+ \mathbb{E}_q\!\big[ -\rho^{(z)}_{k,m} z_{k,m} - \tau^{(z)}_{k,m}/z_{k,m} + c_m z_{k,m} \big] + C, \tag{10}
\]
where $C$ denotes a leftover constant. Note that the pairs $(t, t^{-1})$, $(v, v^{-1})$ and $(z, z^{-1})$ are sufficient statistics for their respective GIG distributions. This allows us to avoid taking partial derivatives and to directly derive the analytic coordinate-ascent updates for our hyper-parameters by comparing the coefficients of the sufficient statistics [17].
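Because the variational family in Eq. (9) is GIG, the expectations needed in the coordinate-ascent updates reduce to ratios of modified Bessel functions of the second kind. A sketch of this helper (our own, using SciPy's `kv`):

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def gig_expectations(gamma, rho, tau):
    """E[theta] and E[1/theta] for theta ~ GIG(gamma, rho, tau) of Eq. (8):

        E[theta]   = K_{gamma+1}(2*sqrt(rho*tau)) * sqrt(tau)
                     / (K_gamma(2*sqrt(rho*tau)) * sqrt(rho)),
        E[1/theta] = K_{gamma-1}(2*sqrt(rho*tau)) * sqrt(rho)
                     / (K_gamma(2*sqrt(rho*tau)) * sqrt(tau)).
    """
    s = 2.0 * np.sqrt(rho * tau)
    k0 = kv(gamma, s)
    e_theta = kv(gamma + 1.0, s) * np.sqrt(tau) / (k0 * np.sqrt(rho))
    e_inv_theta = kv(gamma - 1.0, s) * np.sqrt(rho) / (k0 * np.sqrt(tau))
    return e_theta, e_inv_theta
```

This matches SciPy's `geninvgauss(p=gamma, b=2*sqrt(rho*tau), scale=sqrt(tau/rho))` parameterization of the same distribution, which gives an independent check of the mean.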
The constants $\alpha_{ij,m}$ and $\beta_{ijk,m}$ that tighten the above inequality (10) are:
\[
\alpha_{ij,m} = \sum_k \mathbb{E}_q[z_{k,m}]\, \mathbb{E}_q[t_{ik,m}]\, \mathbb{E}_q[v_{kj,m}], \tag{11}
\]
\[
\beta_{ijk,m} = \frac{\big(\mathbb{E}_q[z^{-1}_{k,m}]\, \mathbb{E}_q[t^{-1}_{ik,m}]\, \mathbb{E}_q[v^{-1}_{kj,m}]\big)^{-1}}{\sum_k \big(\mathbb{E}_q[z^{-1}_{k,m}]\, \mathbb{E}_q[t^{-1}_{ik,m}]\, \mathbb{E}_q[v^{-1}_{kj,m}]\big)^{-1}}. \tag{12}
\]
The expectations of the source parameters in Eqs. (11) and (12) can be obtained as the expectations of a random variable $\theta$ defined over the GIG distribution in Eq. (8):
\[
\mathbb{E}[\theta] = \frac{\mathcal{K}_{\gamma+1}(2\sqrt{\rho\tau})\,\sqrt{\tau}}{\mathcal{K}_{\gamma}(2\sqrt{\rho\tau})\,\sqrt{\rho}}, \qquad
\mathbb{E}\!\left[\frac{1}{\theta}\right] = \frac{\mathcal{K}_{\gamma-1}(2\sqrt{\rho\tau})\,\sqrt{\rho}}{\mathcal{K}_{\gamma}(2\sqrt{\rho\tau})\,\sqrt{\tau}}. \tag{13}
\]
The update equations for our hyper-parameters are derived as
\[
\rho^{(t)}_{ik,m} = a + \mathbb{E}_q[z_{k,m}] \sum_j \mathbb{E}_q[v_{kj,m}]\, \alpha^{-1}_{ij,m}, \tag{14}
\]
\[
\tau^{(t)}_{ik,m} = \sum_j |y_{ij,m}|^2 \beta^2_{ijk,m}\, \mathbb{E}_q[z^{-1}_{k,m}]\, \mathbb{E}_q[v^{-1}_{kj,m}], \tag{15}
\]
\[
\rho^{(v)}_{kj,m} = b + \mathbb{E}_q[z_{k,m}] \sum_i \mathbb{E}_q[t_{ik,m}]\, \alpha^{-1}_{ij,m}, \tag{16}
\]
\[
\tau^{(v)}_{kj,m} = \sum_i |y_{ij,m}|^2 \beta^2_{ijk,m}\, \mathbb{E}_q[z^{-1}_{k,m}]\, \mathbb{E}_q[t^{-1}_{ik,m}], \tag{17}
\]
\[
\rho^{(z)}_{k,m} = c_m + \sum_i \sum_j \mathbb{E}_q[t_{ik,m}]\, \mathbb{E}_q[v_{kj,m}]\, \alpha^{-1}_{ij,m}, \tag{18}
\]
\[
\tau^{(z)}_{k,m} = \sum_i \sum_j |y_{ij,m}|^2 \beta^2_{ijk,m}\, \mathbb{E}_q[t^{-1}_{ik,m}]\, \mathbb{E}_q[v^{-1}_{kj,m}]. \tag{19}
\]
Given each source's variance, the proposed method does not alter the partial derivatives of the cost function $Q$ with respect to the demixing matrices $\mathbf{W}_i$. Hence their update equations coincide with those described for IVA using an auxiliary function technique [10], and are derived as follows:
\[
\mathbf{V}_{i,m} = \frac{1}{J} \sum_j \frac{1}{r_{ij,m}}\, \mathbf{x}_{ij} \mathbf{x}^H_{ij}, \tag{20}
\]
\[
\mathbf{w}_{i,m} \leftarrow (\mathbf{W}_i \mathbf{V}_{i,m})^{-1} \mathbf{e}_m, \tag{21}
\]
\[
\mathbf{w}_{i,m} \leftarrow \mathbf{w}_{i,m} \big(\mathbf{w}^H_{i,m} \mathbf{V}_{i,m} \mathbf{w}_{i,m}\big)^{-1/2}, \tag{22}
\]
where $\mathbf{e}_m$ is a unit vector whose $m$th element equals one. After the elements of the demixing matrix are estimated, the separated source spectra can be extracted as
\[
y_{ij,m} \leftarrow \mathbf{w}^H_{i,m} \mathbf{x}_{ij}. \tag{23}
\]
In each iteration of the proposed method, the hyper-parameters $\rho^{(\cdot)}$ and $\tau^{(\cdot)}$ are updated using Eqs. (14)-(19). The demixing matrices $\mathbf{W}_i$ and the separated source spectra are then updated using Eqs. (20)-(23). As each source's modelling begins with a large $K$, our variational inference is computationally exacting. However, over a few iterations the sparse prior placed over $z$ identifies the small number of bases which are reliable. We therefore employ a thresholding technique [19] to skip the optimization of the less reliable bases in subsequent iterations.
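The basis-skipping step can be sketched as follows. The relative normalization and the threshold value are illustrative choices of ours, not taken from [19]:

```python
import numpy as np

def active_basis_mask(Ez, threshold=1e-3):
    """Boolean (K, M) mask of bases to keep optimizing.

    A basis k of source m is kept while its expected reliability
    E[z_k,m] remains a non-negligible fraction of that source's
    strongest basis; updates for masked-out bases can be skipped
    in subsequent iterations.
    """
    rel = Ez / Ez.max(axis=0, keepdims=True)  # normalize per source m
    return rel > threshold
```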
4. SIMULATIONS AND RESULTS

4.1. Experimental Conditions
We evaluate our proposed method on the DSD dataset containing professionally produced songs from the 2016 SiSEC challenge [20].

[Fig. 1: Experimental setup with Source 1 and Source 2 placed at 50° on either side of the microphone pair; E2A recording conditions (reverberation time: 300 ms).]

Each song consists of clean sounds for vocals, drums, bass and other accompaniments, and lasts for more than - minutes. We therefore choose only a -second portion ( s- s time interval) of the clean sources and down-sample them to kHz for creating mixture signals. For each of these songs, we randomly choose between the pairs (drums, vocals) or (bass, vocals) and create synthetic two-channel reverberant mixture signals using the recording conditions shown in Fig. 1. Room impulse responses E2A (T = 300 ms) for the above recording conditions were obtained from the RWCP Sound Scene Database [21].

Mixture spectra are estimated from the time-domain signals using a Hamming window of length ms shifted every ms, which were found to be optimal by [22]. Each demixing matrix $\mathbf{W}_i$ is initialized with an identity matrix. We model each source's variance with $K = 30$ bases and set $a = b = 0.$ , $c = 1/K$. Hyper-parameters $\rho^{(\cdot)}$ and $\tau^{(\cdot)}$ are initialized randomly from gamma distributions with shape and rate parameters set to . Parameters are optimized for iterations, and the separated spectra are then converted to the time domain using a back-projection technique [23].

Three metrics: signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR) and signal-to-artifacts ratio (SAR) [24] are used to evaluate the quality of the separated sources. The separation performance of the proposed method is compared with three conventional BSS methods, i.e. IVA [10], MNMF [9] and ILRMA [11] (with a fixed number of source bases). Each separation is repeated for different random initializations, and the averages of the above performance metrics are reported in Table 1.

Table 1: Comparison of Separation Performance

Methods     | SDR   | SIR   | SAR
IVA [10]    |  . dB |  . dB |  . dB
MNMF [9]    |  . dB |  . dB |  . dB
ILRMA [11]  |  . dB |  . dB |  . dB
Proposed    |  . dB |  . dB |  . dB

Although the proposed method outperforms ILRMA, it is important to verify that we overcome its limitation of having to tune the complexity parameter $K$. This limitation can be seen in Fig. 2, where ILRMA's SDR, averaged over all the mixture signals containing (bass, vocals), increases as $K$ increases, while the opposite trend is seen for mixture signals containing (drums, vocals). The proposed method, on the other hand, starts with a large value of $K$ ($= 30$ in this case) and tunes itself depending on each source's characteristics. Hence it optimally separates the sources from both types of mixture signals. The choice of $K$, if sufficiently large, does not significantly impact the proposed method's performance.
Fig. 2: SDR for ILRMA using different numbers of bases, as compared with the proposed method's SDR, for mixture signals of bass and vocals (left), and of drums and vocals (right).
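For reference, a naive SDR can be computed as the energy ratio between the reference signal and the estimation error. Note that the full BSS Eval decomposition of [24] additionally splits the error into interference and artifact terms to obtain SIR and SAR; the helper below is our own simplification, not the evaluation code used in the experiments:

```python
import numpy as np

def naive_sdr(reference, estimate):
    """Energy ratio (in dB) between the reference time-domain signal
    and the estimation error; a simplified stand-in for BSS Eval's SDR."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))
```

For example, an estimate that recovers 90% of a constant reference has an error energy of 1% of the signal energy, i.e. an SDR of 20 dB.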
As each source is modelled individually using NMF, it is possible to extend our approach by considering a common basis and activation matrix for capturing the variance of all sources, i.e. $r_{ij,m} = \sum_k z_{k,m} t_{ik} v_{kj}$. This requires fewer parameters to be estimated, and its computational complexity is at least halved compared to that of the proposed method. We note that this extension is similar to Ozerov's MNMF [7] and has also been considered for ILRMA [11]. However, the key difference is that we do not restrict the overall contribution of each basis to be 1, i.e. $\sum_m z_{k,m} = 1 \;\forall k$. Due to space constraints, this extension will be explored in future work.
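The shared-dictionary variance model described above amounts to a single tensor contraction (array shapes below are our own convention):

```python
import numpy as np

def shared_basis_variance(Z, T, V):
    """Variance model r_ij,m = sum_k z_k,m t_ik v_kj with a basis matrix
    T (I, K) and an activation matrix V (K, J) common to all M sources,
    weighted by per-source reliabilities Z (K, M). Returns R (I, J, M).
    """
    return np.einsum('km,ik,kj->ijm', Z, T, V)
```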
5. CONCLUSIONS
This work proposes a Bayesian generalization of ILRMA for determined blind source separation by performing multi-source modelling using non-parametric NMF. Our formulation for individual source modelling is able to overcome the limitation of the conventional method, whose separation performance is affected by NMF's model complexity parameter. The proposed approach is flexible in modelling sources of different complexities and is therefore able to optimally separate them. We further show that our approach outperforms state-of-the-art NMF-based techniques.

6. REFERENCES

[1] J. Foote, "An overview of audio information retrieval," Multimedia Systems, vol. 7, no. 1, pp. 2–10, 1999.
[2] M. Davies, "Audio source separation," in Institute of Mathematics and its Applications Conference Series, 2002, vol. 71, pp. 57–68.
[3] E. Weinstein, M. Feder, and A. V. Oppenheim, "Multi-channel signal separation by decorrelation," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 405–413, 1993.
[4] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, 2010.
[5] X. Cao and R. Liu, "General approach to blind source separation," IEEE Transactions on Signal Processing, vol. 44, no. 3, pp. 562–571, 1996.
[6] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[7] A. Ozerov, C. Févotte, R. Blouet, and J. Durrieu, "Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 257–260.
[8] C. Narisetty, T. Komatsu, and R. Kondo, "Modelling of sound events with hidden imbalances based on clustering and separate sub-dictionary learning," 2018.
[9] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, "Multichannel extensions of non-negative matrix factorization with complex-valued data," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 971–982, 2013.
[10] N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011, pp. 189–192.
[11] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 24, no. 9, pp. 1622–1637, 2016.
[12] V. Y. Tan and C. Févotte, "Automatic relevance determination in nonnegative matrix factorization," in SPARS'09 – Signal Processing with Adaptive Sparse Structured Representations, 2009.
[13] W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," 1970.
[14] D. M. Blei, J. D. Lafferty, et al., "A correlated topic model of science," The Annals of Applied Statistics, vol. 1, no. 1, pp. 17–35, 2007.
[15] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, "Stochastic variational inference," The Journal of Machine Learning Research, vol. 14, no. 1, pp. 1303–1347, 2013.
[16] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[17] M. D. Hoffman, D. M. Blei, and P. R. Cook, "Bayesian nonparametric matrix factorization for recorded music," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 439–446.
[18] J. D. Lafferty and D. M. Blei, "Correlated topic models," in Advances in Neural Information Processing Systems, 2006, pp. 147–154.
[19] J. Paisley and L. Carin, "Nonparametric factor analysis with beta process priors," in ACM Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 777–784.
[20] A. Liutkus, F. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, "The 2016 signal separation evaluation campaign," in Latent Variable Analysis and Signal Separation: 13th International Conference, LVA/ICA, Grenoble, France, 2017, pp. 323–332.
[21] S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, "Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition," in LREC, 2017, pp. 1170–1174.
[23] N. Murata, S. Ikeda, and A. Ziehe, "An approach to blind source separation based on temporal structure of speech signals," Neurocomputing, vol. 41, no. 1-4, pp. 1–24, 2001.
[24] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation,"