Independent Vector Analysis with Deep Neural Network Source Priors
Xi-Lin Li
GMEMS Technologies, Inc., 366 Fairview Way, Milpitas, CA 95035 (e-mail: [email protected])
ABSTRACT
This paper studies the density priors for independent vector analysis (IVA), with convolutive speech mixture separation as the exemplary application. Most existing source priors for IVA are too simplified to capture the fine structures of speech. Here, we show for the first time that it is possible to efficiently estimate the derivative of the speech density with universal approximators like deep neural networks (DNNs) by optimizing certain proxy separation-related performance indices. Experimental results suggest that the resulting neural network density priors consistently outperform previous ones in convergence speed for online implementations and in signal-to-interference ratio (SIR) for batch implementations.
Index Terms — Independent vector analysis (IVA), convolutive speech separation, speech probability density, neural network, cocktail party problem.
1. INTRODUCTION
Speech separation, also known as the cocktail party problem, is a fundamental signal processing task. Although supervised neural network based speech separation has seen a recent surge of interest, unsupervised approaches, e.g., independent component analysis (ICA) based on the Infomax principle [1] and independent vector analysis (IVA) [2], remain attractive due to their simplicity and low complexity, and the wide availability of multichannel recordings on today's end devices such as smartphones, tablets, personal computers, smart speakers, and many other Internet of Things (IoT) devices. The probability density function (pdf) of speech is the key component driving the separation of mixtures in these unsupervised frameworks. The most widely adopted pdf models for speech are the multivariate Laplace and generalized Gaussian distributions [3, 4, 5], in either the time or the frequency domain. Specifically, previously studied multivariate source priors for IVA include the Laplace prior [2], non-spherical priors represented by chain-like overlapped cliques [6, 7], Student's t-distribution [8], the generalized Gaussian distribution (GGD) [9], the complex Gaussian scale mixture (CGSM) [10], and Gaussian mixture model (GMM) priors [11]. However, most of these source priors are too simplified to capture the fine structures of speech. A finite mixture model (FMM) is expressive enough, but can be too complicated due to the need to estimate its nuisance parameters. Indeed, only the separation of mixtures of two sources is considered with the GMM source prior in [11]. Density estimation for a multivariate random variable is known to be a hard problem due to the curse of dimensionality. Fortunately, as in most maximum likelihood (ML) estimation problems, ICA or IVA for speech separation only requires the derivative of the density, which can be estimated with less difficulty in practice, as shown in this paper.
Here, we choose the IVA framework for speech separation since it operates in the frequency domain and is computationally cheaper than convolutive ICA implemented in the time domain. In the training phase, neural networks are used to approximate the derivative of the speech density by optimizing certain proxy separation-related objectives. In the test phase, these neural network source priors are fixed, and only the separation matrices are adapted. In this way, our source priors are expressive, and yet the learning rules for updating the separation matrices remain simple.

Before ending the introduction, we briefly summarize the main contributions of our work and its relation to prior work. Our approach is not a supervised speech separation method, although both use neural networks as universal approximators. Supervised speech separation has attracted a lot of attention recently. It typically assumes that the training and test mixtures are generated in similar fashions. Hence, the resulting black-box models can only be applied to very specific scenarios, e.g., the single- and multiple-channel separation methods in [12] and [13], respectively. On the other hand, IVA is a well formulated optimization problem: the same source prior can be useful in different mixing scenarios. One main contribution of this paper is a practical approach for estimating the source priors in IVA with universal approximators like neural networks. Another main contribution is the experimental demonstration of the performance gain of the resulting neural network priors over previous ones in a wide range of speech separation tasks.
2. BACKGROUND

2.1. Mixing and Separation Models
We assume that there are $N$ speech sources and $N$ microphones. The recording of the $m$th microphone is expressed as $x_m(i) = \sum_{n=1}^{N} \sum_{j=0}^{L} a_{mn}(j)\, s_n(i-j)$, where $1 \le m \le N$, $i$ and $j$ are two discrete time indices, $a_{mn}(j)$ is the room impulse response (RIR) from the $n$th source to the $m$th receiver, $L+1$ is the length of the RIR, and $s_n(i)$ is the $n$th source signal. It is convenient to rewrite the mixtures compactly as $\mathbf{x}(i) = \sum_{j=0}^{L} \mathbf{A}(j)\, \mathbf{s}(i-j)$, where $\mathbf{x}(i) = [x_1(i), \ldots, x_N(i)]^T$, $\mathbf{s}(i) = [s_1(i), \ldots, s_N(i)]^T$, $\mathbf{A}(j)$ is the mixing matrix, and superscript $T$ denotes transpose. Reversing the convolutive mixtures in the time domain can be computationally expensive. Hence, it is more popular to consider the mixing and separation models in the frequency domain as $\mathbf{X}(\omega_k, t) = \mathbf{H}(\omega_k) \mathbf{S}(\omega_k, t)$ and $\mathbf{Y}(\omega_k, t) = \mathbf{W}(\omega_k) \mathbf{X}(\omega_k, t)$, where $1 \le k \le K$, $K$ is the number of frequency bins, $\omega_k$ is the discrete angular frequency, $t$ is the frame index, $\mathbf{H}(\omega_k)$ is the mixing matrix, $\mathbf{W}(\omega_k)$ is the separation matrix, $\mathbf{S}(\omega_k, t) = [S_1(\omega_k, t), \ldots, S_N(\omega_k, t)]^T$, $\mathbf{X}(\omega_k, t) = [X_1(\omega_k, t), \ldots, X_N(\omega_k, t)]^T$, and $\mathbf{Y}(\omega_k, t) = [Y_1(\omega_k, t), \ldots, Y_N(\omega_k, t)]^T$. Clearly, the frequency resolution needs to be high enough to well approximate the linear convolution in the time domain as $K$ instantaneous frequency domain mixing processes. Let
$\mathbf{S}_n(t) = [S_n(\omega_1, t), S_n(\omega_2, t), \ldots, S_n(\omega_K, t)]^T$ and $\mathbf{Y}_n(t) = [Y_n(\omega_1, t), Y_n(\omega_2, t), \ldots, Y_n(\omega_K, t)]^T$, where $1 \le n \le N$. Note that $\mathbf{S}_m(t)$ and $\mathbf{S}_n(t)$ are two independent complex valued source vectors for $1 \le m \ne n \le N$, hence the name IVA. IVA further assumes that $\mathbf{S}_n(t_1)$ and $\mathbf{S}_n(t_2)$ are independent for $t_1 \ne t_2$, although this might not be true in reality. Then, we can write the pdf of the observed mixtures as
$$p_X[\mathbf{X}(\omega_1), \ldots, \mathbf{X}(\omega_K)] = \frac{\prod_{n=1}^{N} p_S(\mathbf{S}_n)}{\prod_{k=1}^{K} |\det[\mathbf{H}(\omega_k)]|} \quad (1)$$
where $|\det(\cdot)|$ denotes the absolute determinant of a square matrix, $p_S(\cdot)$ is the pdf of the speech signal in the frequency domain, and we have omitted the frame index $t$ to simplify our writing. Hence, the ML estimate of the separation matrices is given by the minimum of the following expected negative log-likelihood (NLL) function
$$\begin{aligned} J(\mathbf{W}(\omega_1), \ldots, \mathbf{W}(\omega_K)) &= E\{-\log p_X[\mathbf{X}(\omega_1), \ldots, \mathbf{X}(\omega_K)] \mid \mathbf{W}(\omega_1), \ldots, \mathbf{W}(\omega_K)\} \\ &= E\Big[-\sum_{n=1}^{N} \log p_S(\mathbf{Y}_n) - \sum_{k=1}^{K} \log |\det[\mathbf{W}(\omega_k)]|\Big] \quad (2) \end{aligned}$$
Thus, IVA turns into a well defined optimization problem once the form of the source prior, i.e., $p_S(\cdot)$, is given. Natural or relative gradient descent [15, 14] is the most popular optimization method for minimizing the NLL in (2). For batch processing, the relative Newton method [10, 16] and the auxiliary function technique for spherical source priors [17] are shown to converge fast. Here, we choose natural gradient descent as the optimizer since it is suitable for both online and batch implementations. The learning rate for the separation matrix updates is bin-wise normalized as in the method proposed in [18]. Hence, the only remaining piece to be solved is the source priors.
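To make the natural gradient update concrete, here is a minimal NumPy sketch of one step of minimizing (2); the multivariate Laplace score is used only as a stand-in for the source prior derivative, and the function names, step size, and frame-averaged expectation are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def laplace_score(Y):
    """Score of the multivariate Laplace prior: phi[n,k,t] = Y[n,k,t]/||Y[n,:,t]||."""
    norms = np.sqrt(np.sum(np.abs(Y) ** 2, axis=1, keepdims=True)) + 1e-12
    return Y / norms

def natural_gradient_step(W, X, mu=0.05, score=laplace_score):
    """One natural gradient update of the per-bin separation matrices.

    W: (K, N, N) complex separation matrices, one per frequency bin.
    X: (N, K, T) complex mixtures (sources x bins x frames).
    Returns the updated W and the separated outputs Y.
    """
    N, K, T = X.shape
    # Y(w_k, t) = W(w_k) X(w_k, t) for every bin k and frame t
    Y = np.einsum('knm,mkt->nkt', W, X)
    phi = score(Y)
    for k in range(K):
        # E[phi(Y) Y^H] estimated by averaging over the T frames
        C = phi[:, k, :] @ Y[:, k, :].conj().T / T
        # natural gradient direction: dW = (I - E[phi Y^H]) W
        W[k] = W[k] + mu * (np.eye(N) - C) @ W[k]
    return W, Y
```

With a neural network prior, `laplace_score` would be replaced by the network's score output described in Section 3.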
3. DEEP NEURAL NETWORK PRIORS FOR IVA

3.1. Neural Network Density Model for Speech
Let us suppress the indices $n$ and $t$, and simply write the density of $\mathbf{S} = [S(\omega_1), \ldots, S(\omega_K)]^T$ as $p(\mathbf{S}) = p[S(\omega_1), \ldots, S(\omega_K)]$. It is reasonable to impose two regularities on the possible forms of $p(\mathbf{S})$. First, $\mathbf{S}$ must be circular in the sense that $p(\mathbf{S})$ depends only on the amplitudes of $S(\omega_k)$ for $1 \le k \le K$, not on their phases. Second, $\mathbf{S}$ must be sparse, i.e., $\partial p(\lambda \mathbf{S})/\partial \lambda \le 0$ for any $\mathbf{S}$ and $\lambda > 0$. Then, $p(\mathbf{S})$ can only have the form
$$-\log p(\mathbf{S} \mid \boldsymbol{\theta}) = F(|S(\omega_1)|, \ldots, |S(\omega_K)|, \boldsymbol{\theta}) \quad (3)$$
where $\boldsymbol{\theta}$ is a pdf parameter vector, and $F(\cdot)$ is a properly chosen function. Indeed, any such $F(\cdot)$ can define a valid pdf as long as $\exp(-F)$ is integrable. The sparsity regularity requires that
$$\frac{\partial F(|S(\omega_1)|, \ldots, |S(\omega_K)|, \boldsymbol{\theta})}{\partial |S(\omega_k)|} \ge 0, \quad 1 \le k \le K \quad (4)$$
Notice that minimizing the NLL in (2) only requires the following derivative,
$$-\frac{\partial \log p(\mathbf{S} \mid \boldsymbol{\theta})}{\partial S^*(\omega_k)} = \frac{\partial F(|S(\omega_1)|, \ldots, |S(\omega_K)|, \boldsymbol{\theta})}{\partial |S(\omega_k)|} \cdot \frac{S(\omega_k)}{2|S(\omega_k)|} \quad (5)$$
where superscript $*$ denotes conjugation. Thus, all we need are the $K$ derivatives in (4), which can be approximated using a feedforward neural network (FNN) with nonnegative outputs.

It is also possible to consider the temporal dependence among successive frames from the same source signal. Specifically, for Markov sources, we have
$$p(\mathbf{S}(t) \mid \mathbf{S}(t-1), \ldots, \mathbf{S}(1), \boldsymbol{\theta}) = p(\mathbf{S}(t) \mid \mathbf{h}(t-1), \boldsymbol{\theta}) \quad (6)$$
where $\mathbf{h}(t)$ is a hidden state vector at time $t$. We could use a recurrent neural network (RNN) with $K$ nonnegative outputs to model such densities as well.

A neural network usually performs best with normalized inputs. Here, we define the normalized spectrum vector as $\bar{\mathbf{S}} = \mathbf{S}/\|\mathbf{S}\|$, where $\|\mathbf{S}\|$ is the length of $\mathbf{S}$. The amplitudes of its elements can be further compressed with an element-wise logarithm. We have tested the following neural network density model in our experiments,
$$-\frac{\partial \log p(\mathbf{S} \mid \mathbf{h}, \boldsymbol{\theta})}{\partial \mathbf{S}^*} = \log[1 + \exp(\boldsymbol{\gamma})] \odot \bar{\mathbf{S}}$$
with $\boldsymbol{\gamma}$ as the output of the following three layered network
$$\begin{aligned} \boldsymbol{\alpha}(t) &= \tanh(\boldsymbol{\Theta}_1 [\log|\bar{\mathbf{S}}(t)|;\, \log\|\mathbf{S}(t)\|;\, \mathbf{h}(t-1);\, 1]) \\ \boldsymbol{\beta}(t) &= \tanh(\boldsymbol{\Theta}_2 [\boldsymbol{\alpha}(t);\, 1]) \\ \boldsymbol{\gamma}(t) &= \boldsymbol{\Theta}_3 [\boldsymbol{\beta}(t);\, 1] \end{aligned} \quad (7)$$
where $\{\boldsymbol{\Theta}_1, \boldsymbol{\Theta}_2, \boldsymbol{\Theta}_3\}$ are the model parameters, $|\cdot|$ takes the element-wise absolute value, $[\cdot\,;\cdot]$ denotes stacking column vectors vertically, $\odot$ denotes the element-wise product, and the hidden state vector $\mathbf{h}(t-1)$ is a subset of $\boldsymbol{\alpha}(t-1)$. Specifically, (7) defines an FNN when $\mathbf{h}(t) = [\,]$, and an RNN otherwise. The RNN model can only be used to update the separation matrices sequentially, keeping the temporal order, while the FNN has no such limitation. It is possible to consider more complicated priors. Nevertheless, these simple ones perform competitively in our experiments.

The separation results are determined by the source priors, given the learning rules for updating the separation matrices. Thus, it is possible to choose a proxy performance index measuring the goodness of separation, and 'learn' the source priors to optimize the chosen proxy objective. In our experiments, we choose the following average permutation invariant (PI) absolute coherence as this objective,
$$c(\boldsymbol{\theta}) = \max_{\pi} \frac{1}{NK} \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{|E[Y_{\pi(n)}(\omega_k, t)\, S_n^*(\omega_k, t)]|}{\sqrt{E[|Y_{\pi(n)}(\omega_k, t)|^2]\, E[|S_n(\omega_k, t)|^2]}} \quad (8)$$
where $\pi$ denotes an element of the set of all possible permutations of the list $[1, 2, \ldots, N]$, $\pi(n)$ is the $n$th element of permutation $\pi$, and we deliberately write $c(\boldsymbol{\theta})$ as a function of $\boldsymbol{\theta}$ to show its dependence on the source prior parameter vector $\boldsymbol{\theta}$. Similar PI objectives are used in supervised speech separation as well.
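For reference, the PI absolute coherence in (8) can be sketched in NumPy as below; the brute-force search over permutations is only practical for small N, and the function name and the small guard constant are our own assumptions:

```python
import numpy as np
from itertools import permutations

def pi_abs_coherence(Y, S, eps=1e-12):
    """Average permutation invariant absolute coherence, as in (8).

    Y, S: (N, K, T) complex separated outputs and reference sources.
    Expectations are estimated by averaging over the T frames.
    """
    N, K, T = S.shape
    # |E[Y_n S_m^*]| for every (output, source, bin) triple
    num = np.abs(np.einsum('nkt,mkt->nmk', Y, S.conj())) / T
    den = np.sqrt(np.mean(np.abs(Y) ** 2, axis=2)[:, None, :] *
                  np.mean(np.abs(S) ** 2, axis=2)[None, :, :]) + eps
    coh = num / den                      # shape (N outputs, N sources, K)
    # maximize the average coherence over all output-to-source assignments
    return max(np.mean([coh[p[n], n] for n in range(N)])
               for p in permutations(range(N)))
```

Perfect separation up to permutation and scaling gives a value close to 1, which matches the invariances discussed in the text.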
Clearly, $c(\boldsymbol{\theta})$ is invariant to the scaling of the separated outputs as well. In the training phase, the source signals are known. Thus, given the form of a source prior, we can optimize its parameters by maximizing the objective in (8) with deep learning tools like PyTorch. The resulting estimated source prior implicitly defines a pdf suitable for the separation of speech mixtures. Note that, unlike an FMM, there is no need to update the neural network priors in the test phase.
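Before moving to the experiments, here is a NumPy sketch of the FNN variant of the density model (7) (empty hidden state), returning the score in (5); the layer sizes, the exact feature stacking, and the small constant guarding the logarithms are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def fnn_score(S, Th1, Th2, Th3, eps=1e-12):
    """Score -d log p / d S* of the FNN prior of (7) with h(t) empty.

    S: length-K complex spectrum vector.
    Th1, Th2, Th3: layer weight matrices; their last columns multiply the
    constant bias input 1, as in the stacked vectors [.; 1] of (7).
    """
    norm = np.sqrt(np.sum(np.abs(S) ** 2)) + eps
    S_bar = S / norm                                      # normalized spectrum
    feats = np.concatenate([np.log(np.abs(S_bar) + eps),  # compressed amplitudes
                            [np.log(norm), 1.0]])         # norm feature and bias
    alpha = np.tanh(Th1 @ feats)
    beta = np.tanh(Th2 @ np.concatenate([alpha, [1.0]]))
    gamma = Th3 @ np.concatenate([beta, [1.0]])
    return softplus(gamma) * S_bar    # nonnegative gains, as (4) requires
```

Since the gains `softplus(gamma)` are real and positive, the score keeps the phase of each spectral bin, consistent with the circularity regularity.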
4. EXPERIMENTAL RESULTS
A computer program reproducing the results reported below, together with sample separation results for subjective comparison, is available from our website (https://github.com/lixilinx/IVA4Cocktail). The training speeches are from a corpus of read LibriVox English books [19], and the test ones are from the well known TIMIT corpus. All have the same sampling rate. A short time Fourier transform (STFT), with analysis and synthesis windows designed by the method from [20], is used to convert the time domain signals to the frequency domain. This frequency resolution works well for the separation of mixtures with low to moderate reverberation. All the separation matrices are initialized to the identity matrix.

Fig. 1. Comparisons of different source priors (Laplace, GGD, Student's t, non-spherical, FNN, and RNN) in four separation tasks: (a) Experiment 1, SIR (dB) versus time (s); (b) Experiment 2, SIR (dB) versus number of sources; (c) Experiment 3, SIR (dB) versus length of speech (s); (d) Experiment 4, SIR (dB) versus time (s). Results are averaged over independent runs.
We have prepared one FNN and one RNN source prior. The dimensions of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ in (7) are the same. For the RNN model, the leading elements of $\boldsymbol{\alpha}$ serve as the hidden states. We always set $N = 4$ during training. Four randomly selected sources are artificially mixed as $\mathbf{x}(i) = \sum_j \mathbf{A}(j)\, \mathbf{s}(i-j)/(1+|j|)$, where all the elements of $\mathbf{A}(j)$ are standard Gaussian random variables. The absolute coherence in the proxy objective of (8) is estimated over a fixed number of frames. We choose to reset the mixing matrices with a small probability after each evaluation of the proxy objective. The preconditioned stochastic gradient method in [22] is used to optimize the neural network coefficients with its default step size, until the average absolute coherence converges.

The test speeches are convolutively mixed through randomly generated RIRs using the image source method [21]. The size of the simulated room is (length, width, height) = (5, 4, 3), all in meters. The locations of the simulated microphones are randomly and uniformly distributed inside a sphere centered in the room, while the positions of the simulated speech sources are equally distributed outside a larger sphere with the same center. To simulate fractional delays, we first generate the RIRs at a higher sampling rate, and then decimate them to the working sampling rate. The wall reflection coefficients are set such that the typical converged signal-to-interference ratio (SIR) for the separation of two sources is representative of IVA tested on real world mixtures of two speeches recorded in living rooms with low to moderate reverberation.
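The random artificial mixing used during training can be sketched as below; the symmetric tap range $-J \le j \le J$ is an assumption (the paper's exact range is not reproduced here), and a circular shift is used as a simple stand-in for the delay:

```python
import numpy as np

def random_convolutive_mix(s, J=2, rng=None):
    """Mix N sources with random short filters whose taps decay as 1/(1+|j|):
        x(i) = sum_j A(j) s(i - j) / (1 + |j|),  A(j) i.i.d. standard normal.

    s: (N, T) array of source signals; returns the (N, T) mixtures.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, T = s.shape
    x = np.zeros((N, T))
    for j in range(-J, J + 1):
        A = rng.standard_normal((N, N)) / (1 + abs(j))  # tap with 1/(1+|j|) decay
        x += A @ np.roll(s, j, axis=1)                  # circularly shifted s(i - j)
    return x
```

Drawing a fresh mixing system for every training batch is what forces the learned prior to describe the sources rather than any particular mixing scenario.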
We have designed four experiments to compare six source priors for speech separation: the Laplace prior [2], GGD [9], Student's t-distribution [8], a non-spherical prior obtained by grouping the bins into four cliques of equal size on the Mel scale [6], and our estimated FNN and RNN source priors. The scaling ambiguity is resolved by the minimum distortion principle [23].
Experiment 1: This experiment benchmarks the convergence speed for online implementation. The separation matrices are updated once per frame with a fixed normalized learning rate. Here, we set $N = 2$.

Experiment 2: This experiment benchmarks the statistical efficiency of different source priors in batch processing mode. Since the separation matrices are not necessarily updated sequentially in temporal order, the RNN source prior is not considered. We vary the number of sources. Enough separation matrix updates are performed to ensure convergence before measuring the SIR performance, with the normalized learning rate linearly annealed toward the end of the iterations.

Experiment 3: This is one more experiment comparing the statistical efficiency of different source priors in batch processing mode. Unlike Experiment 2, we set $N = 3$ and vary the length of the speeches. We also find that it is necessary to halve the initial normalized learning rate for the Student's t source prior to avoid occasional divergence. The other source priors do not suffer from this issue.

Experiment 4: The last experiment compares the capacity of different source priors for correcting frequency permutations. Prior work and our experience suggest that IVA may be trapped in local minima [24], and thus fail to solve the frequency permutation issue. One typical error pattern is to mix one source's high frequency band with another's low frequency band in a single separated output. Unfortunately, the SIR performance index is insensitive to such errors, as most speech energy concentrates in the low frequency band. To reliably reproduce this misbehavior, we consider a simple $2 \times 2$ artificial mixing system $\mathbf{A}(z)$ consisting of low and high pass Butterworth filters, with low pass responses proportional to $1 + z^{-1}$ on the diagonal, high pass responses proportional to $1 - z^{-1}$ off the diagonal, and a common all-pole denominator.
High frequency energy is emphasized by passing the outputs through the high pass filter $1 - z^{-1}$ before measuring the SIR. The other settings are the same as those of Experiment 1.

Fig. 1 summarizes the experimental results. Experiment 1 suggests that the neural network priors lead to the fastest convergence. The RNN model only delivers a marginal performance gain over the FNN one. The Student's t prior performs the best among the simple ones, confirming the observations in [8]. Both Experiments 2 and 3 suggest that the FNN source prior is significantly more efficient than previous ones for speech separation when the number of sources is large or the length of speech is short. Among the simple priors, Laplace and GGD show similar performance. Still, the GGD prior seems to perform slightly better than the Laplace one, an observation consistent with those in [9]. The non-spherical source prior performs better than the other simple ones only when the length of speech is short. Its performance might be sensitive to the definition of the cliques [6, 7], and our definition is not necessarily optimal for all these tasks. The performance of the Student's t prior can be improved with smaller learning rates and more iterations, but it is still less competitive than the other simple ones in Experiments 2 and 3. Lastly, Experiment 4 suggests that only the neural network source priors are able to solve the low and high frequency band permutation issue. This is not astonishing, since none of the other simple source priors can capture the fine structures of speech.
5. CONCLUSION
Separation of speech mixtures is a longstanding and challenging signal processing problem. The speech density model is the key component in unsupervised separation frameworks like independent vector analysis (IVA). In this paper, we have shown that it is possible to efficiently estimate the derivative of the density of speech represented in the frequency domain by optimizing certain separation related proxy objectives, such as the absolute coherence between the source signals and the separated outputs. Specifically, we have considered neural network speech density priors with heuristic design constraints like circularity and sparsity. Experimental results confirm that these deep neural network source priors considerably outperform previous ones in convergence speed for online implementations and in statistical efficiency in batch processing mode.
6. REFERENCES

[1] K. Torkkola, "Blind separation of convolved sources based on information maximization," in IEEE Workshop on Neural Networks for Signal Processing, Kyoto, Japan, Sept. 1996.
[2] T. Kim, I. Lee, and T.-W. Lee, "Independent vector analysis: definition and algorithms," in Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, Oct. 2006.
[3] S. Gazor and W. Zhang, "Speech probability distribution," IEEE Signal Processing Letters, vol. 10, no. 7, pp. 204–207, Jul. 2003.
[4] T. Eltoft, T. Kim, and T.-W. Lee, "On the multivariate Laplace distribution," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 300–303, May 2006.
[5] A. Aroudi, H. Veisi, H. Sameti, and Z. Mafakheri, "Speech signal modeling using multivariate distributions," EURASIP Journal on Audio, Speech, and Music Processing, vol. 35, Dec. 2015.
[6] I. Lee, G.-J. Jang, and T.-W. Lee, "Independent vector analysis using densities represented by chain-like overlapped cliques in graphical models for separation of convolutedly mixed signals," Electronics Letters, vol. 45, no. 13, pp. 710–711, Jun. 2009.
[7] C. H. Choi, W. Chang, and S. Y. Lee, "Blind source separation of speech and music signals using harmonic frequency dependent independent vector analysis," Electronics Letters, vol. 48, no. 2, pp. 124–125, Jan. 2012.
[8] J. Harris, B. Rivet, S. M. Naqvi, J. A. Chambers, and C. Jutten, "Real-time independent vector analysis with Student's t source prior for convolutive speech mixtures," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Queensland, Australia, Apr. 2015, pp. 1856–1860.
[9] Y. Liang, J. Harris, S. M. Naqvi, G. Chen, and J. A. Chambers, "Independent vector analysis with a generalized multivariate Gaussian source prior for frequency domain blind source separation," Signal Processing, vol. 105, pp. 175–184, Dec. 2014.
[10] J. A. Palmer, K. K. Delgado, and S. Makeig, "Probabilistic formulation of independent vector analysis using complex Gaussian scale mixtures," in International Conference on Independent Component Analysis and Signal Separation, Paraty, Brazil, Mar. 2009, pp. 90–97.
[11] J. Hao, I. Lee, T.-W. Lee, and T. J. Sejnowski, "Independent vector analysis for source separation using a mixture of Gaussians prior," Neural Computation, vol. 22, no. 6, pp. 1646–1673, Jun. 2010.
[12] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, "Deep clustering: discriminative embeddings for segmentation and separation," in IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, Mar. 2016.
[13] J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, "On end-to-end multi-channel time domain speech separation in reverberant environments," in IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2020.
[14] J.-F. Cardoso and B. Laheld, "Equivariant adaptive source separation," IEEE Transactions on Signal Processing, vol. 44, no. 12, pp. 3017–3030, Dec. 1996.
[15] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems, Boston, MA: MIT Press, 1996, pp. 752–763.
[16] P. Wang, J. Li, and H. Zhang, "Decoupled independent vector analysis algorithm for convolutive blind source separation without orthogonality constraint on the demixing matrices," Mathematical Problems in Engineering, vol. 2018, Nov. 2018.
[17] N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, Oct. 2011.
[18] Y. Tang and J. Li, "Normalized natural gradient in independent component analysis," Signal Processing, vol. 90, no. 9, pp. 2773–2777, Sept. 2010.
[19] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015.
[20] X.-L. Li, "Periodic sequences modulated filter banks," IEEE Signal Processing Letters, vol. 25, no. 4, pp. 576–580, Apr. 2018.
[21] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, Apr. 1979.
[22] X.-L. Li, "Preconditioned stochastic gradient descent," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1454–1466, May 2018.
[23] K. Matsuoka and S. Nakashima, "Minimal distortion principle for blind source separation," in Proceedings of the International Symposium on ICA and Blind Signal Separation, San Diego, CA, USA, Dec. 2001.
[24] X.-L. Li, T. Adali, and M. Anderson, "Joint blind source separation by generalized joint diagonalization of cumulant matrices,"