Measure Transformed Independent Component Analysis
Koby Todros [email protected]
Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
Alfred O. Hero [email protected]
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48105, U.S.A.
Abstract
In this paper we derive a new framework for independent component analysis (ICA), called measure-transformed ICA (MTICA), that is based on applying a structured transform to the probability distribution of the observation vector, i.e., a transformation of the probability measure defined on its observation space. By judicious choice of the transform we show that the separation matrix can be uniquely determined via diagonalization of several measure-transformed covariance matrices. In MTICA the separation matrix is estimated via approximate joint diagonalization of several empirical measure-transformed covariance matrices. Unlike kernel-based ICA techniques, where the transformation is applied repetitively to some affine mappings of the observation vector, in MTICA the transformation is applied only once, to the probability distribution of the observations. This results in performance advantages and reduced implementation complexity. Simulations demonstrate the advantages of the proposed approach as compared to other existing state-of-the-art methods for ICA.
Keywords:
Approximate joint diagonalization, blind source separation, independent component analysis, probability measure transform.
1. Introduction
Independent component analysis (ICA) is a technique for multivariate data analysis that aims at decomposing an observed random vector into a linear combination of mutually independent random variables (Comon, 1994; Hyvärinen et al., 2001). The observation vector is assumed to be generated by an unknown linear mixture of mutually independent latent variables, called sources, with unknown distributions. The coefficient matrix of the linear mixture is called the mixing matrix and is assumed to be invertible. Given a sequence of i.i.d. samples from the distribution of the observed vector, ICA aims to estimate the inverse of the mixing matrix, called the separation matrix, which is used for recovering the sources. Unlike principal component analysis, ICA can deal with a general mixing structure, which is not constrained to be orthogonal. The mutual independence assumption is plausible in a wide variety of fields, including telecommunications (Qin et al., 2009; Zhao et al., 2009), finance (Back and Weigend, 1997; Liu et al., 2009), and biomedical signal analysis (Makeig et al., 1996).
2. Independent component analysis: Review
Let X = [X_1, ..., X_p]^T denote a random vector whose observation space is 𝒳 ⊆ R^p. We define the measure space (𝒳, 𝒮_𝒳, P_X), where 𝒮_𝒳 is a σ-algebra over 𝒳 and P_X is the joint probability measure on 𝒮_𝒳. Let 𝒳_k denote the observation space of X_k. The marginal probability measure of P_X on 𝒮_{𝒳_k} is denoted by P_{X_k}, where 𝒮_{𝒳_k} is the σ-algebra over 𝒳_k. Let g(·) denote an integrable scalar function on 𝒳. The expectation of g(X) under P_X is defined as

\[
\mathrm{E}[g(\mathbf{X});P_{\mathbf{X}}] \triangleq \int_{\mathcal{X}} g(\mathbf{x})\,dP_{\mathbf{X}}(\mathbf{x}), \tag{1}
\]

where x ∈ 𝒳. The components of X are mutually independent under P_X if

\[
\mathrm{E}[g(X_j)h(X_k);P_{\mathbf{X}}] = \mathrm{E}[g(X_j);P_{X_j}]\,\mathrm{E}[h(X_k);P_{X_k}] \quad \forall j \neq k, \tag{2}
\]

for all integrable scalar functions g(·), h(·) on 𝒳. The components of X are mutually uncorrelated under P_X if

\[
\mathrm{E}[X_j X_k;P_{\mathbf{X}}] = \mathrm{E}[X_j;P_{X_j}]\,\mathrm{E}[X_k;P_{X_k}] \quad \forall j \neq k. \tag{3}
\]

The empirical distribution P̂_X, based on a sequence of samples X_n, n = 1, ..., N, is specified by

\[
\hat{P}_{\mathbf{X}}(A) = \frac{1}{N}\sum_{n=1}^{N}\delta_{\mathbf{X}_n}(A), \tag{4}
\]

where A ∈ 𝒮_𝒳 and δ_{X_n}(·) is the Dirac probability measure at X_n (Folland, 1984).

The instantaneous noiseless ICA model takes the following form:

\[
\mathbf{X} = \mathbf{A}\mathbf{S}, \tag{5}
\]

where X ∈ R^p, p ≥ 2, is an observed random vector, A ∈ R^{p×p} is an invertible unknown matrix, called the mixing matrix, and S ∈ R^p is a latent random vector comprised of mutually independent variables having finite second-order moments and unknown distributions. The components of S are also called sources. Under the model (5) it has been shown that the mixing matrix A can be uniquely identified, up to permutation and scaling of its columns, if and only if at most one of the sources is Gaussian (Kagan et al., 1973; Comon, 1994; Hyvärinen et al., 2001; Eriksson and Koivunen, 2004). Given a sequence of i.i.d. samples from P_X, ICA aims to estimate the separation matrix B = A^{-1} and thus recover the sources using the relation S = BX.

Many state-of-the-art ICA algorithms, such as JADE, FICA, EFICA, EIMAX, KGV and RADICAL, referenced in Section 1, apply whitening to the observed vector X. The whitened observation vector is represented as

\[
\mathbf{Z} \triangleq \mathbf{W}\mathbf{X} = \mathbf{U}\mathbf{S}, \tag{6}
\]

where W ∈ R^{p×p} is the whitening matrix and U ≜ WA. Assuming, without loss of generality, that the components of S have unit variances, one can easily verify that the matrix U is unitary, leading to a unitary mixing model. Let V ≜ U^T, where (·)^T denotes the transpose operator. ICA algorithms that use whitening implement an estimate of V using constrained optimization over the Stiefel manifold of unitary matrices (Edelman et al., 1999). The empirical separation matrix is then obtained using the relation B̂ = V̂Ŵ.
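As an illustration of the whitening step (6), the following minimal numpy sketch generates a toy instance of the model (5) and whitens it using the inverse symmetric square root of the sample covariance. The Laplace sources, the dimensions, and the square-root construction are illustrative assumptions, not choices made in this paper (any W with W Σ_X W^T = I is a valid whitening matrix).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of model (5): p = 3 mutually independent non-Gaussian sources.
p, N = 3, 1000
S = rng.laplace(size=(p, N)) / np.sqrt(2.0)   # zero mean, unit variance sources
A = rng.normal(size=(p, p))                   # unknown invertible mixing matrix
X = A @ S                                     # observations, one column per sample

# One choice of whitening matrix: the inverse symmetric square root of the
# sample covariance, so that Z = W X has identity sample covariance as in (6).
eigval, eigvec = np.linalg.eigh(np.cov(X))
W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
Z = W @ X
print(np.allclose(np.cov(Z), np.eye(p)))      # True: Z is whitened
```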
3. Measure transformed ICA
In this section the MTICA procedure is presented. First, a transform is defined that maps a probability measure P_X into a set of probability measures {Q_X^(u)} on 𝒮_𝒳 and that preserves mutual independence between the components of X under P_X. Second, we define the measure-transformed covariance and derive its strongly consistent estimate, which is also shown to be Fisher consistent (Cox and Hinkley, 1974). Robustness of the empirical measure-transformed covariance to outliers is studied by analyzing the boundedness of its influence function (Hampel, 1974). Finally, based on the mixing models (5), (6), the MTICA procedure is specified by performing approximate joint diagonalization of a set of empirical measure-transformed covariance matrices.

3.1 Probability measure transform

Definition 1. Given a probability measure P_X and a non-negative function u : R^p → R_+ satisfying

\[
u(\mathbf{x}) = \prod_{k=1}^{p} u_k(x_k), \qquad u_k : \mathbb{R} \to \mathbb{R}_+, \; k = 1,\ldots,p, \tag{7}
\]

and

\[
0 < \mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}] < \infty, \tag{8}
\]

a transform on P_X is defined via the following relation:

\[
Q^{(u)}_{\mathbf{X}}(A) \triangleq T_u[P_{\mathbf{X}}](A) = \int_{A} \varphi_u(\mathbf{x})\,dP_{\mathbf{X}}(\mathbf{x}), \tag{9}
\]

where A ∈ 𝒮_𝒳, x = [x_1, ..., x_p]^T ∈ 𝒳, and

\[
\varphi_u(\mathbf{x}) \triangleq \frac{u(\mathbf{x})}{\mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}]}. \tag{10}
\]

The function u(·), associated with the transform T_u[·], is called the MT-function.

In the following proposition, some properties of the measure transform (9) are given.
Proposition 1.
Let Q_X^(u) be defined by relation (9). Then:

(1) Q_X^(u) is a probability measure on 𝒮_𝒳.

(2) Q_X^(u) is absolutely continuous w.r.t. P_X, with Radon-Nikodym derivative (Folland, 1984) given by

\[
\frac{dQ^{(u)}_{\mathbf{X}}(\mathbf{x})}{dP_{\mathbf{X}}(\mathbf{x})} = \varphi_u(\mathbf{x}). \tag{11}
\]

(3) Assume that the MT-function u(·) is strictly positive. Then P_X is absolutely continuous w.r.t. Q_X^(u), with a strictly positive Radon-Nikodym derivative given by

\[
\frac{dP_{\mathbf{X}}(\mathbf{x})}{dQ^{(u)}_{\mathbf{X}}(\mathbf{x})} = \varphi^{-1}_u(\mathbf{x}) = \frac{u^{-1}(\mathbf{x})}{\mathrm{E}\big[u^{-1}(\mathbf{X});Q^{(u)}_{\mathbf{X}}\big]}. \tag{12}
\]

(4) If X_1, ..., X_p are mutually independent under P_X, then they are mutually independent under Q_X^(u).

[A proof is given in Appendix A]

The probability measure Q_X^(u) is said to be generated by the MT-function u(·). By modifying u(·), such that the conditions (7), (8) are satisfied, virtually any probability measure on 𝒮_𝒳 can be obtained.

3.2 The measure-transformed covariance

According to (1) and (11) the measure-transformed covariance of X under Q_X^(u) is given by

\[
\mathbf{\Sigma}^{(u)}_{\mathbf{X}} = \mathrm{E}\big[\mathbf{X}\mathbf{X}^T\varphi_u(\mathbf{X});P_{\mathbf{X}}\big] - \boldsymbol{\mu}^{(u)}_{\mathbf{X}}\boldsymbol{\mu}^{(u)T}_{\mathbf{X}}, \tag{13}
\]

where

\[
\boldsymbol{\mu}^{(u)}_{\mathbf{X}} \triangleq \mathrm{E}[\mathbf{X}\varphi_u(\mathbf{X});P_{\mathbf{X}}] \tag{14}
\]

is the measure-transformed expectation of X under Q_X^(u). Equation (13) implies that Σ_X^(u) is a weighted covariance matrix of X under P_X, with weighting function φ_u(·). Hence, Σ_X^(u) can be estimated using only samples from the distribution P_X. By modifying the MT-function u(·), such that the conditions (7), (8) are satisfied, the MT-covariance matrix under Q_X^(u) is modified. In particular, by choosing u(x) ≡ 1, we have Q_X^(u) = P_X, and the standard covariance matrix is obtained.

In the following proposition a strongly consistent estimate of the measure-transformed covariance is given that is based on i.i.d. samples from the probability distribution P_X.

Proposition 2.
Let X_n, n = 1, ..., N, denote a sequence of i.i.d. samples from the distribution P_X, and define the empirical covariance estimate

\[
\hat{\mathbf{\Sigma}}^{(u)}_{\mathbf{X}} \triangleq \sum_{n=1}^{N}\mathbf{X}_n\mathbf{X}^T_n\hat{\varphi}_u(\mathbf{X}_n) - \hat{\boldsymbol{\mu}}^{(u)}_{\mathbf{X}}\hat{\boldsymbol{\mu}}^{(u)T}_{\mathbf{X}}, \tag{15}
\]

where

\[
\hat{\boldsymbol{\mu}}^{(u)}_{\mathbf{X}} \triangleq \sum_{n=1}^{N}\mathbf{X}_n\hat{\varphi}_u(\mathbf{X}_n), \tag{16}
\]

and

\[
\hat{\varphi}_u(\mathbf{X}_n) \triangleq \frac{u(\mathbf{X}_n)}{\sum_{n'=1}^{N}u(\mathbf{X}_{n'})}. \tag{17}
\]

Assume

\[
\mathrm{E}\big[u^2(\mathbf{X});P_{\mathbf{X}}\big] < \infty \quad \textrm{and} \quad \mathrm{E}\big[X_k^4;P_{\mathbf{X}}\big] < \infty \;\; \forall k = 1,\ldots,p. \tag{18}
\]

Then Σ̂_X^(u) → Σ_X^(u) almost surely as N → ∞. [The proof is similar to the proof of Proposition 3 in (Todros and Hero, 2012) and therefore is omitted]

Note that for u(X) ≡ 1, (N/(N − 1)) Σ̂_X^(u) reduces to the standard unbiased estimator of the covariance matrix Σ_X. Also notice that Σ̂_X^(u) can be written as a statistical functional H_u[·] of the empirical probability measure P̂_X (4), i.e.,

\[
\hat{\mathbf{\Sigma}}^{(u)}_{\mathbf{X}} = \frac{\mathrm{E}\big[\mathbf{X}\mathbf{X}^T u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]}{\mathrm{E}\big[u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]} - \frac{\mathrm{E}\big[\mathbf{X}u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]\,\mathrm{E}\big[\mathbf{X}^T u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]}{\mathrm{E}^2\big[u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]} \triangleq H_u\big[\hat{P}_{\mathbf{X}}\big]. \tag{19}
\]

According to (10), (13), (14), and (19), when P̂_X is replaced by the true probability measure P_X we have H_u[P_X] = Σ_X^(u), which implies that Σ̂_X^(u) is Fisher consistent (Cox and Hinkley, 1974).

3.3 Robustness of the empirical MT-covariance to outliers

Here, we study the robustness of the empirical MT-covariance Σ̂_X^(u) to outliers using its influence function. Define the probability measure P_ε ≜ (1 − ε)P_X + ε δ_y, where 0 ≤ ε ≤ 1, y ∈ R^p, and δ_y is a Dirac probability measure at y. The influence function of a Fisher consistent estimator with statistical functional H[·] at probability distribution P_X is defined pointwise as (Hampel, 1974):

\[
\mathrm{IF}_{H,P_{\mathbf{X}}}(\mathbf{y}) \triangleq \lim_{\epsilon\to 0}\frac{H[P_{\epsilon}] - H[P_{\mathbf{X}}]}{\epsilon} = \frac{\partial H[P_{\epsilon}]}{\partial\epsilon}\bigg|_{\epsilon=0}. \tag{20}
\]

The influence function describes the effect on the estimator of an infinitesimal contamination at the point y. An estimator is said to be B-robust if its influence function is bounded. Using (10), (13), (14), (19) and (20) one can verify that the influence function of the empirical MT-covariance is given by

\[
\mathrm{IF}_{H_u,P_{\mathbf{X}}}(\mathbf{y}) = \frac{u(\mathbf{y})}{\mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}]}\Big(\big(\mathbf{y}-\boldsymbol{\mu}^{(u)}_{\mathbf{X}}\big)\big(\mathbf{y}-\boldsymbol{\mu}^{(u)}_{\mathbf{X}}\big)^T - \mathbf{\Sigma}^{(u)}_{\mathbf{X}}\Big). \tag{21}
\]

The following proposition states a sufficient condition for boundedness of (21).

Proposition 3.
The influence function (21) is bounded if the MT-function u(y) is bounded and there exists a constant c > 0 such that u(y) ≤ c‖y‖^{-2}, where ‖·‖ denotes the l_2-norm. [A proof is given in Appendix B]

In Section 5 we show that MT-functions chosen from the Gaussian family of functions satisfy these conditions, resulting in a measure-transformed ICA algorithm that is resilient to outliers.
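The estimator (15)-(17) reduces to a few lines of code once the weights are normalized. The following sketch is ours, not the authors' implementation; the function names are illustrative, and the u(x) ≡ 1 check mirrors the remark after Proposition 2.

```python
import numpy as np

def mt_covariance(X, u):
    """Empirical MT-covariance, eqs. (15)-(17).

    X : (N, p) array of i.i.d. samples from P_X.
    u : MT-function mapping an (N, p) array to N non-negative weights.
    """
    w = u(X)
    phi = w / w.sum()                  # normalized empirical weights, eq. (17)
    mu = phi @ X                       # empirical MT-mean, eq. (16)
    Xc = X - mu
    return (Xc * phi[:, None]).T @ Xc  # eq. (15), written in centered form

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))

# u(x) = 1 recovers the (biased) sample covariance, as noted above:
S1 = mt_covariance(X, lambda x: np.ones(len(x)))
print(np.allclose(S1, np.cov(X.T, bias=True)))   # True

# A bounded, factorizable MT-function (Gaussian, as used later in Section 5)
# yields instead a locally weighted covariance around the point t:
t, tau = np.zeros(2), 1.0
S2 = mt_covariance(X, lambda x: np.exp(-np.sum((x - t) ** 2, axis=1) / (2 * tau ** 2)))
```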
3.4 The MTICA procedure

In MTICA we choose a sequence of MT-functions u_m(·), m = 1, ..., M, that satisfies at least one of the following conditions:

(1) Under the ICA model (5) the separation matrix B is the unique matrix (up to permutation and scaling of its rows) that jointly diagonalizes the MT-covariance matrices Σ_X^(u_m), m = 1, ..., M.

(2) Under the unitary mixing model (6) the matrix V = U^T is the unique matrix (up to permutation and sign of its rows) that jointly diagonalizes the MT-covariance matrices Σ_Z^(u_m), m = 1, ..., M.

When the first condition is satisfied, the separation matrix B is estimated via NOAJD of the empirical MT-covariances Σ̂_X^(u_m), m = 1, ..., M. The NOAJD (Pham, 2001; Yeredor, 2002; Ziehe et al., 2004; Vollgraf and Obermayer, 2006; Fadaili et al., 2007; Li and Zhang, 2007; Todros and Tabrikian, 2010) seeks a non-singular matrix B̂ ∈ R^{p×p} such that B̂ Σ̂_X^(u_m) B̂^T, m = 1, ..., M, are "as diagonal as possible" in the sense that a deviation measure from diagonality is minimized. The MTICA procedure in this case is summarized in Algorithm 1.

Algorithm 1 MTICA with no whitening
Input:
A sequence of data samples X_n, n = 1, ..., N.

1. Choose a sequence of MT-functions u_m(·), m = 1, ..., M, such that B is the unique joint diagonalization matrix of Σ_X^(u_m), m = 1, ..., M.
2. Using (15)-(17), derive the empirical MT-covariances Σ̂_X^(u_m), m = 1, ..., M.
3. Find the NOAJD matrix B̂ of Σ̂_X^(u_m), m = 1, ..., M.

Output:
The empirical separation matrix B̂.

Alternatively, when the second condition is satisfied, the observations are whitened, and the estimate of V is obtained via OAJD of the empirical MT-covariance matrices Σ̂_Ẑ^(u_m), m = 1, ..., M, where Ẑ ≜ ŴX and Ŵ is the empirical whitening matrix. The OAJD (Flury and Gautschi, 1986; Cardoso and Souloumiac, 1996) seeks a unitary matrix V̂ ∈ R^{p×p} such that V̂ Σ̂_Ẑ^(u_m) V̂^T, m = 1, ..., M, are "as diagonal as possible" by, once again, minimizing a deviation measure from diagonality. The empirical separation matrix is obtained by taking B̂ = V̂Ŵ. The MTICA procedure in this case is summarized in Algorithm 2.
MTICA with whitening
Input:
A sequence of data samples X_n, n = 1, ..., N.

1. Choose a sequence of MT-functions u_m(·), m = 1, ..., M, such that V is the unique joint diagonalization matrix of Σ_Z^(u_m), m = 1, ..., M.
2. Estimate the whitening matrix Ŵ.
3. Generate the sequence Ẑ_n = ŴX_n, n = 1, ..., N.
4. Using (15)-(17), derive the empirical MT-covariances Σ̂_Ẑ^(u_m), m = 1, ..., M.
5. Find the OAJD matrix V̂ of Σ̂_Ẑ^(u_m), m = 1, ..., M.

Output:
Obtain an estimate of B by taking B̂ = V̂Ŵ.

By modifying the MT-functions such that the stated conditions are satisfied, a family of measure-transformed independent component analyses can be obtained. Particular choices of MT-functions, leading to the exponential and Gaussian MTICA algorithms, are discussed in the next sections.
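For readers who want to see the OAJD step of Algorithm 2 in code, here is a compact Jacobi-rotation sketch for a set of real symmetric matrices. It is a textbook variant of the simultaneous-diagonalization idea behind the FG and Cardoso-Souloumiac algorithms cited above, not a transcription of them; the closed-form angle below minimizes the summed squared (i, j) off-diagonal entries over all input matrices.

```python
import numpy as np

def oajd(mats, sweeps=100, tol=1e-12):
    """Orthogonal approximate joint diagonalization by Jacobi rotations.

    mats : sequence of M real symmetric p x p matrices.
    Returns V_hat such that V_hat @ C @ V_hat.T is nearly diagonal for each C.
    """
    C = np.stack([np.asarray(m, dtype=float) for m in mats])  # (M, p, p)
    p = C.shape[1]
    V = np.eye(p)
    for _ in range(sweeps):
        converged = True
        for i in range(p - 1):
            for j in range(i + 1, p):
                u = C[:, i, i] - C[:, j, j]
                v = 2.0 * C[:, i, j]
                # Givens angle minimizing sum_m of the squared (i, j) entries.
                theta = 0.25 * np.arctan2(2.0 * np.dot(u, v),
                                          np.dot(u, u) - np.dot(v, v))
                if abs(np.sin(theta)) > tol:
                    converged = False
                    R = np.eye(p)
                    c, s = np.cos(theta), np.sin(theta)
                    R[i, i] = R[j, j] = c
                    R[i, j], R[j, i] = -s, s
                    C = R.T @ C @ R        # broadcast over all M matrices
                    V = V @ R
        if converged:
            break
    return V.T

```

With the empirical MT-covariances Σ̂_Ẑ^(u_m) as input, the returned matrix plays the role of V̂ in Algorithm 2.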
4. Exponential-MTICA
In this section we parameterize the MT-function u(·; t), with scaling parameter t ∈ R^p, under the exponential family of functions. Under this choice of MT-function the MT-covariance is given by the Hessian of the cumulant generating function, resulting in the ICA algorithm proposed in (Yeredor, 2000).

Let u_E(·; ·) be defined as the parameterized function

\[
u_{\mathrm{E}}(\mathbf{x};\mathbf{t}) \triangleq \exp\big(\mathbf{t}^T\mathbf{x}\big), \tag{22}
\]

where t ∈ R^p. Using (10), (13) and (22) one can easily verify that the covariance matrix of X under Q_X^(u_E) takes the form

\[
\mathbf{\Sigma}^{(u_{\mathrm{E}})}_{\mathbf{X}}(\mathbf{t}) = \frac{\partial^2\log M_{\mathbf{X}}(\mathbf{t})}{\partial\mathbf{t}\,\partial\mathbf{t}^T}, \tag{23}
\]

where

\[
M_{\mathbf{X}}(\mathbf{t}) \triangleq \mathrm{E}\big[\exp\big(\mathbf{t}^T\mathbf{X}\big);P_{\mathbf{X}}\big] \tag{24}
\]

is the moment generating function of X, and it is assumed that M_X(t) is finite in some open region in R^p containing the origin. Note that the covariance matrix in (23) involves higher-order statistics of X. Additionally, observe that Σ_X^(u_E)(t) reduces to the standard covariance matrix Σ_X for t = 0.

In the following lemma, directly following from (10)-(12) and the definition of the exponential MT-function (22), one sees that the Gaussian family of probability measures is closed under the measure transformation (9) when generated by the exponential MT-function.
A random vector X is Gaussian under the probability measure P_X if and only if it remains Gaussian under the transformed probability measure Q_X^(u_E) generated by the exponential MT-function u_E(·; ·).

This property is used in proving the following theorem, which states a necessary and sufficient condition for Gaussianity of a random variable X based on its exponential MT-variance.

Theorem 1.
A random variable X with corresponding probability measure P_X is Gaussian if and only if the exponential MT-variance satisfies

\[
\sigma^{(u_{\mathrm{E}})}_{X}(t) = c \quad \forall t \in (t_0-\epsilon,\, t_0+\epsilon), \tag{25}
\]

where c and ε are some positive constants and t_0 is an arbitrary point in R. [A proof is given in Appendix C]

Hence, if a random variable X is non-Gaussian then its exponential MT-variance σ_X^(u_E)(t) cannot be constant over any open interval. This property is used below to establish identifiability of the mixing matrix A.

Using (5), (10), (13) and (22) it can be shown that for any choice of the scaling parameter t the exponential MT-covariance of the observation vector X has the following structure:

\[
\mathbf{\Sigma}^{(u_{\mathrm{E}})}_{\mathbf{X}}(\mathbf{t}) = \mathbf{A}\,\mathbf{\Sigma}^{(u_{\mathrm{E}})}_{\mathbf{S}}\big(\mathbf{A}^T\mathbf{t}\big)\,\mathbf{A}^T, \tag{26}
\]

where Σ_S^(u_E)(·) is the covariance matrix of the latent vector S under the transformed probability measure Q_S^(u_E). Since the components of S are mutually independent under P_S, by Property 4 in Proposition 1 they are mutually independent under Q_S^(u_E), and therefore Σ_S^(u_E)(·) must be diagonal. Thus, the following property follows directly from (26):

Proposition 4. Let t_1 and t_2, t_1 ≠ t_2, denote two arbitrary points in R^p. Assume that

(1) the matrices Σ_S^(u_E)(A^T t_1) and Σ_S^(u_E)(A^T t_2) have finite diagonal entries,

(2) the diagonal entries of Σ_S^(u_E)(A^T t_2) are non-zero, and

(3) the matrix Λ_S^(u_E)(A^T t_1, A^T t_2) ≜ Σ_S^(u_E)(A^T t_1) Σ_S^(u_E)-1(A^T t_2) has distinct diagonal entries, i.e., no pair of diagonal entries have the same value.

Then A can be uniquely identified, up to scaling and permutation of its columns, by solving the following non-symmetric eigenvalue decomposition problem:

\[
\mathbf{\Sigma}^{(u_{\mathrm{E}})}_{\mathbf{X}}(\mathbf{t}_1)\,\mathbf{\Sigma}^{(u_{\mathrm{E}})-1}_{\mathbf{X}}(\mathbf{t}_2)\,\mathbf{A} = \mathbf{A}\,\mathbf{\Lambda}^{(u_{\mathrm{E}})}_{\mathbf{S}}\big(\mathbf{A}^T\mathbf{t}_1,\mathbf{A}^T\mathbf{t}_2\big). \tag{27}
\]

[A proof is given in (Yeredor, 2000)]

As a result of property (25) of the exponential MT-variance, the following theorem shows that Assumption 3 in Proposition 4 is satisfied almost everywhere (a.e.) if at most one of the components of S is Gaussian.

Theorem 2.
If at most one of the sources is Gaussian, then for t_1 ≠ t_2 the matrix Λ_S^(u_E)(A^T t_1, A^T t_2) has distinct diagonal entries a.e. [A proof is given in Appendix D]

According to (26), Proposition 4, and Theorem 2, the separation matrix B = A^{-1} is the unique matrix that jointly diagonalizes two exponential MT-covariance matrices Σ_X^(u_E)(t_1) and Σ_X^(u_E)(t_2) that satisfy the stated assumptions. The exponential-MTICA algorithm (Yeredor, 2000) is obtained by replacing the MT-functions u_m(·), m = 1, ..., M, in Algorithm 1 with a sequence of exponential MT-functions u_E(·; t_m), m = 1, ..., M. A procedure for choosing the test-points t_m ∈ R^p, m = 1, ..., M, is given in Appendix G.1. Clearly, only two test-points are needed for obtaining a viable estimate of B. However, in order to increase statistical stability and reduce the effect of ill-conditioned empirical MT-covariance matrices it is better to use a sequence of more than two test-points.
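The two-test-point identification result (27) can be exercised directly, without joint diagonalization. The sketch below estimates two empirical exponential MT-covariances via (15)-(17) and solves the non-symmetric eigenvalue problem; the source distributions, sample size, and test-points are arbitrary illustrative assumptions, and this is a toy check rather than the weighted-NOAJD implementation evaluated in Section 8.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy ICA model (5) with two independent non-Gaussian sources.
p, N = 2, 20000
S = rng.exponential(size=(p, N)) - 1.0          # zero mean, unit variance
A = np.array([[1.0, 0.6], [-0.4, 1.0]])         # mixing matrix
X = (A @ S).T                                   # (N, p) sample matrix

def exp_mt_cov(X, t):
    """Empirical exponential MT-covariance, eqs. (15)-(17), u_E(x; t) = exp(t'x)."""
    w = np.exp(X @ t)
    phi = w / w.sum()
    mu = phi @ X
    Xc = X - mu
    return (Xc * phi[:, None]).T @ Xc

t1, t2 = np.array([0.3, -0.2]), np.array([-0.1, 0.25])
# Eq. (27): the columns of A are the eigenvectors of Sigma(t1) Sigma(t2)^{-1}.
M = exp_mt_cov(X, t1) @ np.linalg.inv(exp_mt_cov(X, t2))
_, A_hat = np.linalg.eig(M)
B_hat = np.linalg.inv(np.real(A_hat))           # separation matrix estimate
print(B_hat @ A)                                # ~ scaled permutation matrix
```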
5. Gaussian-MTICA
In this section we parameterize the MT-function u(·; t, τ), with translation parameter t ∈ R^p and width parameter τ ∈ R_+^*, using a Gaussian family of functions. Under the unitary mixing model (6) we show that if at most one of the sources is Gaussian, the mixing matrix U can be uniquely identified via eigenvalue decomposition of a single Gaussian MT-covariance matrix. Based on this result the Gaussian-MTICA algorithm is obtained, which applies OAJD to a sequence of empirical Gaussian MT-covariance matrices.

We define the Gaussian MT-function u_G(·; ·, ·) as

\[
u_{\mathrm{G}}(\mathbf{x};\mathbf{t},\tau) \triangleq \exp\bigg(-\frac{\|\mathbf{x}-\mathbf{t}\|^2}{2\tau^2}\bigg), \tag{28}
\]

where t ∈ R^p and τ ∈ R_+^*. Since u_G(·; ·, ·) is strictly positive and bounded, one can easily verify that the condition (8) is always satisfied. Relations (10) and (13) imply that the MT-function (28) produces a weighted covariance matrix, Σ_X^(u_G)(t, τ), for which the observations are weighted in inverse proportion to the distance ‖x − t‖. This results in a kind of local covariance analysis of X in the vicinity of the test-point t. Notice that the Gaussian MT-function (28) satisfies the conditions in Proposition 3, and therefore the influence function of the empirical Gaussian MT-covariance is bounded. Hence, unlike the empirical exponential MT-covariance, whose influence function is unbounded, the empirical Gaussian MT-covariance is robust to outlying observations.

Similarly to the exponential measure transformation, the following lemma, directly following from (10)-(12) and the definition of the Gaussian MT-function (28), states that the Gaussian family of probability measures is closed under the measure transformation (9) generated by the Gaussian MT-function.
A random vector X is Gaussian under the probability measure P_X if and only if it remains Gaussian under the transformed probability measure Q_X^(u_G) generated by the Gaussian MT-function u_G(·; ·, ·).

This property is used in proving the following theorem, which states a necessary and sufficient condition for Gaussianity of a random variable X based on its Gaussian MT-variance.

Theorem 3.
A random variable X with corresponding probability measure P_X is Gaussian if and only if the Gaussian MT-variance satisfies

\[
\sigma^{(u_{\mathrm{G}})}_{X}(t,\tau) = c \quad \forall t \in (t_0-\epsilon,\, t_0+\epsilon), \tag{29}
\]

where c and ε are some positive constants and t_0 is some arbitrary point in R. [A proof is given in Appendix E]

Hence, similarly to the exponential MT-variance, if a random variable X is non-Gaussian then for any choice of the width parameter τ ∈ R_+^* the Gaussian MT-variance σ_X^(u_G)(t, τ) cannot be constant w.r.t. t over any open interval. This property is used below for proving identifiability of the mixing matrix U.

According to (6), (10), (13) and (28), the MT-covariance of the whitened observation vector Z under Q_Z^(u_G) has the following structure:

\[
\mathbf{\Sigma}^{(u_{\mathrm{G}})}_{\mathbf{Z}}(\mathbf{t},\tau) = \mathbf{U}\,\mathbf{\Sigma}^{(u_{\mathrm{G}})}_{\mathbf{S}}\big(\mathbf{U}^T\mathbf{t},\tau\big)\,\mathbf{U}^T, \tag{30}
\]

where Σ_S^(u_G)(·, ·) is the covariance matrix of S under the transformed probability measure Q_S^(u_G). Since the components of S are mutually independent under P_S, then by Property 4 in Proposition 1 they are mutually independent under Q_S^(u_G), and thus Σ_S^(u_G)(·, ·) must be diagonal. Therefore, assuming that Σ_S^(u_G)(U^T t, τ) has distinct finite diagonal entries, the unitary matrix U can be uniquely identified (up to permutation and sign of its columns) via eigenvalue decomposition of the Gaussian MT-covariance Σ_Z^(u_G)(t, τ).

Based on property (29) of the Gaussian MT-variance, the following theorem states that if at most one of the components of S is Gaussian, then Σ_S^(u_G)(U^T t, τ) has distinct diagonal entries for almost every t ∈ R^p.

Theorem 4.
If at most one of the sources is Gaussian, then the matrix Σ_S^(u_G)(U^T t, τ) has distinct diagonal entries a.e. [A proof is given in Appendix F]

According to (30) and Theorem 4, if at most one of the sources is Gaussian, then for almost every t ∈ R^p the matrix V = U^T is the unique diagonalizing matrix of the Gaussian MT-covariance Σ_Z^(u_G)(t, τ). Thus, the Gaussian-MTICA algorithm is implemented by replacing the MT-functions u_m(·), m = 1, ..., M, in Algorithm 2 with Gaussian MT-functions u_G(·; t_m, τ), m = 1, ..., M, where the width parameter τ ∈ R_+^* is fixed. A procedure for choosing the test-points t_m ∈ R^p, m = 1, ..., M, is given in Appendix G.2. Clearly, only one test-point is needed for estimating V. However, estimation of V based on diagonalization of a single empirical Gaussian MT-covariance has the following drawbacks: (1) for some choices of the translation parameter t the eigen-spectrum of the corresponding Gaussian MT-covariance may be degenerate, i.e., the eigenvalues may not be well separated; (2) a single Gaussian MT-covariance may capture only part of the statistical information about Z necessary to separate the sources effectively. In order to alleviate these drawbacks it may be better to use more than a single test-point.
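The single-test-point identification claim above can be checked numerically: under the unitary mixing model (6), the eigenvectors of one empirical Gaussian MT-covariance recover U up to sign and order. The sketch below assumes exactly unitary mixing of unit-variance sources; the rotation angle, test-point, and width are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Unitary mixing model (6): whitened mixture of two non-Gaussian sources.
p, N = 2, 20000
S = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), N),   # unit-variance uniform
               rng.laplace(0, 1 / np.sqrt(2), N)])        # unit-variance Laplace
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])           # unitary mixing matrix
Z = (U @ S).T                                             # (N, p) whitened samples

# Empirical Gaussian MT-covariance, eqs. (15)-(17) with u_G from (28).
t, tau = np.array([0.8, -0.5]), 1.0
w = np.exp(-np.sum((Z - t) ** 2, axis=1) / (2 * tau ** 2))
phi = w / w.sum()
mu = phi @ Z
Zc = Z - mu
Sigma_G = (Zc * phi[:, None]).T @ Zc

# By (30), the eigenvectors of Sigma_G are the columns of U (up to sign/order).
_, U_hat = np.linalg.eigh(Sigma_G)
print(U_hat.T @ U)   # approximately a signed permutation matrix
```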
6. Comparisons between exponential and Gaussian MTICA
Unlike Gaussian-MTICA, which requires whitening that under the model (5) leads to unitary mixing, exponential-MTICA does not require whitening. Therefore, as illustrated in Subsection 8.3, exponential-MTICA is more robust to cases where the whitened observations are poorly modeled by unitary mixing. Moreover, in Gaussian-MTICA one has to set a width parameter τ that is not required for exponential-MTICA.

On the other hand, unlike the exponential MT-function, the Gaussian MT-function is bounded and isotropically de-emphasizes samples distant from its location parameter. This property leads to the following advantages of Gaussian-MTICA over exponential-MTICA: (1) as illustrated in Subsections 8.1 and 8.2, Gaussian-MTICA is more robust to heavy-tailed distributions and outliers than exponential-MTICA; (2) unlike the exponential MT-covariance, which does not exist for distributions with infinite moment generating function, the Gaussian MT-covariance takes finite values regardless of the underlying probability distribution. Additionally, the Gaussian MT-function has the physical property that it localizes linear dependence over the observation space. Hence, Gaussian-MTICA operates by jointly minimizing the local linear dependencies in the vicinities of the selected set of test-points.

7. Computational complexity

Exponential-MTICA has two major steps: (1) estimation of M exponential MT-covariance matrices, with computational complexity of O(M·N·p²) flops; and (2) NOAJD. The computational complexity of unweighted NOAJD algorithms, such as Pham's (Pham, 2001), FFDIAG (Ziehe et al., 2004), QDIAG (Vollgraf and Obermayer, 2006), and U-WEDGE (Tichavský and Yeredor, 2009), is O(L·M·p³) flops, where L is the number of iterations. Hence, exponential-MTICA with unweighted NOAJD has computational complexity of O(M·N·p² + L·M·p³) flops. When weighted NOAJD is applied using the WEDGE algorithm (Tichavský and Yeredor, 2009) with the weighting policy proposed in (Slapak and Yeredor, 2011), one has to calculate the weights, with computational complexity of O(M·N·p⁴ + M²·p⁴) flops, and to apply weighted NOAJD, requiring O(L·M·p³) flops. Therefore, exponential-MTICA with weighted NOAJD requires O(M·N·p² + (M² + M·N)·p⁴ + L·M·p³) flops.

The Gaussian-MTICA algorithm has three steps: (1) a whitening stage with computational complexity of O(N·p²) flops, (2) estimation of M Gaussian MT-covariance matrices with computational complexity of O(M·N·p²) flops, and (3) OAJD with computational complexity of O(L·M·p³) flops (Flury and Gautschi, 1986; Cardoso and Souloumiac, 1996). Thus, Gaussian-MTICA requires O(M·N·p² + L·M·p³) flops.

Table 1 compares the computational complexity of exponential-MTICA and Gaussian-MTICA to that of other ICA techniques, such as JADE, EIMAX, FICA, EFICA, KGV, and RADICAL. Notice that, similarly to JADE, FICA, EFICA, and EIMAX, the computational complexities of exponential-MTICA and Gaussian-MTICA are linear in the sample size N, which makes them favorable for large data sets. Moreover, one sees that unlike data-based techniques such as EIMAX, FICA, EFICA, KGV, and RADICAL, the iterative part of exponential-MTICA and Gaussian-MTICA has computational complexity that is not affected by the sample size.
8. Numerical examples
In this section the performance of exponential-MTICA and Gaussian-MTICA is compared to the JADE, EIMAX, FICA, EFICA, KGV, and RADICAL algorithms using their publicly available MATLAB code. The JADE, FICA, EFICA, EIMAX, and RADICAL algorithms were used with their default settings. In KGV the Gaussian kernel width parameter was set to σ = 1. All compared algorithms were initialized by the identity separation matrix.

The test-points t_1, ..., t_M in the exponential and Gaussian MTICA algorithms were selected according to the procedures in Appendices G.1 and G.2, respectively. In all simulation examples M = 30 test-points were used. The width parameter of the Gaussian MT-function in the Gaussian-MTICA algorithm was set to τ = 1.

The exponential-MTICA was implemented with the WEDGE algorithm (Tichavský and Yeredor, 2009) to perform weighted non-orthogonal joint diagonalization, as proposed in (Slapak and Yeredor, 2011). The Gaussian-MTICA was implemented with the FG algorithm (Flury and Gautschi, 1986) to perform orthogonal joint diagonalization. In all considered approximate joint diagonalization algorithms, the initial diagonalizing matrix, the maximum number of iterations, and the convergence threshold were set to the identity matrix, 500, and 1e-10, respectively. In all figure legends below, the exponential and Gaussian MTICA algorithms are abbreviated as EMTICA and GMTICA, respectively.

Table 1: Computational complexity of EMTICA, GMTICA, JADE, EIMAX, FICA, EFICA, KGV, and RADICAL. The sample size, dimension, number of iterations, and number of matrices to be approximately diagonalized are denoted by N, p, L, and M, respectively. The rank of an N × N Gram matrix after incomplete Cholesky decomposition in the KGV is denoted by D(N). The numbers of Jacobi angles and data augmentations in RADICAL are denoted by K and R, respectively. Here EMTICA and GMTICA refer to exponential-MTICA and Gaussian-MTICA, respectively. NOAJD stands for non-orthogonal approximate joint diagonalization.

Algorithm | Computational complexity
EMTICA (unweighted NOAJD) | O(M·N·p² + L·M·p³)
EMTICA (weighted NOAJD) | O(M·N·p² + (M² + M·N)·p⁴ + L·M·p³)
GMTICA | O(M·N·p² + L·M·p³)
JADE | O(M·N·p² + L·M·p³)
EIMAX | O(L·N·p²)
FICA | O(L·N·p²)
EFICA | O(L·N·p²)
KGV | O(L·(N·D²(N)·p² + D³(N)·p³))
RADICAL | O(L·K·N·R·log(N·R)·p²)

We used the Amari error (Amari et al., 1996) to measure the deviation of the true separation matrix B from its estimate B̂. The Amari error between two matrices G ∈ R^{p×p} and H ∈ R^{p×p} is defined as

\[
d_A(\mathbf{G},\mathbf{H}) = \frac{1}{2p(p-1)}\sum_{i=1}^{p}\Bigg(\frac{\sum_{j=1}^{p}|\Psi_{i,j}|}{\max_j |\Psi_{i,j}|} - 1\Bigg) + \frac{1}{2p(p-1)}\sum_{j=1}^{p}\Bigg(\frac{\sum_{i=1}^{p}|\Psi_{i,j}|}{\max_i |\Psi_{i,j}|} - 1\Bigg), \tag{31}
\]

where Ψ_{i,j} = [GH^{-1}]_{i,j}. The Amari error is invariant to permutation and scaling of the rows of G and H, and takes values between 0 and 1. Another property is that d_A(G, H) = 0 if and only if G and H are equal up to scaling and permutation of their rows. In addition to the Amari error, some of the trials compared the algorithm run times.

The simulations were carried out using data obtained from the univariate source distributions in Table 2. The sources were translated and scaled to have zero mean and unit variance. In order to avoid ill-conditioned mixing, the generated sources were mixed using randomly generated matrices having condition number between one and two. In all experiments we studied 5-dimensional ICA problems with N = 1000 samples.
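As a convenience for reproducing the error metric, the following is a direct transcription of (31); the invariance check at the end corrupts G by row scaling and permutation, which by construction leaves the error at zero up to floating point. The function name is ours.

```python
import numpy as np

def amari_error(G, H):
    """Amari error (31) between two invertible p x p matrices."""
    Psi = np.abs(G @ np.linalg.inv(H))
    p = Psi.shape[0]
    rows = (Psi.sum(axis=1) / Psi.max(axis=1) - 1.0).sum()
    cols = (Psi.sum(axis=0) / Psi.max(axis=0) - 1.0).sum()
    return (rows + cols) / (2.0 * p * (p - 1))

rng = np.random.default_rng(4)
G = rng.normal(size=(5, 5))
H = np.diag([2.0, -1.0, 0.5, 3.0, 1.5]) @ G[[2, 0, 4, 1, 3], :]  # scale + permute rows
print(amari_error(G, H))   # ~0: invariant to row scaling and permutation
```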
Table 2: Probability distributions used in the simulation examples.

Distribution | Parameters
Uniform | Support [0, 1].
Arcsine | Support [0, 1].
Laplace | Location parameter µ = 0 and scale parameter σ = 1.
Student t-distribution | Degrees of freedom κ = 3.
Beta | Shape parameters α = 2 and β = 2.
Exponential | Rate parameter λ = 1.
Rayleigh | Scale parameter σ = 1.
Gamma | Shape parameter α = 1 and scale parameter σ = 1.
Central chi-squared | Degrees of freedom κ = 4.
Rice | Shape parameter α = 1/…

8.1 Sensitivity to source distribution

In this experiment we studied two types of ICA applications. In the first application, the source distributions are identical. For each of the 10 source distributions in Table 2, we conducted 1000 Monte-Carlo simulations. For each distribution type, box plots of the Amari errors obtained by each algorithm are depicted in Fig. 1. One sees that exponential-MTICA and Gaussian-MTICA are more robust to the source distribution than JADE, EIMAX, FICA and EFICA, with performance similar to the KGV and RADICAL algorithms. One can also observe that exponential-MTICA is more sensitive to heavy-tailed distributions, such as Laplace, exponential, and Student-t, than Gaussian-MTICA.

In the second application, the sources were randomly chosen among the 10 possibilities. A total of 1000 Monte-Carlo simulations were performed. The box plots of the Amari errors obtained by each algorithm are depicted in Fig. 2. Notice that, similarly to the KGV and RADICAL, the exponential-MTICA and Gaussian-MTICA perform better than the JADE, FICA, EFICA and EIMAX algorithms. The average run time of each algorithm is given in Table 3. The run times of exponential-MTICA and Gaussian-MTICA are significantly lower than those obtained by KGV and RADICAL. This is due to lower computational complexity, as indicated by Table 1, and more rapid convergence.
8.2 Robustness to outliers

In this experiment we demonstrate the robustness of the compared algorithms to outliers. We simulated outliers by randomly corrupting up to 25 data points out of the 1000 samples. This was carried out by adding the value +5 or −5, chosen with probability 1/2, to a single component in each of the selected data points. We performed 3000 Monte-Carlo simulations using source distributions chosen uniformly at random from the 10 possible distributions in Table 2. The averaged Amari errors produced by each algorithm are depicted in Fig. 3. One can observe that, as expected, the proposed Gaussian-MTICA method is less sensitive to outliers than the exponential-MTICA. This is due to the boundedness of the Gaussian MT-function, allowing it to de-emphasize outlying samples distant from its location parameter. The RADICAL algorithm exhibits the least sensitivity to outliers. However, this comes at the expense of increased computational complexity, as indicated by Table 1 and the run time analysis in Table 3.
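The corruption scheme just described amounts to a few lines of numpy. The data array below is a placeholder for the mixed observations; the variable names and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(6)

# Corrupt n_out of the N samples: add +5 or -5 (equiprobable) to one
# randomly chosen component of each selected sample, as described above.
N, p, n_out = 1000, 5, 25
X = rng.normal(size=(N, p))                      # stand-in for the mixed data
idx = rng.choice(N, size=n_out, replace=False)   # samples to corrupt
comp = rng.integers(0, p, size=n_out)            # one component per sample
X[idx, comp] += rng.choice([-5.0, 5.0], size=n_out)
```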
[Figure 1 comprises ten box-plot panels, one per source distribution: Uniform, Arcsine, Laplace, Beta, Student, Exponential, Rayleigh, Gamma, Chi-squared, and Rice. In each panel the horizontal axis lists the compared algorithms (JADE, EIMAX, FICA, EFICA, KGV, RADICAL, EMTICA, GMTICA) and the vertical axis shows the Amari error.]
Figure 1: Sensitivity to source distribution. Box plots of Amari errors obtained by the compared algorithms for five-component ICA with identical source distributions. Notice that exponential-MTICA and Gaussian-MTICA are robust to the source distribution, with performance similar to the KGV and RADICAL algorithms. Although the exponential-MTICA, Gaussian-MTICA, KGV and RADICAL algorithms perform similarly well, the exponential-MTICA and Gaussian-MTICA have reduced computational complexity, as indicated by Table 1. Also notice that Gaussian-MTICA is less sensitive to heavy-tailed distributions than exponential-MTICA.
[Figure 2 is a single box-plot panel; the horizontal axis lists the compared algorithms (JADE, EIMAX, FICA, EFICA, KGV, RADICAL, EMTICA, GMTICA) and the vertical axis shows the Amari error.]

Figure 2: Sensitivity to source distribution. Box plots of Amari errors obtained by the compared algorithms for five-component ICA with randomly chosen distributions. Similarly to the KGV and RADICAL, the exponential-MTICA and Gaussian-MTICA perform better than the JADE, FICA, EFICA, and EIMAX algorithms. Although the exponential-MTICA, Gaussian-MTICA, KGV, and RADICAL algorithms perform similarly well, the exponential-MTICA and Gaussian-MTICA have reduced computational complexity, as indicated by Table 1 and the run time analysis in Table 3.

Table 3: Sensitivity to source distribution. Average run times in seconds for five-component ICA problems with randomly selected source distributions. One sees that the run times of exponential-MTICA and Gaussian-MTICA are significantly lower than those obtained by the KGV and RADICAL algorithms.
Algorithm | Run time [sec]
EMTICA | 3.…
GMTICA | 0.…
… | …

8.3 Sensitivity to model mismatch

Here we demonstrate that exponential-MTICA is more robust to model mismatch. To generate model mismatch we used the following noisy linear mixing model:

\[
\mathbf{X} = \mathbf{A}\mathbf{S} + \lambda\mathbf{E}, \tag{32}
\]

where E is a uniformly distributed additive noise vector with statistically independent components having zero mean and unit variance, and λ > 0 is a scaling parameter. The signal-to-noise ratio (SNR) is defined as

\[
\mathrm{SNR} \triangleq \frac{\mathrm{tr}\big[\mathbf{A}\mathbf{A}^T\big]}{p\cdot\lambda^2}. \tag{33}
\]

For each value of SNR, ranging from 0 [dB] to 12 [dB], we performed 3000 Monte-Carlo simulations.

The averaged Amari errors obtained by each algorithm are depicted in Fig. 4. Observe that for SNRs lower than 8 [dB], exponential-MTICA, which does not require whitening, outperforms all other compared algorithms, which are based on whitening and unitary de-mixing. This is due to the fact that for low SNRs the whitened observations significantly deviate from unitary mixing. For higher SNRs one can notice that there is less of a separation performance gap between exponential-MTICA, Gaussian-MTICA, KGV and RADICAL. This is because for higher SNRs the whitened observations admit nearly unitary mixing.
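Under the SNR definition (33), whose exact form is reconstructed here from the degraded source, the noise scale λ realizing a target SNR follows by inversion. The sketch below generates one noisy mixture at 6 dB; the distributions and dimensions mirror the experiment, but the authors' exact data-generation code is not specified beyond this.

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy mixing (32) at a prescribed SNR, using definition (33):
# SNR = tr(A A^T) / (p * lambda^2)  =>  lambda = sqrt(tr(A A^T) / (p * SNR)).
p, N = 5, 1000
A = rng.normal(size=(p, p))
S = rng.laplace(size=(p, N)) / np.sqrt(2.0)             # unit-variance sources
E = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(p, N))   # unit-variance noise

snr_db = 6.0
snr = 10.0 ** (snr_db / 10.0)
lam = np.sqrt(np.trace(A @ A.T) / (p * snr))
X = A @ S + lam * E
```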
[Figure 3 plots the averaged Amari error versus the number of outliers for JADE, EIMAX, FICA, EFICA, KGV, RADICAL, EMTICA, and GMTICA.]

Figure 3: Robustness to outliers. The averaged Amari errors obtained by the compared algorithms versus the number of outliers for five-component ICA with randomly chosen source distributions. One sees that Gaussian-MTICA is less sensitive to outliers than exponential-MTICA. The RADICAL algorithm exhibits the least sensitivity to outliers. However, this comes at the expense of increased computational complexity, as indicated by Table 1 and the run time analysis in Table 3.

[Figure 4 plots the averaged Amari error versus SNR for JADE, EIMAX, FICA, EFICA, KGV, RADICAL, EMTICA, and GMTICA.]

Figure 4: Sensitivity to model mismatch. The averaged Amari errors obtained by the compared algorithms, under the noisy linear mixing model X = AS + λE, versus SNR. Since for low SNRs the whitened observations largely deviate from unitary mixing, exponential-MTICA outperforms all other algorithms that are based on whitening and unitary de-mixing. For high SNRs the whitened observations admit nearly unitary mixing, and thus Gaussian-MTICA, KGV and RADICAL attain performance similar to that of exponential-MTICA.

9. Conclusion

In this paper, a new framework for ICA was proposed that is based on applying a structured transform to the probability distribution of the data. In MTICA the separation matrix is estimated via approximate joint diagonalization of some empirical measure-transformed covariance matrices that are obtained by evaluating the MT-function at different test-points in the parameter space. By specifying the MT-function in the exponential family, the ICA technique proposed in (Yeredor, 2000), called here exponential-MTICA, is obtained. Specification of the MT-function in the Gaussian family resulted in a new ICA algorithm called Gaussian-MTICA. The proposed MTICA approach was tested in simulation examples that illustrated the advantages of exponential-MTICA and Gaussian-MTICA over state-of-the-art algorithms for ICA. It is likely that there exist other classes of MT-functions that may result in other ICA algorithms using the proposed framework.

Appendix A. Proof of Proposition 1

(1) Property 1:
Since φ_u(x) is nonnegative, then by Corollary 2.3.6 in (Athreya and Lahiri, 2006) Q_X^(u) is a measure on 𝒮_𝒳. Furthermore, Q_X^(u)(𝒳) = 1, so that Q_X^(u) is a probability measure on 𝒮_𝒳.

(2) Property 2:

Follows from Definitions 4.1.1 and 4.1.3 in (Athreya and Lahiri, 2006).

(3) Property 3:

According to the definition of φ_u(x) in (10), the strict positivity of u(x), and Property 2, we have that Q_X^(u) is absolutely continuous w.r.t. P_X with strictly positive Radon-Nikodym derivative dQ_X^(u)(x)/dP_X(x) = φ_u(x). Therefore, by Proposition 4.1.2 in (Athreya and Lahiri, 2006) it is implied that P_X is absolutely continuous w.r.t. Q_X^(u) with a strictly positive Radon-Nikodym derivative given by

\[
\frac{dP_{\mathbf{X}}(\mathbf{x})}{dQ^{(u)}_{\mathbf{X}}(\mathbf{x})} = \varphi^{-1}_u(\mathbf{x}). \tag{34}
\]

Using (10) and (34) one can easily verify that φ_u^{-1}(x) = u^{-1}(x) / E[u^{-1}(X); Q_X^(u)].

(4) Property 4:

Let Q_{X_k}^(u) denote the marginal probability measure of Q_X^(u), defined on 𝒮_{𝒳_k}. Additionally, let A_1, ..., A_p denote arbitrary sets in the σ-algebras 𝒮_{𝒳_1}, ..., 𝒮_{𝒳_p}, respectively. Using (7), (9), (10), the assumed statistical independence of X_1, ..., X_p under P_X, and Tonelli's theorem (Folland, 1984):

\[
Q^{(u)}_{\mathbf{X}}(A_1\times\cdots\times A_p) = \int_{A_1\times\cdots\times A_p}\frac{u(\mathbf{x})}{\mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}]}\,dP_{\mathbf{X}}(\mathbf{x}) = \prod_{k=1}^{p}\int_{A_k}\frac{u_k(x_k)}{\mathrm{E}[u_k(X_k);P_{X_k}]}\,dP_{X_k}(x_k), \tag{35}
\]

which implies that

\[
Q^{(u)}_{X_k}(A_k) = Q^{(u)}_{\mathbf{X}}\Big(A_k\times\prod_{i\neq k}\mathcal{X}_i\Big) = \int_{A_k}\frac{u_k(x_k)}{\mathrm{E}[u_k(X_k);P_{X_k}]}\,dP_{X_k}(x_k). \tag{36}
\]

By (35) and (36),

\[
Q^{(u)}_{\mathbf{X}}(A_1\times\cdots\times A_p) = \prod_{k=1}^{p}Q^{(u)}_{X_k}(A_k). \tag{37}
\]

Therefore, since A_1, ..., A_p are arbitrary, X_1, ..., X_p are mutually independent under the transformed probability measure Q_X^(u).

Appendix B. Proof of Proposition 3

The influence function (21) can be written as

\[
\mathrm{IF}_{H_u,P_{\mathbf{X}}}(\mathbf{y}) = \frac{1}{\mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}]}\Big(\big(\sqrt{u(\mathbf{y})}\,\|\mathbf{y}\| + \sqrt{u(\mathbf{y})}\,\|\boldsymbol{\mu}^{(u)}_{\mathbf{X}}\|\big)^2\,\boldsymbol{\psi}(\mathbf{y})\boldsymbol{\psi}^T(\mathbf{y}) - u(\mathbf{y})\,\mathbf{\Sigma}^{(u)}_{\mathbf{X}}\Big), \tag{38}
\]

where

\[
\boldsymbol{\psi}(\mathbf{y}) \triangleq \frac{\mathbf{y}-\boldsymbol{\mu}^{(u)}_{\mathbf{X}}}{\|\mathbf{y}\| + \|\boldsymbol{\mu}^{(u)}_{\mathbf{X}}\|}. \tag{39}
\]

By the triangle inequality, ‖ψ(y)‖ ≤ 1 for all y ∈ R^p, and therefore the matrix term ψ(y)ψ^T(y) is bounded. Thus, the influence function IF_{H_u,P_X}(y) is bounded if u(y) and u(y)‖y‖² are bounded.

Appendix C. Proof of Theorem 1
Define M_X^(h_E)(t) ≜ E[exp(tX); Q_X^(h_E)] as the moment generating function of X under the transformed probability measure Q_X^(h_E) that is associated with the exponential MT-function

\[
h_{\mathrm{E}}(X;t_0) = \exp(t_0 X). \tag{40}
\]

Using (10), (13), (22) and (40) one can verify that

\[
\sigma^{(u_{\mathrm{E}})}_{X}(t_0 + t) = \frac{\partial^2\log M^{(h_{\mathrm{E}})}_{X}(t)}{\partial t^2}. \tag{41}
\]

If the condition in (25) is satisfied, then by (41) and the properties of the moment generating function

\[
M^{(h_{\mathrm{E}})}_{X}(t) = \exp\Big(\mu^{(h_{\mathrm{E}})}_{X}t + \tfrac{1}{2}\sigma^{(h_{\mathrm{E}})}_{X}t^2\Big) \quad \forall t \in (-\epsilon,\epsilon), \tag{42}
\]

where μ_X^(h_E) and σ_X^(h_E) denote the mean and variance of X under Q_X^(h_E), respectively. Since the moment generating function, reduced to any open interval that contains the origin, uniquely determines the distribution (Severini, 2005; DasGupta, 2010), Q_X^(h_E) is a Gaussian probability measure. Hence, by Lemma 1 in Section 4 we conclude that P_X is Gaussian.

Conversely, if P_X is Gaussian then by Lemma 1 in Section 4 the probability measure Q_X^(h_E) is Gaussian, and its corresponding moment generating function M_X^(h_E)(t) must satisfy (42). Therefore, using (41) one obtains σ_X^(u_E)(t) = σ_X^(h_E) for all t ∈ (t_0 − ε, t_0 + ε).

Appendix D. Proof of Theorem 2
We need to prove that the set

\[
\Big\{(\mathbf{t}_1,\mathbf{t}_2)\in\mathbb{R}^p\times\mathbb{R}^p : \mathbf{t}_1\neq\mathbf{t}_2 \textrm{ and } \mathbf{\Lambda}^{(u_{\mathrm{E}})}_{\mathbf{S}}\big(\mathbf{A}^T\mathbf{t}_1,\mathbf{A}^T\mathbf{t}_2\big) \textrm{ does not have distinct diagonal entries}\Big\}
\]

has zero Lebesgue measure. Since the relation µ = A^T t defines a bijective mapping from R^p to R^p, this is equivalent to showing that the Lebesgue measure of the set

\[
D \triangleq \Big\{(\boldsymbol{\mu}_1,\boldsymbol{\mu}_2)\in\mathbb{R}^p\times\mathbb{R}^p : \boldsymbol{\mu}_1\neq\boldsymbol{\mu}_2 \textrm{ and } \mathbf{\Lambda}^{(u_{\mathrm{E}})}_{\mathbf{S}}(\boldsymbol{\mu}_1,\boldsymbol{\mu}_2) \textrm{ does not have distinct diagonal entries}\Big\}
\]

is zero. By the definition of Λ_S^(u_E)(µ_1, µ_2) in Proposition 4, the set D can be written as

\[
D = \bigcup_{j\neq k}^{p} D_{j,k}, \tag{43}
\]

where

\[
D_{j,k} \triangleq \bigg\{(\boldsymbol{\mu}_1,\boldsymbol{\mu}_2)\in\mathbb{R}^p\times\mathbb{R}^p : \boldsymbol{\mu}_1\neq\boldsymbol{\mu}_2 \textrm{ and } \frac{\sigma^{(u_{\mathrm{E}})}_{S_j}(\mu_{1,j})}{\sigma^{(u_{\mathrm{E}})}_{S_j}(\mu_{2,j})} = \frac{\sigma^{(u_{\mathrm{E}})}_{S_k}(\mu_{1,k})}{\sigma^{(u_{\mathrm{E}})}_{S_k}(\mu_{2,k})}\bigg\}, \tag{44}
\]

σ_{S_j}^(u_E)(µ_{i,j}) = [Σ_S^(u_E)(µ_i)]_{j,j}, and µ_{i,j} = [µ_i]_j, i = 1, 2, j = 1, ..., p. Since at most one of the sources is Gaussian, either S_j or S_k must be non-Gaussian for j ≠ k. Let S_k denote the non-Gaussian source. By Theorem 1, the exponential MT-variance σ_{S_k}^(u_E)(µ_{1,k}) is not constant over any open interval. Thus, for almost every (µ_{1,j}, µ_{2,j}, µ_{1,k}, µ_{2,k}) ∈ R⁴ for which the quotients in (44) are finite, the two quotients differ. Hence, the Lebesgue measure of D_{j,k} is zero for any j ≠ k. Therefore, by relation (43) and the sub-additivity of Lebesgue measure, the set D must have zero Lebesgue measure.

Appendix E. Proof of Theorem 3
Define M_X^(h_G)(t) ≜ E[exp(tX); Q_X^(h_G)] as the moment generating function of X under the transformed probability measure Q_X^(h_G) associated with the Gaussian MT-function

\[
h_{\mathrm{G}}(X;t_0,\tau) = \exp\bigg(-\frac{(X-t_0)^2}{2\tau^2}\bigg). \tag{45}
\]

Using (10), (13), (28) and (45) one can verify that

\[
\sigma^{(u_{\mathrm{G}})}_{X}(t_0+t,\tau) = \tau^4\,\frac{\partial^2\log M^{(h_{\mathrm{G}})}_{X}\big(t/\tau^2\big)}{\partial t^2}. \tag{46}
\]

If the condition in (29) is satisfied, then by (46) and the properties of the moment generating function

\[
M^{(h_{\mathrm{G}})}_{X}(t) = \exp\Big(\mu^{(h_{\mathrm{G}})}_{X}t + \tfrac{1}{2}\sigma^{(h_{\mathrm{G}})}_{X}t^2\Big) \quad \forall t\in(-\epsilon',\epsilon'), \tag{47}
\]

where μ_X^(h_G) and σ_X^(h_G) denote the mean and the variance of X under Q_X^(h_G), respectively, and ε' ≜ ε/τ². Since the moment generating function, reduced to any open interval that contains the origin, uniquely determines the distribution (Severini, 2005; DasGupta, 2010), it is implied that Q_X^(h_G) is a Gaussian probability measure. Hence, by Lemma 2 in Section 5 we conclude that P_X is Gaussian.

Conversely, if P_X is Gaussian then by Lemma 2 in Section 5 the probability measure Q_X^(h_G) is Gaussian, and its corresponding moment generating function M_X^(h_G)(t) must satisfy (47). Therefore, using (46) one obtains σ_X^(u_G)(t, τ) = σ_X^(h_G) for all t ∈ (t_0 − ε, t_0 + ε).

Appendix F. Proof of Theorem 4
We need to prove that the set {t ∈ R^p : Σ_S^(u_G)(U^T t, τ) does not have distinct diagonal entries} has zero Lebesgue measure. Since the relation µ = U^T t defines a bijective mapping from R^p to R^p, it is sufficient to show that the Lebesgue measure of the set

\[
D \triangleq \Big\{\boldsymbol{\mu}\in\mathbb{R}^p : \mathbf{\Sigma}^{(u_{\mathrm{G}})}_{\mathbf{S}}(\boldsymbol{\mu},\tau) \textrm{ does not have distinct diagonal entries}\Big\} \tag{48}
\]

is zero. Clearly, the set D can be written as

\[
D = \bigcup_{j\neq k}^{p} D_{j,k}, \tag{49}
\]

where

\[
D_{j,k} \triangleq \Big\{\boldsymbol{\mu}\in\mathbb{R}^p : \sigma^{(u_{\mathrm{G}})}_{S_j}(\mu_j,\tau) = \sigma^{(u_{\mathrm{G}})}_{S_k}(\mu_k,\tau)\Big\}, \tag{50}
\]

σ_{S_j}^(u_G)(µ_j, τ) = [Σ_S^(u_G)(µ, τ)]_{j,j} and µ_j = [µ]_j. Since at most one of the sources is Gaussian, either S_j or S_k must be non-Gaussian for j ≠ k. Let S_k denote the non-Gaussian source. By Theorem 3, the Gaussian MT-variance σ_{S_k}^(u_G)(µ_k, τ) is not constant w.r.t. µ_k over any open interval. Thus, for almost every (µ_j, µ_k) ∈ R² we have σ_{S_j}^(u_G)(µ_j, τ) ≠ σ_{S_k}^(u_G)(µ_k, τ). Hence, the Lebesgue measure of D_{j,k} is zero for any j ≠ k. Therefore, by relation (49) and the sub-additivity of Lebesgue measure, the set D must have zero Lebesgue measure.

Appendix G. Choice of MT-function parameters
G.1 Exponential MTICA
Assume that t_1, ..., t_M are independent samples from some continuous probability distribution. According to Theorem 2, if at most one of the sources is Gaussian, then for any pair (t_m, t_n), m ≠ n, Assumption 3 in Proposition 4 is satisfied with probability 1, which leads to unique identification of A based on the corresponding MT-covariance matrices Σ_X^(u_E)(t_m) and Σ_X^(u_E)(t_n).

Motivated by this result we propose the following procedure, which randomly generates test-points inside the unit l_2-ball:

(1) Generate M i.i.d. samples r_m ∈ R^p, m = 1, ..., M, from the standard normal distribution.

(2) Generate M i.i.d. samples c_m ∈ R, m = 1, ..., M, with uniform distribution on [0, 1].

(3) Set t_m = c_m r_m / ‖r_m‖, m = 1, ..., M.
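The three steps above translate directly to numpy; the helper name is ours.

```python
import numpy as np

def exp_test_points(M, p, rng=np.random.default_rng()):
    """Random test-points in the unit l2-ball, per the G.1 procedure."""
    r = rng.normal(size=(M, p))        # step (1): Gaussian directions
    c = rng.uniform(size=(M, 1))       # step (2): uniform radii on [0, 1]
    return c * r / np.linalg.norm(r, axis=1, keepdims=True)   # step (3)
```

G.2 Gaussian MTICA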
Assume that t_1, ..., t_M are independent samples from some continuous probability distribution. According to Theorem 4, if at most one of the sources is Gaussian, then for any m = 1, ..., M, the Gaussian MT-covariance Σ_Z^(u_G)(t_m, τ) in (30) has distinct eigenvalues with probability 1, which leads to unique identification of the mixing matrix A.

Motivated by this result, and assuming that the data is centered and whitened, we propose to generate M i.i.d. vectors t_m, m = 1, ..., M, such that the components of each t_m are statistically independent with zero mean and unit variance. In all considered examples we used the beta distribution with identical shape parameters α = β = 3.
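A sketch of this generator follows. Beta(3, 3) has mean 1/2 and variance 1/28, so the draws are shifted and scaled to zero mean and unit variance; this standardization step is our inference from the stated zero-mean, unit-variance requirement, and the helper name is ours.

```python
import numpy as np

def gauss_test_points(M, p, rng=np.random.default_rng()):
    """Test-points with i.i.d. zero-mean, unit-variance components, per G.2."""
    b = rng.beta(3.0, 3.0, size=(M, p))   # Beta(3, 3): mean 1/2, variance 1/28
    return (b - 0.5) * np.sqrt(28.0)      # standardize to zero mean, unit variance
```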
References

S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems, pages 757–763, 1996.

K. B. Athreya and S. N. Lahiri. Measure Theory and Probability Theory. Springer-Verlag, 2006.

F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

A. D. Back and A. S. Weigend. A first application of independent component analysis to extracting structure from stock returns. International Journal of Neural Systems, 8(4):473–484, 1997.

A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

A. Belouchrani, K. Abed-Meraim, J. F. Cardoso, and E. Moulines. A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing, 45(2):434–444, 1997.

R. Boscolo, H. Pan, and V. P. Roychowdhury. Independent component analysis based on nonparametric density estimation. IEEE Transactions on Neural Networks, 15(1):55–65, 2004.

V. Calhoun, G. Pearlson, and T. Adalı. Independent component analysis applied to fMRI data: a generative model for validating results. The Journal of VLSI Signal Processing, 37(2):281–291, 2004.

J. F. Cardoso. High-order contrasts for independent component analysis. Neural Computation, 11(1):157–192, 1999.

J. F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. Radar and Signal Processing, IEE Proceedings F, 140(6):362–370, 1993.

J. F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161–164, 1996.

P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

D. R. Cox and D. V. Hinkley. Theoretical Statistics. Chapman and Hall, New York, NY, 1974.

A. DasGupta. Fundamentals of Probability: A First Course. Springer-Verlag, 2010.

A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1999.

J. Eriksson and V. Koivunen. Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11(7):601–604, 2004.

E. M. Fadaili, N. Thirion-Moreau, and E. Moreau. Non-orthogonal joint diagonalization/zero-diagonalization for source separation based on time-frequency distributions. IEEE Transactions on Signal Processing, 55(5):1673–1687, 2007.

B. N. Flury and W. Gautschi. An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form. SIAM Journal on Scientific and Statistical Computing, 7(1):169–184, 1986.

G. B. Folland. Real Analysis. John Wiley and Sons, 1984.

F. R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.

A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483–1492, 1997.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.

A. Kagan, Y. Linnik, and C. Rao. Characterization Problems in Mathematical Statistics. Wiley, 1973.

Z. Koldovský, P. Tichavský, and E. Oja. Efficient variant of algorithm FastICA for independent component analysis attaining the Cramér-Rao lower bound. IEEE Transactions on Neural Networks, 17(5):1265–1277, 2006.

E. G. Learned-Miller and J. W. Fisher. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271–1295, 2003.

T. W. Lee, M. Girolami, and T. J. Sejnowski. Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural Computation, 11(2):417–441, 1999.

X. L. Li and T. Adalı. Independent component analysis by entropy bound minimization. IEEE Transactions on Signal Processing, 58(10):5151–5164, 2010.

X. L. Li and X. D. Zhang. Nonorthogonal joint diagonalization free of degenerate solution. IEEE Transactions on Signal Processing, 55(5):1803–1814, 2007.

C. J. Liu, T. S. Lee, and C. C. Chiu. Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems, 47(2):115–125, 2009.

S. Makeig, A. J. Bell, T. P. Jung, and T. J. Sejnowski. Independent component analysis of electroencephalographic data. Advances in Neural Information Processing Systems, 8:145–151, 1996.

D. T. Pham. Joint approximate diagonalization of positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 22(4):1136–1152, 2001.

D. T. Pham and P. Garat. Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Transactions on Signal Processing, 45(7):1712–1725, 1997.

T. Qin, X. Guan, W. Li, and P. Wang. Monitoring abnormal traffic flows based on independent component analysis. In Proceedings of the IEEE International Conference on Communications (ICC'09), pages 1–5, 2009.

T. A. Severini. Elements of Distribution Theory. Cambridge University Press, 2005.

H. Shen, S. Jegelka, and A. Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 57(9):3498–3511, 2009.

A. Slapak and A. Yeredor. "Weighting for more": Enhancing characteristic-function based ICA with asymptotically optimal weighting. Signal Processing, 91(8):2016–2027, 2011.

P. Tichavský and A. Yeredor. Fast approximate joint diagonalization incorporating weight matrices. IEEE Transactions on Signal Processing, 57(3):878–891, 2009.

K. Todros and A. O. Hero. On measure transformed canonical correlation analysis. IEEE Transactions on Signal Processing, 60(9):4570–4585, 2012.

K. Todros and J. Tabrikian. Blind separation of independent sources using Gaussian mixture model. IEEE Transactions on Signal Processing, 55(7):3645–3658, 2007.

K. Todros and J. Tabrikian. QML-based joint diagonalization of positive-definite Hermitian matrices. IEEE Transactions on Signal Processing, 58(9):4656–4673, 2010.

R. Vollgraf and K. Obermayer. Quadratic optimization for simultaneous matrix diagonalization. IEEE Transactions on Signal Processing, 54(9):3270–3278, 2006.

A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.

A. Yeredor. Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. IEEE Transactions on Signal Processing, 50(7):1545–1553, 2002.

L. Zhao, Y. Jiawei, Y. Junliang, and C. Ting. An independent component analysis based multiuser MIMO downlink transmission scheme. In Proceedings of the IEEE International Conference on Networks Security, Wireless Communications and Trusted Computing (NSWCTC'09), volume 1, pages 374–377, 2009.

A. Ziehe, P. Laskov, G. Nolte, and K. R. Müller. A fast algorithm for joint diagonalization with nonorthogonal transformations and its application to blind source separation. Journal of Machine Learning Research, 5:777–800, 2004.