Measure Transformed Independent Component Analysis
Koby Todros [email protected]
Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
Alfred O. Hero [email protected]
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48105, U.S.A.
Abstract
In this paper we derive a new framework for independent component analysis (ICA), called measure-transformed ICA (MTICA), that is based on applying a structured transform to the probability distribution of the observation vector, i.e., a transformation of the probability measure defined on its observation space. By judicious choice of the transform we show that the separation matrix can be uniquely determined via diagonalization of several measure-transformed covariance matrices. In MTICA the separation matrix is estimated via approximate joint diagonalization of several empirical measure-transformed covariance matrices. Unlike kernel-based ICA techniques, where the transformation is applied repetitively to some affine mappings of the observation vector, in MTICA the transformation is applied only once, to the probability distribution of the observations. This results in performance advantages and reduced implementation complexity. Simulations demonstrate the advantages of the proposed approach as compared to other existing state-of-the-art methods for ICA.
Keywords:
Approximate joint diagonalization, blind source separation, independent component analysis, probability measure transform.
1. Introduction
Independent component analysis (ICA) is a technique for multivariate data analysis that aims at decomposing an observed random vector into a linear combination of mutually independent random variables (Comon, 1994; Hyvärinen et al., 2001). The observation vector is assumed to be generated by an unknown linear mixture of mutually independent latent variables, called sources, with unknown distributions. The coefficient matrix of the linear mixture is called the mixing matrix and is assumed to be invertible. Given a sequence of i.i.d. samples from the distribution of the observed vector, ICA aims to estimate the inverse of the mixing matrix, called the separation matrix, which is used for recovering the sources. Unlike principal component analysis, ICA can deal with a general mixing structure, which is not constrained to be orthogonal. The mutual independence assumption is plausible in a wide variety of fields, including telecommunications (Qin et al., 2009; Zhao et al., 2009), finance (Back and Weigend, 1997; Liu et al., 2009), and biomedical signal analysis (Makeig et al., 1996).
2. Independent component analysis: Review
Let X = [X_1, ..., X_p]^T denote a random vector whose observation space is 𝒳 ⊆ R^p. We define the measure space (𝒳, 𝒮_𝒳, P_X), where 𝒮_𝒳 is a σ-algebra over 𝒳 and P_X is the joint probability measure on 𝒮_𝒳. Let 𝒳_k denote the observation space of X_k. The marginal probability measure of P_X on 𝒮_{𝒳_k} is denoted by P_{X_k}, where 𝒮_{𝒳_k} is the σ-algebra over 𝒳_k. Let g(·) denote an integrable scalar function on 𝒳. The expectation of g(X) under P_X is defined as

\[
\mathrm{E}[g(\mathbf{X});P_{\mathbf{X}}] \triangleq \int_{\mathcal{X}} g(\mathbf{x})\,dP_{\mathbf{X}}(\mathbf{x}), \tag{1}
\]

where x ∈ 𝒳. The components of X are mutually independent under P_X if

\[
\mathrm{E}[g(X_j)h(X_k);P_{\mathbf{X}}] = \mathrm{E}[g(X_j);P_{X_j}]\,\mathrm{E}[h(X_k);P_{X_k}] \quad \forall j \neq k, \tag{2}
\]

for all integrable scalar functions g(·), h(·) on 𝒳. The components of X are mutually uncorrelated under P_X if

\[
\mathrm{E}[X_j X_k;P_{\mathbf{X}}] = \mathrm{E}[X_j;P_{X_j}]\,\mathrm{E}[X_k;P_{X_k}] \quad \forall j \neq k. \tag{3}
\]

The empirical distribution P̂_X, based on a sequence of samples X_n, n = 1, ..., N, is specified by

\[
\hat{P}_{\mathbf{X}}(A) = \frac{1}{N}\sum_{n=1}^{N}\delta_{\mathbf{X}_n}(A), \tag{4}
\]

where A ∈ 𝒮_𝒳 and δ_{X_n}(·) is the Dirac probability measure at X_n (Folland, 1984).

The instantaneous noiseless ICA model takes the following form:

\[
\mathbf{X} = \mathbf{A}\mathbf{S}, \tag{5}
\]

where X ∈ R^p, p ≥ 2, is an observed random vector, A ∈ R^{p×p} is an invertible unknown matrix, called the mixing matrix, and S ∈ R^p is a latent random vector comprised of mutually independent variables having finite second-order moments and unknown distributions. The components of S are also called sources. Under the model (5) it has been shown that the mixing matrix A can be uniquely identified, up to permutation and scaling of its columns, if and only if at most one of the sources is Gaussian (Kagan et al., 1973; Comon, 1994; Hyvärinen et al., 2001; Eriksson and Koivunen, 2004). Given a sequence of i.i.d. samples from P_X, ICA aims to estimate the separation matrix B = A^{-1} and thus recover the sources using the relation S = BX.

Many state-of-the-art ICA algorithms, such as JADE, FICA, EFICA, EIMAX, KGV and RADICAL, referenced in Section 1, apply whitening to the observed vector X. The whitened observation vector is represented as

\[
\mathbf{Z} \triangleq \mathbf{W}\mathbf{X} = \mathbf{U}\mathbf{S}, \tag{6}
\]

where W ∈ R^{p×p} is the whitening matrix and U ≜ WA. Assuming, without loss of generality, that the components of S have unit variances, one can easily verify that the matrix U is unitary, leading to a unitary mixing model. Let V ≜ U^T, where (·)^T denotes the transpose operator. ICA algorithms that use whitening implement an estimate of V using constrained optimization over the Stiefel manifold of unitary matrices (Edelman et al., 1999). The empirical separation matrix is then obtained using the relation B̂ = V̂Ŵ.
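As an illustration of the whitening step (6), the following minimal numpy sketch generates a toy instance of the model (5) and whitens it using the inverse symmetric square root of the sample covariance. The Laplace sources, the dimensions, and the square-root construction are illustrative assumptions, not choices made in this paper (any W with W Σ_X W^T = I is a valid whitening matrix).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of model (5): p = 3 mutually independent non-Gaussian sources.
p, N = 3, 1000
S = rng.laplace(size=(p, N)) / np.sqrt(2.0)   # zero mean, unit variance sources
A = rng.normal(size=(p, p))                   # unknown invertible mixing matrix
X = A @ S                                     # observations, one column per sample

# One choice of whitening matrix: the inverse symmetric square root of the
# sample covariance, so that Z = W X has identity sample covariance as in (6).
eigval, eigvec = np.linalg.eigh(np.cov(X))
W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
Z = W @ X
print(np.allclose(np.cov(Z), np.eye(p)))      # True: Z is whitened
```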
3. Measure transformed ICA
In this section the MTICA procedure is presented. First, a transform is defined that maps a probability measure P_X into a set of probability measures {Q_X^(u)} on 𝒮_𝒳 and that preserves mutual independence between the components of X under P_X. Second, we define the measure-transformed covariance and derive its strongly consistent estimate, which is also shown to be Fisher consistent (Cox and Hinkley, 1974). Robustness of the empirical measure-transformed covariance to outliers is studied by analyzing the boundedness of its influence function (Hampel, 1974). Finally, based on the mixing models (5), (6), the MTICA procedure is specified by performing approximate joint diagonalization of a set of empirical measure-transformed covariance matrices.

3.1 Probability measure transform

Definition 1. Given a probability measure P_X and a non-negative function u : R^p → R_+ satisfying

\[
u(\mathbf{x}) = \prod_{k=1}^{p} u_k(x_k), \qquad u_k : \mathbb{R} \to \mathbb{R}_+, \; k = 1,\ldots,p, \tag{7}
\]

and

\[
0 < \mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}] < \infty, \tag{8}
\]

a transform on P_X is defined via the following relation:

\[
Q^{(u)}_{\mathbf{X}}(A) \triangleq T_u[P_{\mathbf{X}}](A) = \int_{A} \varphi_u(\mathbf{x})\,dP_{\mathbf{X}}(\mathbf{x}), \tag{9}
\]

where A ∈ 𝒮_𝒳, x = [x_1, ..., x_p]^T ∈ 𝒳, and

\[
\varphi_u(\mathbf{x}) \triangleq \frac{u(\mathbf{x})}{\mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}]}. \tag{10}
\]

The function u(·), associated with the transform T_u[·], is called the MT-function.

In the following proposition, some properties of the measure transform (9) are given.
Proposition 1.
Let Q_X^(u) be defined by relation (9). Then:

(1) Q_X^(u) is a probability measure on 𝒮_𝒳.

(2) Q_X^(u) is absolutely continuous w.r.t. P_X, with Radon-Nikodym derivative (Folland, 1984) given by

\[
\frac{dQ^{(u)}_{\mathbf{X}}(\mathbf{x})}{dP_{\mathbf{X}}(\mathbf{x})} = \varphi_u(\mathbf{x}). \tag{11}
\]

(3) Assume that the MT-function u(·) is strictly positive. Then P_X is absolutely continuous w.r.t. Q_X^(u), with a strictly positive Radon-Nikodym derivative given by

\[
\frac{dP_{\mathbf{X}}(\mathbf{x})}{dQ^{(u)}_{\mathbf{X}}(\mathbf{x})} = \varphi^{-1}_u(\mathbf{x}) = \frac{u^{-1}(\mathbf{x})}{\mathrm{E}\big[u^{-1}(\mathbf{X});Q^{(u)}_{\mathbf{X}}\big]}. \tag{12}
\]

(4) If X_1, ..., X_p are mutually independent under P_X, then they are mutually independent under Q_X^(u).

[A proof is given in Appendix A]

The probability measure Q_X^(u) is said to be generated by the MT-function u(·). By modifying u(·), such that the conditions (7), (8) are satisfied, virtually any probability measure on 𝒮_𝒳 can be obtained.

3.2 The measure-transformed covariance

According to (1) and (11) the measure-transformed covariance of X under Q_X^(u) is given by

\[
\mathbf{\Sigma}^{(u)}_{\mathbf{X}} = \mathrm{E}\big[\mathbf{X}\mathbf{X}^T\varphi_u(\mathbf{X});P_{\mathbf{X}}\big] - \boldsymbol{\mu}^{(u)}_{\mathbf{X}}\boldsymbol{\mu}^{(u)T}_{\mathbf{X}}, \tag{13}
\]

where

\[
\boldsymbol{\mu}^{(u)}_{\mathbf{X}} \triangleq \mathrm{E}[\mathbf{X}\varphi_u(\mathbf{X});P_{\mathbf{X}}] \tag{14}
\]

is the measure-transformed expectation of X under Q_X^(u). Equation (13) implies that Σ_X^(u) is a weighted covariance matrix of X under P_X, with weighting function φ_u(·). Hence, Σ_X^(u) can be estimated using only samples from the distribution P_X. By modifying the MT-function u(·), such that the conditions (7), (8) are satisfied, the MT-covariance matrix under Q_X^(u) is modified. In particular, by choosing u(x) ≡ 1, we have Q_X^(u) = P_X, and the standard covariance matrix is obtained.

In the following proposition a strongly consistent estimate of the measure-transformed covariance is given that is based on i.i.d. samples from the probability distribution P_X.

Proposition 2.
Let X_n, n = 1, ..., N, denote a sequence of i.i.d. samples from the distribution P_X, and define the empirical covariance estimate

\[
\hat{\mathbf{\Sigma}}^{(u)}_{\mathbf{X}} \triangleq \sum_{n=1}^{N}\mathbf{X}_n\mathbf{X}^T_n\hat{\varphi}_u(\mathbf{X}_n) - \hat{\boldsymbol{\mu}}^{(u)}_{\mathbf{X}}\hat{\boldsymbol{\mu}}^{(u)T}_{\mathbf{X}}, \tag{15}
\]

where

\[
\hat{\boldsymbol{\mu}}^{(u)}_{\mathbf{X}} \triangleq \sum_{n=1}^{N}\mathbf{X}_n\hat{\varphi}_u(\mathbf{X}_n), \tag{16}
\]

and

\[
\hat{\varphi}_u(\mathbf{X}_n) \triangleq \frac{u(\mathbf{X}_n)}{\sum_{n'=1}^{N}u(\mathbf{X}_{n'})}. \tag{17}
\]

Assume

\[
\mathrm{E}\big[u^2(\mathbf{X});P_{\mathbf{X}}\big] < \infty \quad \textrm{and} \quad \mathrm{E}\big[X_k^4;P_{\mathbf{X}}\big] < \infty \;\; \forall k = 1,\ldots,p. \tag{18}
\]

Then Σ̂_X^(u) → Σ_X^(u) almost surely as N → ∞. [The proof is similar to the proof of Proposition 3 in (Todros and Hero, 2012) and therefore is omitted]

Note that for u(X) ≡ 1, (N/(N − 1)) Σ̂_X^(u) reduces to the standard unbiased estimator of the covariance matrix Σ_X. Also notice that Σ̂_X^(u) can be written as a statistical functional H_u[·] of the empirical probability measure P̂_X (4), i.e.,

\[
\hat{\mathbf{\Sigma}}^{(u)}_{\mathbf{X}} = \frac{\mathrm{E}\big[\mathbf{X}\mathbf{X}^T u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]}{\mathrm{E}\big[u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]} - \frac{\mathrm{E}\big[\mathbf{X}u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]\,\mathrm{E}\big[\mathbf{X}^T u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]}{\mathrm{E}^2\big[u(\mathbf{X});\hat{P}_{\mathbf{X}}\big]} \triangleq H_u\big[\hat{P}_{\mathbf{X}}\big]. \tag{19}
\]

According to (10), (13), (14), and (19), when P̂_X is replaced by the true probability measure P_X we have H_u[P_X] = Σ_X^(u), which implies that Σ̂_X^(u) is Fisher consistent (Cox and Hinkley, 1974).

3.3 Robustness of the empirical MT-covariance to outliers

Here, we study the robustness of the empirical MT-covariance Σ̂_X^(u) to outliers using its influence function. Define the probability measure P_ε ≜ (1 − ε)P_X + ε δ_y, where 0 ≤ ε ≤ 1, y ∈ R^p, and δ_y is a Dirac probability measure at y. The influence function of a Fisher consistent estimator with statistical functional H[·] at probability distribution P_X is defined pointwise as (Hampel, 1974):

\[
\mathrm{IF}_{H,P_{\mathbf{X}}}(\mathbf{y}) \triangleq \lim_{\epsilon\to 0}\frac{H[P_{\epsilon}] - H[P_{\mathbf{X}}]}{\epsilon} = \frac{\partial H[P_{\epsilon}]}{\partial\epsilon}\bigg|_{\epsilon=0}. \tag{20}
\]

The influence function describes the effect on the estimator of an infinitesimal contamination at the point y. An estimator is said to be B-robust if its influence function is bounded. Using (10), (13), (14), (19) and (20) one can verify that the influence function of the empirical MT-covariance is given by

\[
\mathrm{IF}_{H_u,P_{\mathbf{X}}}(\mathbf{y}) = \frac{u(\mathbf{y})}{\mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}]}\Big(\big(\mathbf{y}-\boldsymbol{\mu}^{(u)}_{\mathbf{X}}\big)\big(\mathbf{y}-\boldsymbol{\mu}^{(u)}_{\mathbf{X}}\big)^T - \mathbf{\Sigma}^{(u)}_{\mathbf{X}}\Big). \tag{21}
\]

The following proposition states a sufficient condition for boundedness of (21).

Proposition 3.
The influence function (21) is bounded if the MT-function u(y) is bounded and there exists a constant c > 0 such that u(y) ≤ c‖y‖^{-2}, where ‖·‖ denotes the l_2-norm. [A proof is given in Appendix B]

In Section 5 we show that MT-functions chosen from the Gaussian family of functions satisfy these conditions, resulting in a measure-transformed ICA algorithm that is resilient to outliers.
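The estimator (15)-(17) reduces to a few lines of code once the weights are normalized. The following sketch is ours, not the authors' implementation; the function names are illustrative, and the u(x) ≡ 1 check mirrors the remark after Proposition 2.

```python
import numpy as np

def mt_covariance(X, u):
    """Empirical MT-covariance, eqs. (15)-(17).

    X : (N, p) array of i.i.d. samples from P_X.
    u : MT-function mapping an (N, p) array to N non-negative weights.
    """
    w = u(X)
    phi = w / w.sum()                  # normalized empirical weights, eq. (17)
    mu = phi @ X                       # empirical MT-mean, eq. (16)
    Xc = X - mu
    return (Xc * phi[:, None]).T @ Xc  # eq. (15), written in centered form

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))

# u(x) = 1 recovers the (biased) sample covariance, as noted above:
S1 = mt_covariance(X, lambda x: np.ones(len(x)))
print(np.allclose(S1, np.cov(X.T, bias=True)))   # True

# A bounded, factorizable MT-function (Gaussian, as used later in Section 5)
# yields instead a locally weighted covariance around the point t:
t, tau = np.zeros(2), 1.0
S2 = mt_covariance(X, lambda x: np.exp(-np.sum((x - t) ** 2, axis=1) / (2 * tau ** 2)))
```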
3.4 The MTICA procedure

In MTICA we choose a sequence of MT-functions u_m(·), m = 1, ..., M, that satisfies at least one of the following conditions:

(1) Under the ICA model (5) the separation matrix B is the unique matrix (up to permutation and scaling of its rows) that jointly diagonalizes the MT-covariance matrices Σ_X^(u_m), m = 1, ..., M.

(2) Under the unitary mixing model (6) the matrix V = U^T is the unique matrix (up to permutation and sign of its rows) that jointly diagonalizes the MT-covariance matrices Σ_Z^(u_m), m = 1, ..., M.

When the first condition is satisfied, the separation matrix B is estimated via NOAJD of the empirical MT-covariances Σ̂_X^(u_m), m = 1, ..., M. The NOAJD (Pham, 2001; Yeredor, 2002; Ziehe et al., 2004; Vollgraf and Obermayer, 2006; Fadaili et al., 2007; Li and Zhang, 2007; Todros and Tabrikian, 2010) seeks a non-singular matrix B̂ ∈ R^{p×p} such that B̂ Σ̂_X^(u_m) B̂^T, m = 1, ..., M, are "as diagonal as possible" in the sense that a deviation measure from diagonality is minimized. The MTICA procedure in this case is summarized in Algorithm 1.

Algorithm 1 MTICA with no whitening
Input:
A sequence of data samples X_n, n = 1, ..., N.

1. Choose a sequence of MT-functions u_m(·), m = 1, ..., M, such that B is the unique joint diagonalization matrix of Σ_X^(u_m), m = 1, ..., M.
2. Using (15)-(17), derive the empirical MT-covariances Σ̂_X^(u_m), m = 1, ..., M.
3. Find the NOAJD matrix B̂ of Σ̂_X^(u_m), m = 1, ..., M.

Output:
The empirical separation matrix B̂.

Alternatively, when the second condition is satisfied, the observations are whitened, and the estimate of V is obtained via OAJD of the empirical MT-covariance matrices Σ̂_Ẑ^(u_m), m = 1, ..., M, where Ẑ ≜ ŴX and Ŵ is the empirical whitening matrix. The OAJD (Flury and Gautschi, 1986; Cardoso and Souloumiac, 1996) seeks a unitary matrix V̂ ∈ R^{p×p} such that V̂ Σ̂_Ẑ^(u_m) V̂^T, m = 1, ..., M, are "as diagonal as possible" by, once again, minimizing a deviation measure from diagonality. The empirical separation matrix is obtained by taking B̂ = V̂Ŵ. The MTICA procedure in this case is summarized in Algorithm 2.
MTICA with whitening
Input:
A sequence of data samples X_n, n = 1, ..., N.

1. Choose a sequence of MT-functions u_m(·), m = 1, ..., M, such that V is the unique joint diagonalization matrix of Σ_Z^(u_m), m = 1, ..., M.
2. Estimate the whitening matrix Ŵ.
3. Generate the sequence Ẑ_n = ŴX_n, n = 1, ..., N.
4. Using (15)-(17), derive the empirical MT-covariances Σ̂_Ẑ^(u_m), m = 1, ..., M.
5. Find the OAJD matrix V̂ of Σ̂_Ẑ^(u_m), m = 1, ..., M.

Output:
Obtain an estimate of B by taking B̂ = V̂Ŵ.

By modifying the MT-functions such that the stated conditions are satisfied, a family of measure-transformed independent component analyses can be obtained. Particular choices of MT-functions, leading to the exponential and Gaussian MTICA algorithms, are discussed in the next sections.
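For readers who want to see the OAJD step of Algorithm 2 in code, here is a compact Jacobi-rotation sketch for a set of real symmetric matrices. It is a textbook variant of the simultaneous-diagonalization idea behind the FG and Cardoso-Souloumiac algorithms cited above, not a transcription of them; the closed-form angle below minimizes the summed squared (i, j) off-diagonal entries over all input matrices.

```python
import numpy as np

def oajd(mats, sweeps=100, tol=1e-12):
    """Orthogonal approximate joint diagonalization by Jacobi rotations.

    mats : sequence of M real symmetric p x p matrices.
    Returns V_hat such that V_hat @ C @ V_hat.T is nearly diagonal for each C.
    """
    C = np.stack([np.asarray(m, dtype=float) for m in mats])  # (M, p, p)
    p = C.shape[1]
    V = np.eye(p)
    for _ in range(sweeps):
        converged = True
        for i in range(p - 1):
            for j in range(i + 1, p):
                u = C[:, i, i] - C[:, j, j]
                v = 2.0 * C[:, i, j]
                # Givens angle minimizing sum_m of the squared (i, j) entries.
                theta = 0.25 * np.arctan2(2.0 * np.dot(u, v),
                                          np.dot(u, u) - np.dot(v, v))
                if abs(np.sin(theta)) > tol:
                    converged = False
                    R = np.eye(p)
                    c, s = np.cos(theta), np.sin(theta)
                    R[i, i] = R[j, j] = c
                    R[i, j], R[j, i] = -s, s
                    C = R.T @ C @ R        # broadcast over all M matrices
                    V = V @ R
        if converged:
            break
    return V.T

```

With the empirical MT-covariances Σ̂_Ẑ^(u_m) as input, the returned matrix plays the role of V̂ in Algorithm 2.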
4. Exponential-MTICA
In this section we parameterize the MT-function u(·; t), with scaling parameter t ∈ R^p, under the exponential family of functions. Under this choice of MT-function the MT-covariance is given by the Hessian of the cumulant generating function, resulting in the ICA algorithm proposed in (Yeredor, 2000).

Let u_E(·; ·) be defined as the parameterized function

\[
u_{\mathrm{E}}(\mathbf{x};\mathbf{t}) \triangleq \exp\big(\mathbf{t}^T\mathbf{x}\big), \tag{22}
\]

where t ∈ R^p. Using (10), (13) and (22) one can easily verify that the covariance matrix of X under Q_X^(u_E) takes the form

\[
\mathbf{\Sigma}^{(u_{\mathrm{E}})}_{\mathbf{X}}(\mathbf{t}) = \frac{\partial^2\log M_{\mathbf{X}}(\mathbf{t})}{\partial\mathbf{t}\,\partial\mathbf{t}^T}, \tag{23}
\]

where

\[
M_{\mathbf{X}}(\mathbf{t}) \triangleq \mathrm{E}\big[\exp\big(\mathbf{t}^T\mathbf{X}\big);P_{\mathbf{X}}\big] \tag{24}
\]

is the moment generating function of X, and it is assumed that M_X(t) is finite in some open region in R^p containing the origin. Note that the covariance matrix in (23) involves higher-order statistics of X. Additionally, observe that Σ_X^(u_E)(t) reduces to the standard covariance matrix Σ_X for t = 0.

In the following lemma, directly following from (10)-(12) and the definition of the exponential MT-function (22), one sees that the Gaussian family of probability measures is closed under the measure transformation (9) when generated by the exponential MT-function.
A random vector X is Gaussian under the probability measure P_X if and only if it remains Gaussian under the transformed probability measure Q_X^(u_E) generated by the exponential MT-function u_E(·; ·).

This property is used in proving the following theorem, which states a necessary and sufficient condition for Gaussianity of a random variable X based on its exponential MT-variance.

Theorem 1.
A random variable X with corresponding probability measure P_X is Gaussian if and only if the exponential MT-variance satisfies

\[
\sigma^{(u_{\mathrm{E}})}_{X}(t) = c \quad \forall t \in (t_0-\epsilon,\, t_0+\epsilon), \tag{25}
\]

where c and ε are some positive constants and t_0 is an arbitrary point in R. [A proof is given in Appendix C]

Hence, if a random variable X is non-Gaussian then its exponential MT-variance σ_X^(u_E)(t) cannot be constant over any open interval. This property is used below to establish identifiability of the mixing matrix A.

Using (5), (10), (13) and (22) it can be shown that for any choice of the scaling parameter t the exponential MT-covariance of the observation vector X has the following structure:

\[
\mathbf{\Sigma}^{(u_{\mathrm{E}})}_{\mathbf{X}}(\mathbf{t}) = \mathbf{A}\,\mathbf{\Sigma}^{(u_{\mathrm{E}})}_{\mathbf{S}}\big(\mathbf{A}^T\mathbf{t}\big)\,\mathbf{A}^T, \tag{26}
\]

where Σ_S^(u_E)(·) is the covariance matrix of the latent vector S under the transformed probability measure Q_S^(u_E). Since the components of S are mutually independent under P_S, by Property 4 in Proposition 1 they are mutually independent under Q_S^(u_E), and therefore Σ_S^(u_E)(·) must be diagonal. Thus, the following property follows directly from (26):

Proposition 4. Let t_1 and t_2, t_1 ≠ t_2, denote two arbitrary points in R^p. Assume that

(1) the matrices Σ_S^(u_E)(A^T t_1) and Σ_S^(u_E)(A^T t_2) have finite diagonal entries,

(2) the diagonal entries of Σ_S^(u_E)(A^T t_2) are non-zero, and

(3) the matrix Λ_S^(u_E)(A^T t_1, A^T t_2) ≜ Σ_S^(u_E)(A^T t_1) Σ_S^(u_E)-1(A^T t_2) has distinct diagonal entries, i.e., no pair of diagonal entries have the same value.

Then A can be uniquely identified, up to scaling and permutation of its columns, by solving the following non-symmetric eigenvalue decomposition problem:

\[
\mathbf{\Sigma}^{(u_{\mathrm{E}})}_{\mathbf{X}}(\mathbf{t}_1)\,\mathbf{\Sigma}^{(u_{\mathrm{E}})-1}_{\mathbf{X}}(\mathbf{t}_2)\,\mathbf{A} = \mathbf{A}\,\mathbf{\Lambda}^{(u_{\mathrm{E}})}_{\mathbf{S}}\big(\mathbf{A}^T\mathbf{t}_1,\mathbf{A}^T\mathbf{t}_2\big). \tag{27}
\]

[A proof is given in (Yeredor, 2000)]

As a result of property (25) of the exponential MT-variance, the following theorem shows that Assumption 3 in Proposition 4 is satisfied almost everywhere (a.e.) if at most one of the components of S is Gaussian.

Theorem 2.
If at most one of the sources is Gaussian, then for t_1 ≠ t_2 the matrix Λ_S^(u_E)(A^T t_1, A^T t_2) has distinct diagonal entries a.e. [A proof is given in Appendix D]

According to (26), Proposition 4, and Theorem 2, the separation matrix B = A^{-1} is the unique matrix that jointly diagonalizes two exponential MT-covariance matrices Σ_X^(u_E)(t_1) and Σ_X^(u_E)(t_2) that satisfy the stated assumptions. The exponential-MTICA algorithm (Yeredor, 2000) is obtained by replacing the MT-functions u_m(·), m = 1, ..., M, in Algorithm 1 with a sequence of exponential MT-functions u_E(·; t_m), m = 1, ..., M. A procedure for choosing the test-points t_m ∈ R^p, m = 1, ..., M, is given in Appendix G.1. Clearly, only two test-points are needed for obtaining a viable estimate of B. However, in order to increase statistical stability and reduce the effect of ill-conditioned empirical MT-covariance matrices it is better to use a sequence of more than two test-points.
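The two-test-point identification result (27) can be exercised directly, without joint diagonalization. The sketch below estimates two empirical exponential MT-covariances via (15)-(17) and solves the non-symmetric eigenvalue problem; the source distributions, sample size, and test-points are arbitrary illustrative assumptions, and this is a toy check rather than the weighted-NOAJD implementation evaluated in Section 8.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy ICA model (5) with two independent non-Gaussian sources.
p, N = 2, 20000
S = rng.exponential(size=(p, N)) - 1.0          # zero mean, unit variance
A = np.array([[1.0, 0.6], [-0.4, 1.0]])         # mixing matrix
X = (A @ S).T                                   # (N, p) sample matrix

def exp_mt_cov(X, t):
    """Empirical exponential MT-covariance, eqs. (15)-(17), u_E(x; t) = exp(t'x)."""
    w = np.exp(X @ t)
    phi = w / w.sum()
    mu = phi @ X
    Xc = X - mu
    return (Xc * phi[:, None]).T @ Xc

t1, t2 = np.array([0.3, -0.2]), np.array([-0.1, 0.25])
# Eq. (27): the columns of A are the eigenvectors of Sigma(t1) Sigma(t2)^{-1}.
M = exp_mt_cov(X, t1) @ np.linalg.inv(exp_mt_cov(X, t2))
_, A_hat = np.linalg.eig(M)
B_hat = np.linalg.inv(np.real(A_hat))           # separation matrix estimate
print(B_hat @ A)                                # ~ scaled permutation matrix
```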
5. Gaussian-MTICA
In this section we parameterize the MT-function u(·; t, τ), with translation parameter t ∈ R^p and width parameter τ ∈ R_+^*, using a Gaussian family of functions. Under the unitary mixing model (6) we show that if at most one of the sources is Gaussian, the mixing matrix U can be uniquely identified via eigenvalue decomposition of a single Gaussian MT-covariance matrix. Based on this result the Gaussian-MTICA algorithm is obtained, which applies OAJD to a sequence of empirical Gaussian MT-covariance matrices.

We define the Gaussian MT-function u_G(·; ·, ·) as

\[
u_{\mathrm{G}}(\mathbf{x};\mathbf{t},\tau) \triangleq \exp\bigg(-\frac{\|\mathbf{x}-\mathbf{t}\|^2}{2\tau^2}\bigg), \tag{28}
\]

where t ∈ R^p and τ ∈ R_+^*. Since u_G(·; ·, ·) is strictly positive and bounded, one can easily verify that the condition (8) is always satisfied. Relations (10) and (13) imply that the MT-function (28) produces a weighted covariance matrix, Σ_X^(u_G)(t, τ), for which the observations are weighted in inverse proportion to the distance ‖x − t‖. This results in a kind of local covariance analysis of X in the vicinity of the test-point t. Notice that the Gaussian MT-function (28) satisfies the conditions in Proposition 3, and therefore the influence function of the empirical Gaussian MT-covariance is bounded. Hence, unlike the empirical exponential MT-covariance, whose influence function is unbounded, the empirical Gaussian MT-covariance is robust to outlying observations.

Similarly to the exponential measure transformation, the following lemma, directly following from (10)-(12) and the definition of the Gaussian MT-function (28), states that the Gaussian family of probability measures is closed under the measure transformation (9) generated by the Gaussian MT-function.
A random vector X is Gaussian under the probability measure P_X if and only if it remains Gaussian under the transformed probability measure Q_X^(u_G) generated by the Gaussian MT-function u_G(·; ·, ·).

This property is used in proving the following theorem, which states a necessary and sufficient condition for Gaussianity of a random variable X based on its Gaussian MT-variance.

Theorem 3.
A random variable X with corresponding probability measure P_X is Gaussian if and only if the Gaussian MT-variance satisfies

\[
\sigma^{(u_{\mathrm{G}})}_{X}(t,\tau) = c \quad \forall t \in (t_0-\epsilon,\, t_0+\epsilon), \tag{29}
\]

where c and ε are some positive constants and t_0 is some arbitrary point in R. [A proof is given in Appendix E]

Hence, similarly to the exponential MT-variance, if a random variable X is non-Gaussian then for any choice of the width parameter τ ∈ R_+^* the Gaussian MT-variance σ_X^(u_G)(t, τ) cannot be constant w.r.t. t over any open interval. This property is used below for proving identifiability of the mixing matrix U.

According to (6), (10), (13) and (28), the MT-covariance of the whitened observation vector Z under Q_Z^(u_G) has the following structure:

\[
\mathbf{\Sigma}^{(u_{\mathrm{G}})}_{\mathbf{Z}}(\mathbf{t},\tau) = \mathbf{U}\,\mathbf{\Sigma}^{(u_{\mathrm{G}})}_{\mathbf{S}}\big(\mathbf{U}^T\mathbf{t},\tau\big)\,\mathbf{U}^T, \tag{30}
\]

where Σ_S^(u_G)(·, ·) is the covariance matrix of S under the transformed probability measure Q_S^(u_G). Since the components of S are mutually independent under P_S, then by Property 4 in Proposition 1 they are mutually independent under Q_S^(u_G), and thus Σ_S^(u_G)(·, ·) must be diagonal. Therefore, assuming that Σ_S^(u_G)(U^T t, τ) has distinct finite diagonal entries, the unitary matrix U can be uniquely identified (up to permutation and sign of its columns) via eigenvalue decomposition of the Gaussian MT-covariance Σ_Z^(u_G)(t, τ).

Based on property (29) of the Gaussian MT-variance, the following theorem states that if at most one of the components of S is Gaussian, then Σ_S^(u_G)(U^T t, τ) has distinct diagonal entries for almost every t ∈ R^p.

Theorem 4.
If at most one of the sources is Gaussian, then the matrix Σ_S^(u_G)(U^T t, τ) has distinct diagonal entries a.e. [A proof is given in Appendix F]

According to (30) and Theorem 4, if at most one of the sources is Gaussian, then for almost every t ∈ R^p the matrix V = U^T is the unique diagonalizing matrix of the Gaussian MT-covariance Σ_Z^(u_G)(t, τ). Thus, the Gaussian-MTICA algorithm is implemented by replacing the MT-functions u_m(·), m = 1, ..., M, in Algorithm 2 with Gaussian MT-functions u_G(·; t_m, τ), m = 1, ..., M, where the width parameter τ ∈ R_+^* is fixed. A procedure for choosing the test-points t_m ∈ R^p, m = 1, ..., M, is given in Appendix G.2. Clearly, only one test-point is needed for estimating V. However, estimation of V based on diagonalization of a single empirical Gaussian MT-covariance has the following drawbacks: (1) for some choices of the translation parameter t the eigen-spectrum of the corresponding Gaussian MT-covariance may be degenerate, i.e., the eigenvalues may not be well separated; (2) a single Gaussian MT-covariance may capture only part of the statistical information about Z necessary to separate the sources effectively. In order to alleviate these drawbacks it may be better to use more than a single test-point.
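The single-test-point identification claim above can be checked numerically: under the unitary mixing model (6), the eigenvectors of one empirical Gaussian MT-covariance recover U up to sign and order. The sketch below assumes exactly unitary mixing of unit-variance sources; the rotation angle, test-point, and width are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Unitary mixing model (6): whitened mixture of two non-Gaussian sources.
p, N = 2, 20000
S = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), N),   # unit-variance uniform
               rng.laplace(0, 1 / np.sqrt(2), N)])        # unit-variance Laplace
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])           # unitary mixing matrix
Z = (U @ S).T                                             # (N, p) whitened samples

# Empirical Gaussian MT-covariance, eqs. (15)-(17) with u_G from (28).
t, tau = np.array([0.8, -0.5]), 1.0
w = np.exp(-np.sum((Z - t) ** 2, axis=1) / (2 * tau ** 2))
phi = w / w.sum()
mu = phi @ Z
Zc = Z - mu
Sigma_G = (Zc * phi[:, None]).T @ Zc

# By (30), the eigenvectors of Sigma_G are the columns of U (up to sign/order).
_, U_hat = np.linalg.eigh(Sigma_G)
print(U_hat.T @ U)   # approximately a signed permutation matrix
```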
6. Comparisons between exponential and Gaussian MTICA
Unlike Gaussian-MTICA, which requires whitening that under the model (5) leads to unitary mixing, exponential-MTICA does not require whitening. Therefore, as illustrated in Subsection 8.3, exponential-MTICA is more robust to cases where the whitened observations are poorly modeled by unitary mixing. Moreover, in Gaussian-MTICA one has to set a width parameter τ that is not required for exponential-MTICA.

On the other hand, unlike the exponential MT-function, the Gaussian MT-function is bounded and isotropically de-emphasizes samples distant from its location parameter. This property leads to the following advantages of Gaussian-MTICA over exponential-MTICA: (1) as illustrated in Subsections 8.1 and 8.2, Gaussian-MTICA is more robust to heavy-tailed distributions and outliers than exponential-MTICA; (2) unlike the exponential MT-covariance, which does not exist for distributions with infinite moment generating function, the Gaussian MT-covariance takes finite values regardless of the underlying probability distribution. Additionally, the Gaussian MT-function has the physical property that it localizes linear dependence over the observation space. Hence, Gaussian-MTICA operates by jointly minimizing the local linear dependencies in the vicinities of the selected set of test-points.

7. Computational complexity

Exponential-MTICA has two major steps: (1) estimation of M exponential MT-covariance matrices, with computational complexity of O(M·N·p²) flops; and (2) NOAJD. The computational complexity of unweighted NOAJD algorithms, such as Pham's (Pham, 2001), FFDIAG (Ziehe et al., 2004), QDIAG (Vollgraf and Obermayer, 2006), and U-WEDGE (Tichavský and Yeredor, 2009), is O(L·M·p³) flops, where L is the number of iterations. Hence, exponential-MTICA with unweighted NOAJD has computational complexity of O(M·N·p² + L·M·p³) flops. When weighted NOAJD is applied using the WEDGE algorithm (Tichavský and Yeredor, 2009) with the weighting policy proposed in (Slapak and Yeredor, 2011), one has to calculate the weights, with computational complexity of O(M·N·p⁴ + M²·p⁴) flops, and to apply weighted NOAJD, requiring O(L·M·p³) flops. Therefore, exponential-MTICA with weighted NOAJD requires O(M·N·p² + (M² + M·N)·p⁴ + L·M·p³) flops.

The Gaussian-MTICA algorithm has three steps: (1) a whitening stage with computational complexity of O(N·p²) flops, (2) estimation of M Gaussian MT-covariance matrices with computational complexity of O(M·N·p²) flops, and (3) OAJD with computational complexity of O(L·M·p³) flops (Flury and Gautschi, 1986; Cardoso and Souloumiac, 1996). Thus, Gaussian-MTICA requires O(M·N·p² + L·M·p³) flops.

Table 1 compares the computational complexity of exponential-MTICA and Gaussian-MTICA to that of other ICA techniques, such as JADE, EIMAX, FICA, EFICA, KGV, and RADICAL. Notice that, similarly to JADE, FICA, EFICA, and EIMAX, the computational complexities of exponential-MTICA and Gaussian-MTICA are linear in the sample size N, which makes them favorable for large data sets. Moreover, one sees that unlike data-based techniques such as EIMAX, FICA, EFICA, KGV, and RADICAL, the iterative part of exponential-MTICA and Gaussian-MTICA has computational complexity that is not affected by the sample size.
8. Numerical examples
In this section the performance of exponential-MTICA and Gaussian-MTICA is compared to the JADE, EIMAX, FICA, EFICA, KGV, and RADICAL algorithms using their publicly available MATLAB code. The JADE, FICA, EFICA, EIMAX, and RADICAL algorithms were used with their default settings. In KGV the Gaussian kernel width parameter was set to σ = 1. All compared algorithms were initialized by the identity separation matrix.

The test-points t_1, ..., t_M in the exponential and Gaussian MTICA algorithms were selected according to the procedures in Appendices G.1 and G.2, respectively. In all simulation examples M = 30 test-points were used. The width parameter of the Gaussian MT-function in the Gaussian-MTICA algorithm was set to τ = 1.

The exponential-MTICA was implemented with the WEDGE algorithm (Tichavský and Yeredor, 2009) to perform weighted non-orthogonal joint diagonalization, as proposed in (Slapak and Yeredor, 2011). The Gaussian-MTICA was implemented with the FG algorithm (Flury and Gautschi, 1986) to perform orthogonal joint diagonalization. In all considered approximate joint diagonalization algorithms, the initial diagonalizing matrix, the maximum number of iterations, and the convergence threshold were set to the identity matrix, 500, and 1e-10, respectively. In all figure legends below, the exponential and Gaussian MTICA algorithms are abbreviated as EMTICA and GMTICA, respectively.

Table 1: Computational complexity of EMTICA, GMTICA, JADE, EIMAX, FICA, EFICA, KGV, and RADICAL. The sample size, dimension, number of iterations, and number of matrices to be approximately diagonalized are denoted by N, p, L, and M, respectively. The rank of an N × N Gram matrix after incomplete Cholesky decomposition in the KGV is denoted by D(N). The numbers of Jacobi angles and data augmentations in RADICAL are denoted by K and R, respectively. Here EMTICA and GMTICA refer to exponential-MTICA and Gaussian-MTICA, respectively. NOAJD stands for non-orthogonal approximate joint diagonalization.

Algorithm | Computational complexity
EMTICA (unweighted NOAJD) | O(M·N·p² + L·M·p³)
EMTICA (weighted NOAJD) | O(M·N·p² + (M² + M·N)·p⁴ + L·M·p³)
GMTICA | O(M·N·p² + L·M·p³)
JADE | O(M·N·p² + L·M·p³)
EIMAX | O(L·N·p²)
FICA | O(L·N·p²)
EFICA | O(L·N·p²)
KGV | O(L·(N·D²(N)·p² + D³(N)·p³))
RADICAL | O(L·K·N·R·log(N·R)·p²)

We used the Amari error (Amari et al., 1996) to measure the deviation of the true separation matrix B from its estimate B̂. The Amari error between two matrices G ∈ R^{p×p} and H ∈ R^{p×p} is defined as

\[
d_A(\mathbf{G},\mathbf{H}) = \frac{1}{2p(p-1)}\sum_{i=1}^{p}\Bigg(\frac{\sum_{j=1}^{p}|\Psi_{i,j}|}{\max_j |\Psi_{i,j}|} - 1\Bigg) + \frac{1}{2p(p-1)}\sum_{j=1}^{p}\Bigg(\frac{\sum_{i=1}^{p}|\Psi_{i,j}|}{\max_i |\Psi_{i,j}|} - 1\Bigg), \tag{31}
\]

where Ψ_{i,j} = [GH^{-1}]_{i,j}. The Amari error is invariant to permutation and scaling of the rows of G and H, and takes values between 0 and 1. Another property is that d_A(G, H) = 0 if and only if G and H are equal up to scaling and permutation of their rows. In addition to the Amari error, some of the trials compared the algorithm run times.

The simulations were carried out using data obtained from the univariate source distributions in Table 2. The sources were translated and scaled to have zero mean and unit variance. In order to avoid ill-conditioned mixing, the generated sources were mixed using randomly generated matrices having condition number between one and two. In all experiments we studied 5-dimensional ICA problems with N = 1000 samples.
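As a convenience for reproducing the error metric, the following is a direct transcription of (31); the invariance check at the end corrupts G by row scaling and permutation, which by construction leaves the error at zero up to floating point. The function name is ours.

```python
import numpy as np

def amari_error(G, H):
    """Amari error (31) between two invertible p x p matrices."""
    Psi = np.abs(G @ np.linalg.inv(H))
    p = Psi.shape[0]
    rows = (Psi.sum(axis=1) / Psi.max(axis=1) - 1.0).sum()
    cols = (Psi.sum(axis=0) / Psi.max(axis=0) - 1.0).sum()
    return (rows + cols) / (2.0 * p * (p - 1))

rng = np.random.default_rng(4)
G = rng.normal(size=(5, 5))
H = np.diag([2.0, -1.0, 0.5, 3.0, 1.5]) @ G[[2, 0, 4, 1, 3], :]  # scale + permute rows
print(amari_error(G, H))   # ~0: invariant to row scaling and permutation
```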
Table 2: Probability distributions used in the simulation examples.

Distribution | Parameters
Uniform | Support [0, 1].
Arcsine | Support [0, 1].
Laplace | Location parameter µ = 0 and scale parameter σ = 1.
Student t-distribution | Degrees of freedom κ = 3.
Beta | Shape parameters α = 2 and β = 2.
Exponential | Rate parameter λ = 1.
Rayleigh | Scale parameter σ = 1.
Gamma | Shape parameter α = 1 and scale parameter σ = 1.
Central chi-squared | Degrees of freedom κ = 4.
Rice | Shape parameter α = 1/…

8.1 Sensitivity to source distribution

In this experiment we studied two types of ICA applications. In the first application, the source distributions are identical. For each of the 10 source distributions in Table 2, we conducted 1000 Monte-Carlo simulations. For each distribution type, box plots of the Amari errors obtained by each algorithm are depicted in Fig. 1. One sees that exponential-MTICA and Gaussian-MTICA are more robust to the source distribution than JADE, EIMAX, FICA and EFICA, with performance similar to the KGV and RADICAL algorithms. One can also observe that exponential-MTICA is more sensitive to heavy-tailed distributions, such as Laplace, exponential, and Student-t, than Gaussian-MTICA.

In the second application, the sources were randomly chosen among the 10 possibilities. A total of 1000 Monte-Carlo simulations were performed. The box plots of the Amari errors obtained by each algorithm are depicted in Fig. 2. Notice that, similarly to the KGV and RADICAL, the exponential-MTICA and Gaussian-MTICA perform better than the JADE, FICA, EFICA and EIMAX algorithms. The average run time of each algorithm is given in Table 3. The run times of exponential-MTICA and Gaussian-MTICA are significantly lower than those obtained by KGV and RADICAL. This is due to lower computational complexity, as indicated by Table 1, and more rapid convergence.
8.2 Robustness to outliers

In this experiment we demonstrate the robustness of the compared algorithms to outliers. We simulated outliers by randomly corrupting up to 25 data points out of the 1000 samples. This was carried out by adding the value +5 or −5, chosen with probability 1/2, to a single component in each of the selected data points. We performed 3000 Monte-Carlo simulations using source distributions chosen uniformly at random from the 10 possible distributions in Table 2. The averaged Amari errors produced by each algorithm are depicted in Fig. 3. One can observe that, as expected, the proposed Gaussian-MTICA method is less sensitive to outliers than the exponential-MTICA. This is due to the boundedness of the Gaussian MT-function, allowing it to de-emphasize outlying samples distant from its location parameter. The RADICAL algorithm exhibits the least sensitivity to outliers. However, this comes at the expense of increased computational complexity, as indicated by Table 1 and the run time analysis in Table 3.
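The corruption scheme just described amounts to a few lines of numpy. The data array below is a placeholder for the mixed observations; the variable names and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(6)

# Corrupt n_out of the N samples: add +5 or -5 (equiprobable) to one
# randomly chosen component of each selected sample, as described above.
N, p, n_out = 1000, 5, 25
X = rng.normal(size=(N, p))                      # stand-in for the mixed data
idx = rng.choice(N, size=n_out, replace=False)   # samples to corrupt
comp = rng.integers(0, p, size=n_out)            # one component per sample
X[idx, comp] += rng.choice([-5.0, 5.0], size=n_out)
```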
[Figure 1 comprises ten box-plot panels, one per source distribution: Uniform, Arcsine, Laplace, Beta, Student, Exponential, Rayleigh, Gamma, Chi-squared, and Rice. In each panel the horizontal axis lists the compared algorithms (JADE, EIMAX, FICA, EFICA, KGV, RADICAL, EMTICA, GMTICA) and the vertical axis shows the Amari error.]
Figure 1: Sensitivity to source distribution. Box plots of Amari errors obtained by the compared algorithms for five-component ICA with identical source distributions. Notice that exponential-MTICA and Gaussian-MTICA are robust to the source distribution, with performance similar to the KGV and RADICAL algorithms. Although the exponential-MTICA, Gaussian-MTICA, KGV and RADICAL algorithms perform similarly well, the exponential-MTICA and Gaussian-MTICA have reduced computational complexity, as indicated by Table 1. Also notice that Gaussian-MTICA is less sensitive to heavy-tailed distributions than exponential-MTICA.
[Figure 2 is a single box-plot panel; the horizontal axis lists the compared algorithms (JADE, EIMAX, FICA, EFICA, KGV, RADICAL, EMTICA, GMTICA) and the vertical axis shows the Amari error.]

Figure 2: Sensitivity to source distribution. Box plots of Amari errors obtained by the compared algorithms for five-component ICA with randomly chosen distributions. Similarly to the KGV and RADICAL, the exponential-MTICA and Gaussian-MTICA perform better than the JADE, FICA, EFICA, and EIMAX algorithms. Although the exponential-MTICA, Gaussian-MTICA, KGV, and RADICAL algorithms perform similarly well, the exponential-MTICA and Gaussian-MTICA have reduced computational complexity, as indicated by Table 1 and the run time analysis in Table 3.

Table 3: Sensitivity to source distribution. Average run times in seconds for five-component ICA problems with randomly selected source distributions. One sees that the run times of exponential-MTICA and Gaussian-MTICA are significantly lower than those obtained by the KGV and RADICAL algorithms.
Algorithm | Run time [sec]
EMTICA | 3.…
GMTICA | 0.…
… | …

8.3 Sensitivity to model mismatch

Here we demonstrate that exponential-MTICA is more robust to model mismatch. To generate model mismatch we used the following noisy linear mixing model:

\[
\mathbf{X} = \mathbf{A}\mathbf{S} + \lambda\mathbf{E}, \tag{32}
\]

where E is a uniformly distributed additive noise vector with statistically independent components having zero mean and unit variance, and λ > 0 is a scaling parameter. The signal-to-noise ratio (SNR) is defined as

\[
\mathrm{SNR} \triangleq \frac{\mathrm{tr}\big[\mathbf{A}\mathbf{A}^T\big]}{p\cdot\lambda^2}. \tag{33}
\]

For each value of SNR, ranging from 0 [dB] to 12 [dB], we performed 3000 Monte-Carlo simulations.

The averaged Amari errors obtained by each algorithm are depicted in Fig. 4. Observe that for SNRs lower than 8 [dB], exponential-MTICA, which does not require whitening, outperforms all other compared algorithms, which are based on whitening and unitary de-mixing. This is due to the fact that for low SNRs the whitened observations significantly deviate from unitary mixing. For higher SNRs one can notice that there is less of a separation performance gap between exponential-MTICA, Gaussian-MTICA, KGV and RADICAL. This is because for higher SNRs the whitened observations admit nearly unitary mixing.
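Under the SNR definition (33), whose exact form is reconstructed here from the degraded source, the noise scale λ realizing a target SNR follows by inversion. The sketch below generates one noisy mixture at 6 dB; the distributions and dimensions mirror the experiment, but the authors' exact data-generation code is not specified beyond this.

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy mixing (32) at a prescribed SNR, using definition (33):
# SNR = tr(A A^T) / (p * lambda^2)  =>  lambda = sqrt(tr(A A^T) / (p * SNR)).
p, N = 5, 1000
A = rng.normal(size=(p, p))
S = rng.laplace(size=(p, N)) / np.sqrt(2.0)             # unit-variance sources
E = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(p, N))   # unit-variance noise

snr_db = 6.0
snr = 10.0 ** (snr_db / 10.0)
lam = np.sqrt(np.trace(A @ A.T) / (p * snr))
X = A @ S + lam * E
```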
[Figure 3 plots the averaged Amari error versus the number of outliers for JADE, EIMAX, FICA, EFICA, KGV, RADICAL, EMTICA, and GMTICA.]

Figure 3: Robustness to outliers. The averaged Amari errors obtained by the compared algorithms versus the number of outliers for five-component ICA with randomly chosen source distributions. One sees that Gaussian-MTICA is less sensitive to outliers than exponential-MTICA. The RADICAL algorithm exhibits the least sensitivity to outliers. However, this comes at the expense of increased computational complexity, as indicated by Table 1 and the run time analysis in Table 3.

[Figure 4 plots the averaged Amari error versus SNR for JADE, EIMAX, FICA, EFICA, KGV, RADICAL, EMTICA, and GMTICA.]

Figure 4: Sensitivity to model mismatch. The averaged Amari errors obtained by the compared algorithms, under the noisy linear mixing model X = AS + λE, versus SNR. Since for low SNRs the whitened observations largely deviate from unitary mixing, exponential-MTICA outperforms all other algorithms that are based on whitening and unitary de-mixing. For high SNRs the whitened observations admit nearly unitary mixing, and thus Gaussian-MTICA, KGV and RADICAL attain performance similar to that of exponential-MTICA.

9. Conclusion

In this paper, a new framework for ICA was proposed that is based on applying a structured transform to the probability distribution of the data. In MTICA the separation matrix is estimated via approximate joint diagonalization of some empirical measure-transformed covariance matrices that are obtained by evaluating the MT-function at different test-points in the parameter space. By specifying the MT-function in the exponential family, the ICA technique proposed in (Yeredor, 2000), called here exponential-MTICA, is obtained. Specification of the MT-function in the Gaussian family resulted in a new ICA algorithm called Gaussian-MTICA. The proposed MTICA approach was tested in simulation examples that illustrated the advantages of exponential-MTICA and Gaussian-MTICA over state-of-the-art algorithms for ICA. It is likely that there exist other classes of MT-functions that may result in other ICA algorithms using the proposed framework.

Appendix A. Proof of Proposition 1

(1) Property 1:
Since φ_u(x) is nonnegative, then by Corollary 2.3.6 in (Athreya and Lahiri, 2006) Q_X^(u) is a measure on 𝒮_𝒳. Furthermore, Q_X^(u)(𝒳) = 1, so that Q_X^(u) is a probability measure on 𝒮_𝒳.

(2) Property 2:

Follows from Definitions 4.1.1 and 4.1.3 in (Athreya and Lahiri, 2006).

(3) Property 3:

According to the definition of φ_u(x) in (10), the strict positivity of u(x), and Property 2, we have that Q_X^(u) is absolutely continuous w.r.t. P_X with strictly positive Radon-Nikodym derivative dQ_X^(u)(x)/dP_X(x) = φ_u(x). Therefore, by Proposition 4.1.2 in (Athreya and Lahiri, 2006) it is implied that P_X is absolutely continuous w.r.t. Q_X^(u) with a strictly positive Radon-Nikodym derivative given by

\[
\frac{dP_{\mathbf{X}}(\mathbf{x})}{dQ^{(u)}_{\mathbf{X}}(\mathbf{x})} = \varphi^{-1}_u(\mathbf{x}). \tag{34}
\]

Using (10) and (34) one can easily verify that φ_u^{-1}(x) = u^{-1}(x) / E[u^{-1}(X); Q_X^(u)].

(4) Property 4:

Let Q_{X_k}^(u) denote the marginal probability measure of Q_X^(u), defined on 𝒮_{𝒳_k}. Additionally, let A_1, ..., A_p denote arbitrary sets in the σ-algebras 𝒮_{𝒳_1}, ..., 𝒮_{𝒳_p}, respectively. Using (7), (9), (10), the assumed statistical independence of X_1, ..., X_p under P_X, and Tonelli's theorem (Folland, 1984):

\[
Q^{(u)}_{\mathbf{X}}(A_1\times\cdots\times A_p) = \int_{A_1\times\cdots\times A_p}\frac{u(\mathbf{x})}{\mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}]}\,dP_{\mathbf{X}}(\mathbf{x}) = \prod_{k=1}^{p}\int_{A_k}\frac{u_k(x_k)}{\mathrm{E}[u_k(X_k);P_{X_k}]}\,dP_{X_k}(x_k), \tag{35}
\]

which implies that

\[
Q^{(u)}_{X_k}(A_k) = Q^{(u)}_{\mathbf{X}}\Big(A_k\times\prod_{i\neq k}\mathcal{X}_i\Big) = \int_{A_k}\frac{u_k(x_k)}{\mathrm{E}[u_k(X_k);P_{X_k}]}\,dP_{X_k}(x_k). \tag{36}
\]

By (35) and (36),

\[
Q^{(u)}_{\mathbf{X}}(A_1\times\cdots\times A_p) = \prod_{k=1}^{p}Q^{(u)}_{X_k}(A_k). \tag{37}
\]

Therefore, since A_1, ..., A_p are arbitrary, X_1, ..., X_p are mutually independent under the transformed probability measure Q_X^(u).

Appendix B. Proof of Proposition 3

The influence function (21) can be written as

\[
\mathrm{IF}_{H_u,P_{\mathbf{X}}}(\mathbf{y}) = \frac{1}{\mathrm{E}[u(\mathbf{X});P_{\mathbf{X}}]}\Big(\big(\sqrt{u(\mathbf{y})}\,\|\mathbf{y}\| + \sqrt{u(\mathbf{y})}\,\|\boldsymbol{\mu}^{(u)}_{\mathbf{X}}\|\big)^2\,\boldsymbol{\psi}(\mathbf{y})\boldsymbol{\psi}^T(\mathbf{y}) - u(\mathbf{y})\,\mathbf{\Sigma}^{(u)}_{\mathbf{X}}\Big), \tag{38}
\]

where

\[
\boldsymbol{\psi}(\mathbf{y}) \triangleq \frac{\mathbf{y}-\boldsymbol{\mu}^{(u)}_{\mathbf{X}}}{\|\mathbf{y}\| + \|\boldsymbol{\mu}^{(u)}_{\mathbf{X}}\|}. \tag{39}
\]

By the triangle inequality, ‖ψ(y)‖ ≤ 1 for all y ∈ R^p, and therefore the matrix term ψ(y)ψ^T(y) is bounded. Thus, the influence function IF_{H_u,P_X}(y) is bounded if u(y) and u(y)‖y‖² are bounded.

Appendix C. Proof of Theorem 1
Define M_X^(h_E)(t) ≜ E[exp(tX); Q_X^(h_E)] as the moment generating function of X under the transformed probability measure Q_X^(h_E) that is associated with the exponential MT-function

\[
h_{\mathrm{E}}(X;t_0) = \exp(t_0 X). \tag{40}
\]

Using (10), (13), (22) and (40) one can verify that

\[
\sigma^{(u_{\mathrm{E}})}_{X}(t_0 + t) = \frac{\partial^2\log M^{(h_{\mathrm{E}})}_{X}(t)}{\partial t^2}. \tag{41}
\]

If the condition in (25) is satisfied, then by (41) and the properties of the moment generating function

\[
M^{(h_{\mathrm{E}})}_{X}(t) = \exp\Big(\mu^{(h_{\mathrm{E}})}_{X}t + \tfrac{1}{2}\sigma^{(h_{\mathrm{E}})}_{X}t^2\Big) \quad \forall t \in (-\epsilon,\epsilon), \tag{42}
\]

where μ_X^(h_E) and σ_X^(h_E) denote the mean and variance of X under Q_X^(h_E), respectively. Since the moment generating function, reduced to any open interval that contains the origin, uniquely determines the distribution (Severini, 2005; DasGupta, 2010), Q_X^(h_E) is a Gaussian probability measure. Hence, by Lemma 1 in Section 4 we conclude that P_X is Gaussian.

Conversely, if P_X is Gaussian then by Lemma 1 in Section 4 the probability measure Q_X^(h_E) is Gaussian, and its corresponding moment generating function M_X^(h_E)(t) must satisfy (42). Therefore, using (41) one obtains σ_X^(u_E)(t) = σ_X^(h_E) for all t ∈ (t_0 − ε, t_0 + ε).

Appendix D. Proof of Theorem 2
We need to prove that the set

\[
\Big\{(\mathbf{t}_1,\mathbf{t}_2)\in\mathbb{R}^p\times\mathbb{R}^p : \mathbf{t}_1\neq\mathbf{t}_2 \textrm{ and } \mathbf{\Lambda}^{(u_{\mathrm{E}})}_{\mathbf{S}}\big(\mathbf{A}^T\mathbf{t}_1,\mathbf{A}^T\mathbf{t}_2\big) \textrm{ does not have distinct diagonal entries}\Big\}
\]

has zero Lebesgue measure. Since the relation µ = A^T t defines a bijective mapping from R^p to R^p, this is equivalent to showing that the Lebesgue measure of the set

\[
D \triangleq \Big\{(\boldsymbol{\mu}_1,\boldsymbol{\mu}_2)\in\mathbb{R}^p\times\mathbb{R}^p : \boldsymbol{\mu}_1\neq\boldsymbol{\mu}_2 \textrm{ and } \mathbf{\Lambda}^{(u_{\mathrm{E}})}_{\mathbf{S}}(\boldsymbol{\mu}_1,\boldsymbol{\mu}_2) \textrm{ does not have distinct diagonal entries}\Big\}
\]

is zero. By the definition of Λ_S^(u_E)(µ_1, µ_2) in Proposition 4, the set D can be written as

\[
D = \bigcup_{j\neq k}^{p} D_{j,k}, \tag{43}
\]

where

\[
D_{j,k} \triangleq \bigg\{(\boldsymbol{\mu}_1,\boldsymbol{\mu}_2)\in\mathbb{R}^p\times\mathbb{R}^p : \boldsymbol{\mu}_1\neq\boldsymbol{\mu}_2 \textrm{ and } \frac{\sigma^{(u_{\mathrm{E}})}_{S_j}(\mu_{1,j})}{\sigma^{(u_{\mathrm{E}})}_{S_j}(\mu_{2,j})} = \frac{\sigma^{(u_{\mathrm{E}})}_{S_k}(\mu_{1,k})}{\sigma^{(u_{\mathrm{E}})}_{S_k}(\mu_{2,k})}\bigg\}, \tag{44}
\]

σ_{S_j}^(u_E)(µ_{i,j}) = [Σ_S^(u_E)(µ_i)]_{j,j}, and µ_{i,j} = [µ_i]_j, i = 1, 2, j = 1, ..., p. Since at most one of the sources is Gaussian, either S_j or S_k must be non-Gaussian for j ≠ k. Let S_k denote the non-Gaussian source. By Theorem 1, the exponential MT-variance σ_{S_k}^(u_E)(µ_{1,k}) is not constant over any open interval. Thus, for almost every (µ_{1,j}, µ_{2,j}, µ_{1,k}, µ_{2,k}) ∈ R⁴ for which the quotients in (44) are finite, the two quotients differ. Hence, the Lebesgue measure of D_{j,k} is zero for any j ≠ k. Therefore, by relation (43) and the sub-additivity of Lebesgue measure, the set D must have zero Lebesgue measure.

Appendix E. Proof of Theorem 3
Define M_X^(h_G)(t) ≜ E[exp(tX); Q_X^(h_G)] as the moment generating function of X under the transformed probability measure Q_X^(h_G) associated with the Gaussian MT-function

\[
h_{\mathrm{G}}(X;t_0,\tau) = \exp\bigg(-\frac{(X-t_0)^2}{2\tau^2}\bigg). \tag{45}
\]

Using (10), (13), (28) and (45) one can verify that

\[
\sigma^{(u_{\mathrm{G}})}_{X}(t_0+t,\tau) = \tau^4\,\frac{\partial^2\log M^{(h_{\mathrm{G}})}_{X}\big(t/\tau^2\big)}{\partial t^2}. \tag{46}
\]

If the condition in (29) is satisfied, then by (46) and the properties of the moment generating function

\[
M^{(h_{\mathrm{G}})}_{X}(t) = \exp\Big(\mu^{(h_{\mathrm{G}})}_{X}t + \tfrac{1}{2}\sigma^{(h_{\mathrm{G}})}_{X}t^2\Big) \quad \forall t\in(-\epsilon',\epsilon'), \tag{47}
\]

where μ_X^(h_G) and σ_X^(h_G) denote the mean and the variance of X under Q_X^(h_G), respectively, and ε' ≜ ε/τ². Since the moment generating function, reduced to any open interval that contains the origin, uniquely determines the distribution (Severini, 2005; DasGupta, 2010), it is implied that Q_X^(h_G) is a Gaussian probability measure. Hence, by Lemma 2 in Section 5 we conclude that P_X is Gaussian.

Conversely, if P_X is Gaussian then by Lemma 2 in Section 5 the probability measure Q_X^(h_G) is Gaussian, and its corresponding moment generating function M_X^(h_G)(t) must satisfy (47). Therefore, using (46) one obtains σ_X^(u_G)(t, τ) = σ_X^(h_G) for all t ∈ (t_0 − ε, t_0 + ε).

Appendix F. Proof of Theorem 4
We need to prove that the set {t ∈ R^p : Σ_S^(u_G)(U^T t, τ) does not have distinct diagonal entries} has zero Lebesgue measure. Since the relation µ = U^T t defines a bijective mapping from R^p to R^p, it is sufficient to show that the Lebesgue measure of the set

\[
D \triangleq \Big\{\boldsymbol{\mu}\in\mathbb{R}^p : \mathbf{\Sigma}^{(u_{\mathrm{G}})}_{\mathbf{S}}(\boldsymbol{\mu},\tau) \textrm{ does not have distinct diagonal entries}\Big\} \tag{48}
\]

is zero. Clearly, the set D can be written as

\[
D = \bigcup_{j\neq k}^{p} D_{j,k}, \tag{49}
\]

where

\[
D_{j,k} \triangleq \Big\{\boldsymbol{\mu}\in\mathbb{R}^p : \sigma^{(u_{\mathrm{G}})}_{S_j}(\mu_j,\tau) = \sigma^{(u_{\mathrm{G}})}_{S_k}(\mu_k,\tau)\Big\}, \tag{50}
\]

σ_{S_j}^(u_G)(µ_j, τ) = [Σ_S^(u_G)(µ, τ)]_{j,j} and µ_j = [µ]_j. Since at most one of the sources is Gaussian, either S_j or S_k must be non-Gaussian for j ≠ k. Let S_k denote the non-Gaussian source. By Theorem 3, the Gaussian MT-variance σ_{S_k}^(u_G)(µ_k, τ) is not constant w.r.t. µ_k over any open interval. Thus, for almost every (µ_j, µ_k) ∈ R² we have σ_{S_j}^(u_G)(µ_j, τ) ≠ σ_{S_k}^(u_G)(µ_k, τ). Hence, the Lebesgue measure of D_{j,k} is zero for any j ≠ k. Therefore, by relation (49) and the sub-additivity of Lebesgue measure, the set D must have zero Lebesgue measure.

Appendix G. Choice of MT-function parameters
G.1 Exponential MTICA
Assume that t_1, ..., t_M are independent samples from some continuous probability distribution. According to Theorem 2, if at most one of the sources is Gaussian, then for any pair (t_m, t_n), m ≠ n, Assumption 3 in Proposition 4 is satisfied with probability 1, which leads to unique identification of A based on the corresponding MT-covariance matrices Σ_X^(u_E)(t_m) and Σ_X^(u_E)(t_n).

Motivated by this result we propose the following procedure, which randomly generates test-points inside the unit l_2-ball:

(1) Generate M i.i.d. samples r_m ∈ R^p, m = 1, ..., M, from the standard normal distribution.

(2) Generate M i.i.d. samples c_m ∈ R, m = 1, ..., M, with uniform distribution on [0, 1].

(3) Set t_m = c_m r_m / ‖r_m‖, m = 1, ..., M.
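The three steps above translate directly to numpy; the helper name is ours.

```python
import numpy as np

def exp_test_points(M, p, rng=np.random.default_rng()):
    """Random test-points in the unit l2-ball, per the G.1 procedure."""
    r = rng.normal(size=(M, p))        # step (1): Gaussian directions
    c = rng.uniform(size=(M, 1))       # step (2): uniform radii on [0, 1]
    return c * r / np.linalg.norm(r, axis=1, keepdims=True)   # step (3)
```

G.2 Gaussian MTICA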
Assume that t_1, ..., t_M are independent samples from some continuous probability distribution. According to Theorem 4, if at most one of the sources is Gaussian, then for any m = 1, ..., M, the Gaussian MT-covariance Σ_Z^(u_G)(t_m, τ) in (30) has distinct eigenvalues with probability 1, which leads to unique identification of the mixing matrix A.

Motivated by this result, and assuming that the data is centered and whitened, we propose to generate M i.i.d. vectors t_m, m = 1, ..., M, such that the components of each t_m are statistically independent with zero mean and unit variance. In all considered examples we used the beta distribution with identical shape parameters α = β = 3.
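A sketch of this generator follows. Beta(3, 3) has mean 1/2 and variance 1/28, so the draws are shifted and scaled to zero mean and unit variance; this standardization step is our inference from the stated zero-mean, unit-variance requirement, and the helper name is ours.

```python
import numpy as np

def gauss_test_points(M, p, rng=np.random.default_rng()):
    """Test-points with i.i.d. zero-mean, unit-variance components, per G.2."""
    b = rng.beta(3.0, 3.0, size=(M, p))   # Beta(3, 3): mean 1/2, variance 1/28
    return (b - 0.5) * np.sqrt(28.0)      # standardize to zero mean, unit variance
```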
References

S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems, pages 757–763, 1996.

K. B. Athreya and S. N. Lahiri. Measure Theory and Probability Theory. Springer-Verlag, 2006.

F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

A. D. Back and A. S. Weigend. A first application of independent component analysis to extracting structure from stock returns. International Journal of Neural Systems, 8(4):473–484, 1997.

A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

A. Belouchrani, K. Abed-Meraim, J. F. Cardoso, and E. Moulines. A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing, 45(2):434–444, 1997.

R. Boscolo, H. Pan, and V. P. Roychowdhury. Independent component analysis based on nonparametric density estimation. IEEE Transactions on Neural Networks, 15(1):55–65, 2004.

V. Calhoun, G. Pearlson, and T. Adalı. Independent component analysis applied to fMRI data: a generative model for validating results. The Journal of VLSI Signal Processing, 37(2):281–291, 2004.

J. F. Cardoso. High-order contrasts for independent component analysis. Neural Computation, 11(1):157–192, 1999.

J. F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. Radar and Signal Processing, IEE Proceedings F, 140(6):362–370, 1993.

J. F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161–164, 1996.

P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

D. R. Cox and D. V. Hinkley. Theoretical Statistics. Chapman and Hall, New York, NY, 1974.

A. DasGupta. Fundamentals of Probability: A First Course. Springer-Verlag, 2010.

A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1999.

J. Eriksson and V. Koivunen. Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11(7):601–604, 2004.

E. M. Fadaili, N. Thirion-Moreau, and E. Moreau. Non-orthogonal joint diagonalization/zero-diagonalization for source separation based on time-frequency distributions. IEEE Transactions on Signal Processing, 55(5):1673–1687, 2007.

B. N. Flury and W. Gautschi. An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form. SIAM Journal on Scientific and Statistical Computing, 7(1):169–184, 1986.

G. B. Folland. Real Analysis. John Wiley and Sons, 1984.

F. R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.

A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483–1492, 1997.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.

A. Kagan, Y. Linnik, and C. Rao. Characterization Problems in Mathematical Statistics. Wiley, 1973.

Z. Koldovský, P. Tichavský, and E. Oja. Efficient variant of algorithm FastICA for independent component analysis attaining the Cramér-Rao lower bound. IEEE Transactions on Neural Networks, 17(5):1265–1277, 2006.

E. G. Learned-Miller and J. W. Fisher. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271–1295, 2003.

T. W. Lee, M. Girolami, and T. J. Sejnowski. Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural Computation, 11(2):417–441, 1999.

X. L. Li and T. Adalı. Independent component analysis by entropy bound minimization. IEEE Transactions on Signal Processing, 58(10):5151–5164, 2010.

X. L. Li and X. D. Zhang. Nonorthogonal joint diagonalization free of degenerate solution. IEEE Transactions on Signal Processing, 55(5):1803–1814, 2007.

C. J. Liu, T. S. Lee, and C. C. Chiu. Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems, 47(2):115–125, 2009.

S. Makeig, A. J. Bell, T. P. Jung, and T. J. Sejnowski. Independent component analysis of electroencephalographic data. Advances in Neural Information Processing Systems, 8:145–151, 1996.

D. T. Pham. Joint approximate diagonalization of positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 22(4):1136–1152, 2001.

D. T. Pham and P. Garat. Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Transactions on Signal Processing, 45(7):1712–1725, 1997.

T. Qin, X. Guan, W. Li, and P. Wang. Monitoring abnormal traffic flows based on independent component analysis. In Proceedings of the IEEE International Conference on Communications (ICC'09), pages 1–5, 2009.

T. A. Severini. Elements of Distribution Theory. Cambridge University Press, 2005.

H. Shen, S. Jegelka, and A. Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 57(9):3498–3511, 2009.

A. Slapak and A. Yeredor. "Weighting for more": Enhancing characteristic-function based ICA with asymptotically optimal weighting. Signal Processing, 91(8):2016–2027, 2011.

P. Tichavský and A. Yeredor. Fast approximate joint diagonalization incorporating weight matrices. IEEE Transactions on Signal Processing, 57(3):878–891, 2009.

K. Todros and A. O. Hero. On measure transformed canonical correlation analysis. IEEE Transactions on Signal Processing, 60(9):4570–4585, 2012.

K. Todros and J. Tabrikian. Blind separation of independent sources using Gaussian mixture model. IEEE Transactions on Signal Processing, 55(7):3645–3658, 2007.

K. Todros and J. Tabrikian. QML-based joint diagonalization of positive-definite Hermitian matrices. IEEE Transactions on Signal Processing, 58(9):4656–4673, 2010.

R. Vollgraf and K. Obermayer. Quadratic optimization for simultaneous matrix diagonalization. IEEE Transactions on Signal Processing, 54(9):3270–3278, 2006.

A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.

A. Yeredor. Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. IEEE Transactions on Signal Processing, 50(7):1545–1553, 2002.

L. Zhao, Y. Jiawei, Y. Junliang, and C. Ting. An independent component analysis based multiuser MIMO downlink transmission scheme. In Proceedings of the IEEE International Conference on Networks Security, Wireless Communications and Trusted Computing (NSWCTC'09), volume 1, pages 374–377, 2009.

A. Ziehe, P. Laskov, G. Nolte, and K. R. Müller. A fast algorithm for joint diagonalization with nonorthogonal transformations and its application to blind source separation. Journal of Machine Learning Research, 5:777–800, 2004.