Notion of information and independent component analysis
A Preprint
Una Radojicic
Vienna University of Technology [email protected]
Klaus Nordhausen
Vienna University of Technology [email protected]
Hannu Oja
University of Turku [email protected]
June 22, 2020

ABSTRACT
Partial orderings and measures of information for continuous univariate random variables, with special roles of the Gaussian and uniform distributions, are discussed. The information measures and measures of non-Gaussianity, including third and fourth cumulants, are generally used as projection indices in the projection pursuit approach to independent component analysis. The connections between information, non-Gaussianity and statistical independence in the context of independent component analysis are discussed in detail.

Keywords: Dispersion · entropy · kurtosis · partial orderings

In the engineering literature independent component analysis (ICA) [12, 23] is often described as a search for the uncorrelated linear combinations of the original variables that maximize non-Gaussianity. The estimation procedure then usually has two steps. First, the vector of principal components is found and the components are standardized to have zero means and unit variances, and second, the vector is further rotated so that the new components maximize a selected measure of non-Gaussianity. It is then argued that the components obtained in this way are made as independent as possible or that they display the components with maximal information. [12], for example, give a heuristic argument that, according to the central limit theorem, weighted sums of independent non-Gaussian random variables are closer to Gaussian than the original ones. In this paper, we discuss and clarify the somewhat vague connections between non-Gaussianity, independence and notions of information in the context of independent component analysis.

In Section 2 we first introduce descriptive measures for location, dispersion, skewness and kurtosis of univariate random variables with some discussion of corresponding partial orderings. In this part of the paper we assume that the considered univariate random variable x has a finite mean E(x) and variance Var(x), cumulative distribution function F and continuously differentiable probability density function f. Skewness, kurtosis and other cumulants of the standardized variable (x − E(x))/√Var(x) are often used to measure non-Gaussianity of the distribution of x. The most popular measures of statistical information are the differential entropy H(f) = −∫ f(x) log f(x) dx and the Fisher information in the location model, that is, J(f) = ∫ f(x) [f′(x)/f(x)]² dx. These and other information measures with related partial orderings and their use as measures of non-Gaussianity are discussed in the later part of Section 2.

The multivariate independent components model is discussed in Section 3. It is then assumed that, for a p-variate random vector x, there is a linear operator A ∈ ℝ^{p×p} such that Ax has independent components. Under certain assumptions, the projection pursuit approach can be used to find the rows of A one by one, and various information measures as well as cumulants have been used as projection indices. In Section 3 the connections between non-Gaussianity, independence and information in this context are discussed in detail. The paper ends with some final remarks in Section 4.
We consider a continuous random variable x with finite mean E(x), finite variance Var(x), density function f and cumulative distribution function F. Location, dispersion, skewness and kurtosis are often considered by defining corresponding measures or functionals for these properties. Location and dispersion measures, written T(x) and S(x), are functions of the distribution of x and are defined as follows.

Definition 2.1.
1. T(x) ∈ ℝ is a location measure if T(ax + b) = aT(x) + b for all a, b ∈ ℝ.
2. S(x) ∈ ℝ₊ is a dispersion measure if S(ax + b) = |a| S(x) for all a, b ∈ ℝ.

Clearly, if x is symmetric around µ, then T(x) = µ for all location measures T. For squared dispersion measures S, [10] considered the concepts of additivity, subadditivity and superadditivity. These concepts appear to be crucial in developing tools for independent component analysis and are defined as follows.

Definition 2.2.
Let S be a squared dispersion measure.
1. S is additive if S(x + y) = S(x) + S(y) for all independent x and y.
2. S is subadditive if S(x + y) ≤ S(x) + S(y) for all independent x and y.
3. S is superadditive if S(x + y) ≥ S(x) + S(y) for all independent x and y.

The mean E(x) and the variance Var(x) are the most important and popular location and squared dispersion measures. It is well known that Var(x + y) = Var(x) + Var(y) for independent x and y, and that E(x + y) = E(x) + E(y) is true even for dependent x and y. These additivity properties are highly important in certain applications and in fact characterize the mean and variance among continuous measures as follows.

Theorem 2.1.
1. Let a location measure T be additive and continuous at N(0, 1), that is, z_n →_d z ∼ N(0, 1) implies that T(z_n) → T(z) = 0. Then T(x) = E(x) for all x with finite second moments.
2. Let a squared dispersion measure S be additive and continuous at N(0, 1), that is, z_n →_d z ∼ N(0, 1) implies that S(z_n) → S(z) > 0. Then S(x) = S(z) Var(x) for all x with finite second moments.

Comparison of two different location measures T_1 and T_2 and two dispersion measures S_1 and S_2 provides measures of skewness and kurtosis as

Sk(x) = (T_2(x) − T_1(x)) / S_1(x)   and   Ku(x) = S_2(x) / S_1(x).

Classical measures of skewness and kurtosis proposed in the literature can be written in this way.
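For illustration, the following minimal sketch evaluates such ratio-based skewness and kurtosis measures on a sample; the particular choices T_1 = median, T_2 = mean, S_1 = standard deviation and S_2 = fourth root of the fourth central moment are illustrative only and are not prescribed by the theory above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)        # a right-skewed test distribution

# Illustrative choices: T1 = median, T2 = mean (locations);
# S1 = standard deviation, S2 = fourth root of the fourth central moment (dispersions).
def sk(v):
    return (np.mean(v) - np.median(v)) / np.std(v)

def ku(v):
    return np.mean((v - v.mean()) ** 4) ** 0.25 / np.std(v)

y = -3.0 * x + 7.0                                   # an affine transformation
print(sk(x), sk(y))   # equal in absolute value, opposite sign: Sk(ax+b) = sgn(a) Sk(x)
print(ku(x), ku(y))   # equal: Ku(ax+b) = Ku(x)
```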
Note that both measures are affine invariant in the sense that

Sk(ax + b) = sgn(a) Sk(x)   and   Ku(ax + b) = Ku(x).

If x has a symmetric distribution, then Sk(x) = 0. In the literature, kurtosis measures are thought to measure the peakedness and/or the heaviness of the tails of the density of x but, as we will see in Section 2.3, Ku(x) as defined here may be a global measure of deviation from normality and has also been used as an affine invariant information measure for some special choices of the dispersion measures S_1 and S_2.

The moment and cumulant generating functions, defined as

E[e^{tx}] = Σ_{k=0}^∞ µ_k t^k / k!   and   log E[e^{tx}] = Σ_{k=0}^∞ κ_k t^k / k!,

respectively, generate classical measures, i.e., the moments E(x) = µ_1(x) and Var(x) = µ_2(x − µ_1(x)) and the cumulants κ_3(x_st) and κ_4(x_st), where x_st = (x − E(x))/√Var(x). The cumulants κ_k, k = 1, 2, ..., are additive as log E[e^{tx}] is additive, and κ_k^{2/k}(x − E(x)), k = 2, 3, ..., are subadditive squared dispersion measures, which follows from the Minkowski inequality, see [10]. Another class of measures is given by the quantiles q_u = F^{-1}(u), 0 < u < 1, with corresponding measures such as

q_{1/2},   q_{1−u} − q_u,   (q_u + q_{1−u} − 2 q_{1/2}) / (q_{1−u} − q_u),   and   (q_{1−u} − q_u) / (q_{1−v} − q_v),   0 < u < v < 1/2.

These quantile-based measures provide robust alternatives to the moment-based measures. To our knowledge, they however lack the additivity properties stated in Definition 2.2, which makes them unsuitable for use in independent component analysis.
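A small sketch of the quantile-based measures computed from a sample; the values of u and v below are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # a right-skewed sample

u, v = 0.05, 0.25                      # illustrative choices with 0 < u < v < 1/2
q = lambda p: np.quantile(x, p)

loc   = q(0.5)                                              # q_{1/2}
scale = q(1 - u) - q(u)                                     # q_{1-u} - q_u
skew  = (q(u) + q(1 - u) - 2 * q(0.5)) / (q(1 - u) - q(u))  # quantile skewness
kurt  = (q(1 - u) - q(u)) / (q(1 - v) - q(v))               # quantile kurtosis

print(loc, scale, skew, kurt)
```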
An alternative strategy to consider the properties of distributions is to define partial orderings for location, dispersion, skewness and kurtosis. For continuous x and y with cumulative distribution functions F and G, write Δ(x) = G^{-1}(F(x)) − x. The function Δ(x) is called a shift function, as x + Δ(x) has the distribution of y. The transformation x ↦ x + Δ(x) is also known as the (univariate) Monge-Kantorovich optimal transport map. Using the function Δ we can naturally define the following partial orderings [3, 4, 36, 25].

1. Location ordering: Δ is positive.
2. Dispersion ordering: Δ is increasing.
3. Skewness ordering: Δ is convex.
4. Kurtosis ordering: Δ is concave-convex.

[3, 4, 25] then stated that, in addition to the affine equivariance and invariance properties, the measures of location, dispersion, skewness and kurtosis should be monotone with respect to the corresponding orderings. For finding monotone measures in the dispersion case, for example, Δ is increasing if and only if

E[C(x − E(x))] ≤ E[C(y − E(y))]   for all convex C,

which is also called the dilation order. It implies for example that the measures (E[|x − E(x)|^k])^{1/k}, k ≥ 1, are monotone dispersion measures.

Consider a discrete random variable with k possible values ('alphabets') with probabilities listed in p = (p_1, ..., p_k). Write p_(1) ≤ ... ≤ p_(k) for the ordered probabilities. It is sometimes presumed that a distribution p is informative if it can provide 'surprises' with very small p_i's. On the other hand, one often claims that p is informative if the result of the experiment is known with a high probability, that is, if only one or a few values have high p_i's. These somewhat naive characterizations suggest the following well-known partial ordering for discrete distributions [19].

Definition 2.3.
Majorization: p ≺ q if

Σ_{i=1}^j p_(i) ≥ Σ_{i=1}^j q_(i),   j = 1, ..., k,

and then p is said to be majorized by q.

Majorization is nothing but a dispersion ordering (and a dilation order) for the discrete distributions with k equiprobable values p_1, ..., p_k in [0, 1] with mean 1/k. Then, according to [27],

p ≺ q ⇔ p = qL with some doubly stochastic matrix L ⇔ Σ_{i=1}^k C(p_i) ≤ Σ_{i=1}^k C(q_i) for all continuous convex C.

A doubly stochastic matrix L is a matrix with non-negative elements such that all row sums and all column sums are one. The doubly stochastic operator L is in fact a convex combination of permutations; p is obtained from q by this 'smoothing' and is therefore less informative. Further, for all p, (1/k, ..., 1/k) ≺ p ≺ (0, ..., 0, 1), and, for simple mixtures,

p ≺ q ⇒ p ≺ λp + (1 − λ)q ≺ q,   0 ≤ λ ≤ 1.
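These equivalent characterizations can be checked numerically; the sketch below builds a doubly stochastic matrix as a convex combination of a few arbitrarily chosen permutation matrices and verifies both the partial-sum and the convex-function criteria.

```python
import numpy as np

rng = np.random.default_rng(2)

q = np.array([0.6, 0.25, 0.1, 0.05])                 # an 'informative' probability vector
k = len(q)

# A doubly stochastic matrix as a convex combination of permutation matrices.
perm_mats = [np.eye(k)[list(perm)] for perm in ([0, 1, 2, 3], [1, 0, 3, 2], [3, 2, 1, 0])]
w = rng.dirichlet(np.ones(len(perm_mats)))
L = sum(wi * P for wi, P in zip(w, perm_mats))

p = q @ L                                            # smoothing: p = qL should satisfy p ≺ q

# Check p ≺ q via partial sums of the increasingly ordered probabilities ...
print(np.all(np.cumsum(np.sort(p)) >= np.cumsum(np.sort(q)) - 1e-12))   # True

# ... and via a continuous convex function, e.g. C(t) = t^2.
print(np.sum(p**2) <= np.sum(q**2) + 1e-12)                             # True
```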
We can now give the following.

Definition 2.4.
Let p = (p_1, ..., p_k) list the probabilities of the k possible values of a discrete random variable, that is, p_1, ..., p_k ∈ [0, 1] and Σ_{i=1}^k p_i = 1. A measure M(p) is an information measure if it is monotone with respect to majorization.

Note that, as (p_1, ..., p_k) ≺ (p_(1), ..., p_(k)) ≺ (p_1, ..., p_k), the definition implies that information measures are invariant under permutations of the probabilities in (p_1, ..., p_k). The equivalent conditions for majorization then suggest quantities such as

H(p) = −Σ_{i=1}^k log(p_i) p_i,   H*(p) = Σ_{i=1}^k p_i²   and   H**(p) = p_(k),

and −H, H* and H** are monotone information measures that easily extend to continuous and multivariate cases. Shannon's entropy [30], −Σ_{i=1}^k log(p_i) p_i, is often seen as a measure of the ability to compress the data (e.g., a lower bound for the expected number of bits needed to store the data).
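A minimal numerical illustration of the monotonicity of these quantities along a majorization chain; the probability vectors are illustrative examples.

```python
import numpy as np

def H(p):        # Shannon entropy (decreasing in the majorization order)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def H_star(p):   # sum of squared probabilities (increasing)
    return np.sum(p**2)

def H_2star(p):  # largest probability (increasing)
    return np.max(p)

uniform  = np.full(4, 0.25)
middling = np.array([0.4, 0.3, 0.2, 0.1])
point    = np.array([1.0, 0.0, 0.0, 0.0])

# uniform ≺ middling ≺ point, so -H, H* and H** should all increase along this chain.
for p in (uniform, middling, point):
    print(-H(p), H_star(p), H_2star(p))
```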
Consider next a continuous random variable x with a continuously differentiable probability density function f and finite variance Var(x). The three measures from the discrete case extend straightforwardly in the continuous case to

H(x) = −E[log f(x)] = −∫_{−∞}^{∞} f(x) log f(x) dx,
H*(x) = E[f(x)] = ∫_{−∞}^{∞} f(x)² dx,   and
H**(x) = sup_x f(x) = f(x_mode), if the mode x_mode exists.

The Fisher information in the location model f(· − µ) at µ = 0, given by

J(x) = ∫_{−∞}^{∞} f(x) [f′(x)/f(x)]² dx,

is also often used as an information measure [16].

The measure H(x) is popular in the literature and known as the differential entropy. Under certain restrictions, the measure has the following maximizers [7]. For distributions on ℝ with a fixed variance, H(x) is maximized if x has a normal distribution. For distributions on ℝ₊ with a fixed mean, H(x) is maximized at the exponential distribution. For distributions on a finite interval, H(x) is maximized at the uniform distribution on that interval. Note that, in Bayesian analysis, these three distributions are often used as priors that reflect 'total ignorance'.

We next show that the three straightforward extensions H, H* and H** as well as the Fisher information J provide squared dispersion measures as in Definition 2.1 but with an interesting additional invariance property. First note that the measures are invariant under location shifts of the distribution but not under rescaling of the variable. Recall that information as stated for discrete distributions is invariant under permutations of the probabilities in (p_1, ..., p_k). All permutations consist of successive pairwise exchanges of two probabilities. In the continuous case, similar elemental probability density transformations may be constructed as follows. For all a < a + Δ < b < b + Δ and a density function f, write

f_{a,b,Δ}(x) = f(x) for x ∈ ℝ \ ([a, a + Δ] ∪ [b, b + Δ]),
f_{a,b,Δ}(x) = f(b + (x − a)) for x ∈ [a, a + Δ],
f_{a,b,Δ}(x) = f(a + (x − b)) for x ∈ [b, b + Δ].

The transformation allows the manipulation of the properties of the distribution in many ways. It can for example be used to move some probability mass from the centre of the distribution to the tails and in this way to manipulate the variance and the kurtosis of the distribution. As far as we know, this transformation has not been discussed in the literature. It is surprising that the information measures H, H*, H** and J provide dispersion measures which are invariant under these transformations.
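The invariance can be checked numerically; the sketch below implements the swap on a grid for the standard normal density (the grid size and the intervals are illustrative choices) and shows that H is unchanged while the variance is not.

```python
import numpy as np

# Standard normal density on a fine grid; all integrals below are Riemann sums.
dx = 1e-4
x = np.arange(-10.0, 10.0, dx)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Swap the density values of f on [a, a + D] with those on [b, b + D].
a, b, D = 0.0, 2.5, 0.5
m = int(round(D / dx))
ia = slice(np.searchsorted(x, a), np.searchsorted(x, a) + m)
ib = slice(np.searchsorted(x, b), np.searchsorted(x, b) + m)
g = f.copy()
g[ia], g[ib] = f[ib], f[ia]

def entropy(dens):                     # H = -∫ dens log(dens) dx
    mask = dens > 0
    return -np.sum(dens[mask] * np.log(dens[mask])) * dx

def variance(dens):
    mean = np.sum(x * dens) * dx
    return np.sum((x - mean) ** 2 * dens) * dx

print(entropy(f), entropy(g))          # equal: H is invariant under the swap
print(variance(f), variance(g))        # different: mass has moved towards the tail
```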
Theorem 2.2. The entropy power e^{2H(x)} and the measures [H*(x)]^{−2}, [H**(x)]^{−2} and [J(x)]^{−1} are squared dispersion measures that are invariant under the transformations f → f_{a,b,Δ}. The measures e^{2H(x)} and [J(x)]^{−1} are superadditive.
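The superadditivity of the entropy power (the entropy power inequality) can be illustrated with two independent uniform variables, whose sum is triangular; the closed-form entropies below are taken from scipy.stats and e^{2H} is used as in Theorem 2.2.

```python
import numpy as np
from scipy import stats

# x ~ U(0,1) and y ~ U(0,1) independent; x + y has a symmetric triangular
# distribution on (0, 2). Differential entropies are available in closed form.
h_x = stats.uniform(loc=0, scale=1).entropy()          # = 0
h_y = stats.uniform(loc=0, scale=1).entropy()          # = 0
h_sum = stats.triang(c=0.5, loc=0, scale=2).entropy()  # = 0.5

N = lambda h: np.exp(2 * h)                            # entropy power e^{2H}
print(N(h_sum), ">=", N(h_x) + N(h_y))                 # ≈ 2.718 >= 2.0
```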
We now further discuss the properties of the dispersion measures in Theorem 2.2 and, to find affine invariant information measures, consider the ratios of the variance to these squared dispersion measures. The ratio of the variance to the entropy power, that is, Var(x) e^{−2H(x)}, is minimized at the normal distribution [7]. In a neighbourhood of a normal distribution the negative entropy −H(x) has an interesting approximation using third and fourth cumulants. [13] showed that the negative differential entropy for the density f(x) = ϕ(x)(1 + ε(x)), where ϕ is the density of N(0, 1) and ε is a well-behaved "small" function that satisfies E[ε(z) z^k] = 0, z ∼ N(0, 1), k = 0, 1, 2, can be approximated, up to an additive constant, by

(1/2) ∫ ϕ(x) ε(x)² dx ≈ (κ_3²(x) + (1/4) κ_4²(x)) / 12.

Next, [H*(x)]^{−2} is a squared dispersion measure, and therefore [H*(x)]² Var(x) provides an affine invariant information measure. For symmetric distributions, it preserves the concave-convex kurtosis ordering of van Zwet, and 12 [H*(x)]² Var(x) is in fact the efficiency of the Wilcoxon rank test with respect to the t-test. Also, for symmetric distributions, 4 [H**(x)]² Var(x) is a kurtosis measure in the van Zwet sense and simultaneously the efficiency of the sign test with respect to the t-test. We also mention that, if Q(x) = (E[f(F^{-1}(u))/ϕ(Φ^{-1}(u))])² with u ∼ U(0, 1), then [Q(x)]^{−1} is a squared dispersion measure and Q(x) Var(x) is the efficiency of the van der Waerden test with respect to the t-test in the symmetric case. By the Chernoff-Savage theorem, it attains its minimum 1 at the normal distribution. See [6, 9].
Finally, the information measure Var(x) J(x) ≥ 1 is minimized at the normal distribution. In the location estimation problem in the symmetric case, Var(x) J(x) is also the asymptotic relative efficiency of the maximum likelihood estimate of the symmetry centre with respect to the sample mean [29].

We next outline how to construct partial orderings for information in the univariate continuous case. Let first x be a continuous random variable with density f on (0, 1). If m(y) = µ{u : f(u) > y}, where µ is the Lebesgue measure, then the function f↓(u) = sup{y : m(y) > u}, u ∈ (0, 1), provides the decreasing rearrangement of f. Note that any density function on (0, 1) can be approximated by a simple density function f(x) = Σ_{i=1}^k α_i χ_{A_i}(x), where α_1 < α_2 < ··· < α_k and A_1, ..., A_k are disjoint Lebesgue-measurable sets on (0, 1), and χ_A is the characteristic function of the set A. Then

m(y) = Σ_{i=1}^k β_i χ_{B_i}(y)   and   f↓(u) = Σ_{i=1}^k α_i χ_{[β_{i+1}, β_i)}(u),

where β_i = Σ_{j=i}^k µ(A_j), B_i = [α_{i−1}, α_i) for i = 1, 2, ..., k, and α_0 = β_{k+1} = 0. For a better insight, see Figure 1. For more details and examples, see e.g. [15].

Figure 1: Simple function f (left), its distribution function m (middle) and decreasing rearrangement f↓ (right).

Using the decreasing rearrangement we can give the following definitions.

Definition 2.5.
Let f and g be density functions on the interval (0, 1). Then g has more information than f, written f ≺ g, if

∫_0^u f↓(v) dv ≤ ∫_0^u g↓(v) dv   for all u ∈ (0, 1).

Definition 2.6.
Let F_(0,1) be the set of density functions f on the interval (0, 1). Then M_(0,1) : F_(0,1) → ℝ is an information measure if it is monotone with respect to the partial ordering in Definition 2.5.

The distribution with minimum information is the uniform distribution on (0, 1). Information measures are easily found, see [28], as f ≺ g if and only if

∫_0^1 C(f(u)) du ≤ ∫_0^1 C(g(u)) du   for all continuous convex functions C.

[28] also discusses how to construct linear operators L for which f = Lg ≺ g when f ≺ g.
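A grid-based sketch of the decreasing rearrangement and of the ordering in Definition 2.5, using the uniform density and an arbitrarily chosen U-shaped density as the comparison pair.

```python
import numpy as np

# Densities on (0, 1) represented by their values on a midpoint grid.
n = 10_000
u = (np.arange(n) + 0.5) / n
du = 1.0 / n

f = np.ones(n)                        # the uniform density on (0, 1)
g = 12 * (u - 0.5) ** 2               # a U-shaped density, intuitively more 'informative'

def dec_rearrange(dens):
    return np.sort(dens)[::-1]        # decreasing rearrangement on the grid

# f ≺ g: the cumulative integrals of the decreasing rearrangements are ordered.
Ff = np.cumsum(dec_rearrange(f)) * du
Gg = np.cumsum(dec_rearrange(g)) * du
print(np.all(Ff <= Gg + 1e-12))       # True: the uniform density has minimum information

# A monotone information measure via the convex C(t) = t^2 (the analogue of H*).
print(np.sum(f**2) * du, np.sum(g**2) * du)   # 1.0 and 1.8
```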
Consider next a continuous random variable x on ℝ with pdf f. To find a location and scale-free version of the density, [31] proposed the transformation

f(x), x ∈ ℝ   →   f*(u) = f(F^{-1}(u)) / H*(x),   u ∈ (0, 1).

Then f*, called the probability density quantile (pdQ), is a probability density function on (0, 1) which is invariant under linear transformations of the original variable x [31]. It is also true that, for a given f*, the original f is known up to location and scale. Using this density transformation, the definition of an invariant information measure for densities on ℝ can be given as follows.

Definition 2.7.
Let F_ℝ be a set of density functions f on ℝ and let M_(0,1) : F_(0,1) → ℝ be an information measure for distributions on (0, 1). Then M_ℝ : f → M_(0,1)(f*) is an information measure in the set F_ℝ.

Note that M_ℝ is not an extension of M_(0,1), meaning that f ∈ F_(0,1) does not imply that M_ℝ(f) = M_(0,1)(f); M_ℝ is invariant under rescaling of f while M_(0,1) is not. Applying Definition 2.7 and choosing the convex functions C(u) = −log(u) and C(u) = log(u) u, we get location and scale invariant information measures for f such as

exp{−∫ log(f*(u)) du} = e^{H(x)} H*(x)   and   exp{∫ log(f*(u)) f*(u) du} = e^{−H(f²/H*(x))/2} [H*(x)]^{−1/2},

which attain their minimum at the uniform distribution and are invariant under the transformations f → f_{a,b,Δ}. For more details see e.g. [32].
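The first of these measures, e^{H(x)} H*(x) = exp{−∫ log f*(u) du}, can be approximated by quadrature directly from the pdQ; the sketch below (using scipy.stats distributions) also illustrates the scale invariance and the minimum at the uniform distribution.

```python
import numpy as np
from scipy import stats

def pdQ_measure(dist, n=200_000):
    """exp{-∫ log f*(u) du} = e^{H(x)} H*(x), via a midpoint rule on u in (0, 1)."""
    u = (np.arange(n) + 0.5) / n           # midpoint grid on (0, 1)
    fq = dist.pdf(dist.ppf(u))             # f(F^{-1}(u))
    f_star = fq / fq.mean()                # pdQ: divide by H*(x) = ∫ f(F^{-1}(u)) du
    return np.exp(-np.log(f_star).mean())  # e^{H(x)} H*(x)

print(pdQ_measure(stats.lognorm(s=1.0)))              # > 1
print(pdQ_measure(stats.lognorm(s=1.0, scale=10.0)))  # same value: scale invariance
print(pdQ_measure(stats.uniform()))                   # 1.0, the minimum
```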
To replace the transformation f → f* by a transformation to densities on (0, 1) for which the minimum information is attained at any chosen density g, one can use the following adjustment.

Theorem 2.3. Let x and y be random variables on ℝ with probability density functions f and g and cumulative distribution functions F and G, respectively. Then

(f : g)(u) = f(G^{-1}(u)) / g(G^{-1}(u))

is a density function on (0, 1), and its negative differential entropy −H(f : g) ≥ 0 is the Kullback-Leibler (KL) divergence between the distributions of x and y.

Let again x have a density f and let ϕ and Φ be the pdf and the cdf of the normal distribution with mean E(x) and variance Var(x). Then one can show, using similar arguments as in [31], that

(f : ϕ)(u) = f(Φ^{-1}(u)) / ϕ(Φ^{-1}(u)),   u ∈ (0, 1),

is a location and scale-free density, and information measures in Definition 2.6 applied to the set of densities f̃ = f : ϕ attain their minimums when f has a normal distribution. A collection of information measures is given by ∫ C(f̃(u)) du with continuous and convex functions C, and then we get for example again

exp{2 ∫ log(f̃(u)) f̃(u) du} = (2πe) e^{−2H(x)} Var(x).

We next provide examples of the probability density functions f, f* and f̃ when f is the density of the Gaussian, Laplace, Lognormal and Uniform distributions. Also a mixture of two Gaussian distributions, denoted by GMM(µ_1, µ_2, σ_1, σ_2, w), is considered, with density w ϕ_{µ_1,σ_1}(x) + (1 − w) ϕ_{µ_2,σ_2}(x), 0 ≤ w ≤ 1. Figure 2 then shows the impact of the transformations f → f* and f → f̃ in these cases.

Distribution        e^{2H(f)}  e^{2H(f*)}  e^{2H(f̃)}  [H*(f)]^{-2}  [H*(f*)]^{-2}  [H*(f̃)]^{-2}
N(0,1)                 17.079      0.824      1.000        12.566         0.750         1.000
Laplace(1)             29.556      0.680      0.887        16.000         0.719         0.783
Lognormal(0,1)         17.079      0.642      0.308         7.622         0.537         0.186
U(0,1)                  1.000      1.000      0.703         1.000         1.000         0.567
GMM(·)                100.000      0.862      0.855        78.000         0.792         0.756

Table 1: The power entropy and the [H*]^{-2} measure for some continuous distributions and their transformations.
Table 1 provides for the same distributions the power entropies e^{2H(·)} and the measures [H*(·)]^{-2} for f, f* and f̃. Note that the information measures applied to f are not invariant under rescaling of x, as opposed to those applied to f* and f̃. For example, for the settings used in Table 1, the normal and lognormal densities have the same power entropy just by accident, and the equality is not generally true.

Figure 2: Comparison of f, f* and f̃ for five distributions (Normal, Uniform, Laplace, Lognormal, Gaussian mixture).
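Entries of this kind can be computed by quadrature; the sketch below, which uses the matched normal density for f̃ as described above, approximately recovers the first and fourth columns of the N(0,1) and U(0,1) rows and the corresponding transformed values.

```python
import numpy as np
from scipy import stats

def table_row(dist, n=400_000):
    """e^{2H} and [H*]^{-2} for f, f* and f~ of a scipy distribution (quadrature sketch)."""
    u = (np.arange(n) + 0.5) / n                    # midpoint grid on (0, 1)
    fx = dist.pdf(dist.ppf(u))                      # f(F^{-1}(u)); note du = f(x) dx

    def measures_01(g):                             # e^{2H} and [H*]^{-2} for g on (0, 1)
        g_log_g = np.where(g > 0, g * np.log(np.where(g > 0, g, 1.0)), 0.0)
        return np.exp(-2 * np.mean(g_log_g)), np.mean(g**2) ** -2

    H_f, H_star_f = -np.mean(np.log(fx)), np.mean(fx)    # H(f) = -E[log f(x)], H*(f) = E[f(x)]
    f_star = fx / H_star_f                                # the pdQ

    phi = stats.norm(loc=dist.mean(), scale=dist.std())  # matched normal for f~ = f : phi
    y = phi.ppf(u)
    f_tilde = dist.pdf(y) / phi.pdf(y)

    return (np.exp(2 * H_f), H_star_f ** -2), measures_01(f_star), measures_01(f_tilde)

print(table_row(stats.norm()))     # ≈ (17.08, 12.57), (0.82, 0.75), (1.00, 1.00)
print(table_row(stats.uniform()))  # ≈ (1.00, 1.00), (1.00, 1.00), (0.70, 0.57)
```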
For a better understanding of the measures, we illustrate the behaviour of e^{2H(·)} and [H*(·)]^{-2} in the GMM model with four fixed and one varying parameter, each in turn. In Figure 3 both information measure curves are plotted in the same figures to compare the shapes of the curves as well as the locations of their extreme values. The curves for f̃ with varying location and scale seem natural, as minimum information is attained when the GMM gets "closer" to the normal distribution. The results for f* and varying location seem strange in the sense that one would expect decreasing behaviour of both measures as the distance between the means increases, as is the case for f̃, while the result for f in all three cases could simply be explained by a decrease in information resulting from an increase in the overall variance of the mixture. Moreover, e^{2H(·)} and [H*(·)]^{-2} seem to behave almost proportionally in all cases. In the cases of f* and f̃, where majorization is well defined, such behaviour is indeed expected, as the reciprocals of both e^{2H(·)} and [H*(·)]^{-2} are information measures for both f* and f̃. However, further investigations into this matter will be conducted in the future.

In this section we consider multivariate random variables. For a p-variate random vector x with finite second moments, the mean vector and covariance matrix are E(x) ∈ ℝ^p and Cov(x) ∈ ℝ^{p×p}, respectively. Let Cov(x) = UDU′ be the eigenvector-eigenvalue decomposition of the covariance matrix.
Then Cov(x)^{−1/2} := UD^{−1/2}U′, and x_st = Cov(x)^{−1/2}(x − E(x)) standardizes x, that is, E(x_st) = 0 and Cov(x_st) = I_p. The set of p × r, r ≤ p, matrices with orthonormal columns is denoted by O^{p×r}. Thus U ∈ O^{p×r} implies U′U = I_r. The set of p × p diagonal matrices with positive diagonal elements is denoted by D^{p×p}. If U ∈ O^{p×p} and D ∈ D^{p×p}, then x → Ux and x → Dx, x ∈ ℝ^p, are a rotation operator and a componentwise rescaling operator, respectively. Let A ∈ ℝ^{p×q} be a matrix with rank r ≤ min{p, q}. Then the linear operator A may be written as (singular value decomposition, SVD) A = UDV′ = Σ_{i=1}^r d_i u_i v_i′, where U = (u_1, ..., u_r) ∈ O^{p×r}, V = (v_1, ..., v_r) ∈ O^{q×r}, and D ∈ D^{r×r}.

Let x be a p-variate vector with a full-rank covariance matrix Cov(x). We say that x has a spherical distribution if there exists a µ such that (x − µ) ∼ U(x − µ) for all orthogonal U.
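A minimal sketch of this standardization for a data matrix, using the eigendecomposition of the sample covariance matrix; the sample and the mixing matrix are an arbitrary toy example.

```python
import numpy as np

rng = np.random.default_rng(3)

# A sample from a correlated trivariate distribution (rows are observations).
A_true = rng.normal(size=(3, 3))
X = rng.exponential(size=(5_000, 3)) @ A_true.T

# Standardization x_st = Cov(x)^{-1/2} (x - E(x)) with Cov(x)^{-1/2} = U D^{-1/2} U'.
mean = X.mean(axis=0)
evals, U = np.linalg.eigh(np.cov(X, rowvar=False))
cov_inv_sqrt = U @ np.diag(evals ** -0.5) @ U.T
X_st = (X - mean) @ cov_inv_sqrt          # cov_inv_sqrt is symmetric

print(np.round(X_st.mean(axis=0), 3))            # ≈ 0
print(np.round(np.cov(X_st, rowvar=False), 3))   # ≈ identity matrix
```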
In the following we first define the elliptical and independent components distributions (see for example [23, 24] for more details).

Definition 3.1. Let x ∈ ℝ^p be a p-variate random vector.
1. x has an elliptical distribution if there exists a nonsingular A ∈ ℝ^{p×p} such that Ax has a spherical distribution.
2. x has an independent components distribution if there exists a nonsingular A ∈ ℝ^{p×p} such that Ax has independent components.

We next provide some results on how the matrix A can be found in the different cases.

Theorem 3.1.
Let x be a p-variate random vector with a full-rank covariance matrix Cov(x) = UDU′. Then we have the following.
1. [VD^{−1/2}U′]x has uncorrelated components for all orthogonal V.
2. If x has an elliptical distribution, [VD^{−1/2}U′]x has a spherical distribution for all orthogonal V.
3. If x has an independent components distribution, [VD^{−1/2}U′]x has independent components for some choice(s) of orthogonal V.
4. If x has both an elliptical distribution and an independent components distribution, then [VD^{−1/2}U′]x has independent Gaussian components for all orthogonal V, that is, x has a multivariate Gaussian distribution.

Let x have an independent components distribution such that z = Ax + b is standardized (E(z) = 0 and Cov(z) = I_p) and has independent components. Theorem 3.1 then implies that A = V′Cov(x)^{−1/2}, where the rotation matrix V can be chosen as V = (V_1, V_2), separating non-Gaussian independent components in V_1′Cov(x)^{−1/2}x and Gaussian independent components in V_2′Cov(x)^{−1/2}x. Note that V is only unique up to right multiplication by an orthogonal matrix.
A generally accepted strategy is to find V_1 = (v_1, ..., v_q) ∈ O^{p×q} such that the components of V_1′x_st are as 'non-Gaussian as possible'. The Gaussian part V_2′Cov(x)^{−1/2}x is thought to be just the noise part and, for the other components, it is argued that the sum of independent random variables is 'more Gaussian' than the original variables. The noise interpretation of the Gaussian part may be motivated by the following. A random vector has a multivariate normal distribution if and only if all linear combinations of the marginal variables have univariate normal distributions, that is, there are no 'interesting' directions. The normal distribution is the only distribution for which all third and higher cumulants are zero. As seen before, a Gaussian distribution is the distribution with the poorest information among distributions with the same variance (highest entropy, smallest Fisher information). For a thorough discussion of Gaussian distributions, see [14].

Figure 3: Power entropy and [H*]^{-2} for different GMMs when one parameter varies at a time: (a) a location parameter µ varies, (b) a scale parameter σ varies, (c) the mixing weight w varies. The left vertical axis corresponds to the power entropy and the right axis to [H*]^{-2}. The left panel gives the measures for f, the middle for f* and the right for f̃.
Let D(x) then be the projection index, i.e., the functional that is used to measure non-Gaussianity. In the one-by-one projection pursuit approach the first direction v_1 (v_1′v_1 = 1) maximizes D(v_1′x_st), the second direction v_2 is orthogonal to v_1 (v_2′v_2 = 1, v_2′v_1 = 0) and maximizes D(v_2′x_st), and so on. After finding v_1, ..., v_{j−1}, we optimize the Lagrangian function

L(v; λ_{j1}, ..., λ_{jj}) = D(v′x_st) − λ_{jj}(v′v − 1) − Σ_{i=1}^{j−1} λ_{ji} v′v_i.

Then v_j solves the (estimating) equation (I_p − Σ_{i=1}^{j−1} v_i v_i′) T(v) = (T(v)′v) v, where T(v) = ∂D(v′x_st)/∂v. From the computational point of view, this suggests a fixed-point algorithm. The estimating equation also provides a way to find the limiting distribution of the estimate, since the estimate is obtained when the theoretical multivariate distribution is replaced by the empirical one. See for example [20, 21, 22] and references therein for more details.
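As an illustration of the one-by-one approach, the following sketch runs a kurtosis-based fixed-point iteration with deflation on standardized data; this is a FastICA-type variant using the fourth cumulant as projection index, not the exact estimators studied in [20, 21, 22], and the toy sources and mixing below are arbitrary.

```python
import numpy as np

def deflation_ica(z, q, n_iter=200, seed=0):
    """One-by-one projection pursuit on standardized data z (n x p) with a
    kurtosis-based fixed-point update and Gram-Schmidt deflation."""
    rng = np.random.default_rng(seed)
    n, p = z.shape
    V = np.zeros((p, q))
    for j in range(q):
        v = rng.normal(size=p)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            s = z @ v
            v_new = (z * (s**3)[:, None]).mean(axis=0) - 3 * v   # E[z (v'z)^3] - 3v
            v_new -= V[:, :j] @ (V[:, :j].T @ v_new)             # deflate against found rows
            v_new /= np.linalg.norm(v_new)
            converged = np.abs(v_new @ v) > 1 - 1e-10
            v = v_new
            if converged:
                break
        V[:, j] = v
    return V

# Toy example: mix two non-Gaussian sources, standardize, then recover the rotation.
rng = np.random.default_rng(1)
s = np.column_stack([rng.uniform(-1, 1, 20_000), rng.exponential(1.0, 20_000)])
x = s @ rng.normal(size=(2, 2)).T                   # mixed observations
x_c = x - x.mean(axis=0)
evals, U = np.linalg.eigh(np.cov(x_c, rowvar=False))
z = x_c @ U @ np.diag(evals ** -0.5) @ U.T          # standardized data x_st
V = deflation_ica(z, q=2)
# Each recovered component should correlate (up to sign) with exactly one source.
print(np.round(np.corrcoef((z @ V).T, s.T), 2))
```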
The following questions naturally arise. How should one choose the projection index D(x) to find the independent components? Are the independent components provided by the most informative directions, as has often been stated in the literature? These questions are partially answered by the following.

Theorem 3.2. Let z = Ax + b = (z_1, ..., z_p)′ be the vector of standardized independent components.
1. Let D(x) be a subadditive squared dispersion measure. Then D(v′x_st) ≤ max_j D(z_j) for all v with v′v = 1.
2. Let D(x) be a superadditive squared dispersion measure. Then D(v′x_st) ≥ min_j D(z_j) for all v with v′v = 1.

Based on Theorem 3.2 and the discussion above, we can now end the paper with the following conclusions. If D(x) is subadditive, then it can be used as a projection index. For example, the cumulants κ_{2k+1}^{2/(2k+1)}(x) and κ_{2k+2}^{2/(2k+2)}(x), k = 1, 2, ..., when calculated for standardized distributions, provide squared dispersion measures that are subadditive. Therefore they can be used as projection indices. For superadditive D(x), the functional (D(x))^{−1} is a valid projection index as (D(v′x_st))^{−1} ≤ max_j (D(z_j))^{−1}. As seen before, the entropy power e^{2H(x)} and the inverse of the Fisher information, J^{−1}(x), are superadditive squared dispersion measures. Note that in both cases D(v′x_st) is in fact a ratio of two squared dispersion functions, and the projection index measures deviation from Gaussianity using a skewness, kurtosis or information measure. As mentioned in Section 3.3, (κ_3²(x) + (1/4) κ_4²(x))/12 provides an approximation of the negative differential entropy in a neighbourhood of the Gaussian distribution and is a valid projection index as well. For further discussion, see [10]. Note also that one of the most popular ICA procedures in the engineering community, the so-called fastICA, uses a projection index of the form D(x) = |E[C(x)]|, where C is a function such that E[C(z)] = 0 if z ∼ N(0, 1). Examples of valid choices of C are C(z) = z³ and C(z) = z⁴ − 3, providing again the third and fourth cumulants, respectively.

The usage of various information criteria is popular in independent component analysis. The connections between notions of information and statistical independence and the special role of the Gaussian distribution were discussed in detail in this paper. We also introduced new ideas and partial orderings for information which utilize transformed location and scale-free probability density functions. In independent component analysis with unknown marginal densities, the estimation of the value of the adapted information measure in a given direction is highly challenging, and it has to be done repeatedly when applying the fixed-point algorithm to find the correct direction. Substantial research is therefore still needed for these tools to be of practical value.
Proof of Theorem 2.1.
Let x_1, ..., x_n be a random sample from the distribution of x with mean E(x) and variance Var(x). By the central limit theorem,

z_n = (1/√n) Σ_{i=1}^n (x_i − E(x))/√Var(x) →_d z ∼ N(0, 1).

Therefore, by additivity and affine equivariance,

T(z_n) = √(n/Var(x)) (T(x) − E(x)) → 0   and   S(z_n) = S(x)/Var(x) → S(z),
and the result follows. For similar results in the multivariate case, see [34].
Proof of Theorem 2.2.
The invariances of the measures H(x), H*(x), H**(x) and J(x) under location shifts f(x) → f(x + b) and sign change f(x) → f(−x), as well as under f → f_{a,b,Δ}, follow easily from their definitions and from the definition of the Riemann integral. We therefore only have to consider the rescaling f(x) → (1/a) f(x/a) with a > 0. Then

H(ax) = −∫ (1/a) f(x/a) log((1/a) f(x/a)) dx = −∫ f(x) log((1/a) f(x)) dx = H(x) + log(a),

and therefore e^{2H(ax)} = a² e^{2H(x)}. In a similar way one can show that [H*(ax)]^{−2} = a² [H*(x)]^{−2}. Also easily [H**(ax)]^{−2} = a² [H**(x)]^{−2}. As f′(x) → (1/a²) f′(x/a), one also easily shows that [J(ax)]^{−1} = a² [J(x)]^{−1}. Thus all four measures are scale equivariant and therefore squared dispersion measures.

Proof of Theorem 2.3. f : g is indeed a density function since it is trivially nonnegative and ∫ (f : g)(u) du = ∫ f(G^{-1}(u))/g(G^{-1}(u)) du = ∫_{−∞}^{∞} f(x) dx = 1 with the substitution x = G^{-1}(u). Similarly,

−H(f : g) = ∫ (f : g)(u) log((f : g)(u)) du = ∫_{−∞}^{∞} f(x) log(f(x)/g(x)) dx = D(f || g).

Proof of Theorem 3.1. (1) Let V be orthogonal. As Cov([VD^{−1/2}U′]x) = VD^{−1/2}U′ Cov(x) UD^{−1/2}V′ = VV′ = I_p, the components of [VD^{−1/2}U′]x are uncorrelated. (2) Assume that Ax is spherical with A = VCW′ rescaled so that
Cov(Ax) = I_p. As A Cov(x) A′ = I_p, Cov(x) = (A′A)^{−1} and WC^{−2}W′ = UDU′. Therefore W = U and C = D^{−1/2}, and we can conclude that [VD^{−1/2}U′]x is spherical for any orthogonal V. (If x is spherical then Vx is spherical for all orthogonal V.) (3) Let Ax with A = VCW′ have independent and standardized components so that
Cov(Ax) = I_p. As in (2), A must be VD^{−1/2}U′, but now for some V only. (It is not true that if x has independent standardized components then Vx has independent components for any choice of V.) (4) Based on (2) and (3), there exists an A = VD^{−1/2}U′ such that Ax has a spherical distribution with independent components. Then, by the Maxwell-Herschel theorem, Ax has a multivariate normal distribution. For a proof of the Maxwell-Herschel theorem, see e.g. Proposition 4.11 in [5].

Proof of Theorem 3.2.
Let z = Ax + b = (z_1, ..., z_p)′ be a vector of standardized independent components. By Theorem 3.1, z = Vx_st with some orthogonal V. If u′u = 1 then also (Vu)′(Vu) = 1, and therefore

D(u′x_st) = D((Vu)′z) ≤ Σ_i (Vu)_i² D(z_i) ≤ max_j D(z_j)

for a subadditive squared dispersion measure D, and

D(u′x_st) = D((Vu)′z) ≥ Σ_i (Vu)_i² D(z_i) ≥ min_j D(z_j)

for a superadditive squared dispersion measure D.

The work of KN has been supported by the Austrian Science Fund (FWF) Grant number P31881-N32.
References

[1] A. R. Barron: Entropy and the central limit theorem. Ann. Probab. (1986), 336–342.
[2] A. J. Bell, T. J. Sejnowski: An information-maximization approach to blind separation and blind deconvolution. Neural Comput. (1995), 1129–1159.
[3] P. J. Bickel, E. L. Lehmann: Descriptive statistics for nonparametric models II: Location. Ann. Stat. (1975), 1045–1069.
[4] P. J. Bickel, E. L. Lehmann: Descriptive statistics for nonparametric models III: Dispersion. Ann. Stat. (1976), 1139–1158.
[5] M. Bilodeau, D. Brenner: Theory of multivariate statistics. Springer Texts in Statistics. New York: Springer (1999).
[6] H. Chernoff, I. R. Savage: Asymptotic normality and efficiency of certain nonparametric test statistics. Ann. Math. Stat. (1958), 972–994.
[7] T. Cover, J. Thomas: Elements of information theory. New York: John Wiley & Sons (1991).
[8] L. Faivishevsky, J. Goldberger: ICA based on a smooth estimation of the differential entropy. Advances in Neural Information Processing Systems (2008), 433–440.
[9] J. L. Hodges, E. L. Lehmann: The efficiency of some nonparametric competitors of the t-test. Ann. Math. Stat. (1956), 324–335.
[10] P. J. Huber: Projection pursuit. Ann. Stat. (1985), 435–475.
[11] A. Hyvärinen: New approximations of differential entropy for independent component analysis and projection pursuit. Advances in Neural Information Processing Systems (1998), 273–279.
[12] A. Hyvärinen, J. Karhunen, E. Oja: Independent component analysis. John Wiley & Sons, New York (2001).
[13] M. C. Jones, R. Sibson: What is projection pursuit? J. R. Stat. Soc., Ser. A 150 (1987), 1–36.
[14] K. Kim, G. Shevlyakov: Why Gaussianity? IEEE Signal Process. Mag. (2008), 102–113.
[15] E. Kristiansson: Decreasing Rearrangement and Lorentz L(p,q) Spaces (Thesis). Department of Mathematics of the Lulea University of Technology (2002). Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.111.1244&rep=rep1&type=pdf.
[16] S. Kullback: Information theory and statistics. John Wiley and Sons, Inc., New York; Chapman and Hall, Ltd., London (1959).
[17] E. G. Learned-Miller, J. W. Fisher III: ICA using spacings estimates of entropy. J. Mach. Learn. Res. (2004), 1271–1295.
[18] B. G. Lindsay, W. Yao: Fisher information matrix: A tool for dimension reduction, projection pursuit, independent component analysis, and more. Can. J. Statistics (2012), 712–730.
[19] A. W. Marshall, I. Olkin: Inequalities: Theory of majorization and its applications. Mathematics in Science and Engineering, Vol. 143. Academic Press, New York (1979).
[20] J. Miettinen, K. Nordhausen, H. Oja, S. Taskinen: Deflation-based FastICA with adaptive choices of nonlinearities. IEEE Trans. Signal Process. (2014), 5716–5724.
[21] J. Miettinen, K. Nordhausen, H. Oja, S. Taskinen: Fourth moments and independent component analysis. Stat. Sci. (2015), 372–390.
[22] J. Miettinen, K. Nordhausen, H. Oja, S. Taskinen, J. Virta: The squared symmetric fastICA estimator. Signal Process. (2017), 402–411.
[23] K. Nordhausen, H. Oja: Independent component analysis: a statistical perspective. Wiley Interdiscip. Rev. Comput. Stat. (2018), e1440.
[24] K. Nordhausen, H. Oja: Robust nonparametric inference. Annu. Rev. Stat. Appl. (2018), 473–500.
[25] H. Oja: On location, scale, skewness and kurtosis of univariate distributions. Scand. J. Stat. (1981), 154–68.
[26] E. Parzen: Quantile probability and statistical data modeling. Statist. Sci. (2004), 652–662.
[27] J. E. Pečarić, F. Proschan, Y. L. Tong: Convex functions, partial orderings, and statistical applications. Mathematics in Science and Engineering, 187. Academic Press, Boston (1992).
[28] J. V. Ryff: On the representation of doubly stochastic operators. Pacific J. Math. (1963), 1379–1386.
[29] R. Serfling: Asymptotic relative efficiency in estimation. International Encyclopedia of Statistical Science. Springer (2011), 68–72.
[30] C. E. Shannon: A mathematical theory of communication. The Bell System Technical Journal (1948), 379–423.
[31] R. G. Staudte: The shapes of things to come: probability density quantiles. Statistics (2017), 782–800.
[32] R. G. Staudte, A. Xia: Divergence from, and convergence to, uniformity of probability density quantiles. Entropy (2018), Paper No. 317, 10.
[33] V. Vigneron, C. Jutten: Fisher information in source separation problems. Lecture Notes in Computer Science (2004), 168–176.
[34] J. Virta: On characterizations of the covariance matrix (2018). Preprint available as arXiv:1810.01147.
[35] J. Virta, K. Nordhausen: On the optimal nonlinearities for Gaussian mixtures in FastICA. Latent Variable Analysis and Signal Separation. 13th International Conference, LVA/ICA 2017, Grenoble, France, February 21-23, 2017, Proceedings, 427–437.
[36] W. R. van Zwet: Convex transformations of random variables. Mathematical Centre Tracts, Mathematisch Centrum, Amsterdam (1964).