Notion of information and independent component analysis
A Preprint
Una Radojicic
Vienna University of Technology [email protected]
Klaus Nordhausen
Vienna University of Technology [email protected]
Hannu Oja
University of Turku [email protected]
June 22, 2020

ABSTRACT
Partial orderings and measures of information for continuous univariate random variables, with special roles of the Gaussian and uniform distributions, are discussed. The information measures and measures of non-Gaussianity, including third and fourth cumulants, are generally used as projection indices in the projection pursuit approach to independent component analysis. The connections between information, non-Gaussianity and statistical independence in the context of independent component analysis are discussed in detail.

Keywords: Dispersion · entropy · kurtosis · partial orderings

In the engineering literature independent component analysis (ICA) [12, 23] is often described as a search for the uncorrelated linear combinations of the original variables that maximize non-Gaussianity. The estimation procedure then usually has two steps. First, the vector of principal components is found and the components are standardized to have zero means and unit variances, and second, the vector is further rotated so that the new components maximize a selected measure of non-Gaussianity. It is then argued that the components obtained in this way are made as independent as possible or that they display the components with maximal information. [12], for example, give a heuristic argument that, according to the central limit theorem, weighted sums of independent non-Gaussian random variables are closer to Gaussian than the original ones. In this paper, we discuss and clarify the somewhat vague connections between non-Gaussianity, independence and notions of information in the context of independent component analysis.

In Section 2 we first introduce descriptive measures for location, dispersion, skewness and kurtosis of univariate random variables with some discussion of corresponding partial orderings. In this part of the paper we assume that the considered univariate random variable x has a finite mean E(x) and variance Var(x), cumulative distribution function F and continuously differentiable probability density function f. Skewness, kurtosis and other cumulants of the standardized variable (x − E(x))/√Var(x) are often used to measure non-Gaussianity of the distribution of x. The most popular measures of statistical information are the differential entropy H(f) = −∫ f(x) log f(x) dx and the Fisher information in the location model, that is, J(f) = ∫ f(x) [f′(x)/f(x)]² dx. These and other information measures with related partial orderings and their use as measures of non-Gaussianity are discussed in the later part of Section 2.

The multivariate independent components model is discussed in Section 3. It is then assumed that, for a p-variate random vector x, there is a linear operator A ∈ ℝ^{p×p} such that Ax has independent components. Under certain assumptions, the projection pursuit approach can be used to find the rows of A one by one, and various information measures as well as cumulants have been used as projection indices. In Section 3 the connections between non-Gaussianity, independence and information in this context are discussed in detail. The paper ends with some final remarks in Section 4.
We consider a continuous random variable x with finite mean E(x), finite variance Var(x), density function f and cumulative distribution function F. Location, dispersion, skewness and kurtosis are often considered by defining corresponding measures or functionals for these properties. Location and dispersion measures, written T(x) and S(x), are functions of the distribution of x and are defined as follows.

Definition 2.1.
1. T(x) ∈ ℝ is a location measure if T(ax + b) = aT(x) + b for all a, b ∈ ℝ.
2. S(x) ∈ ℝ₊ is a dispersion measure if S(ax + b) = |a| S(x) for all a, b ∈ ℝ.

Clearly, if x is symmetric around µ, then T(x) = µ for all location measures T. For squared dispersion measures S, [10] considered the concepts of additivity, subadditivity and superadditivity. These concepts appear to be crucial in developing tools for independent component analysis and are defined as follows.

Definition 2.2.
Let S be a squared dispersion measure.
1. S is additive if S(x + y) = S(x) + S(y) for all independent x and y.
2. S is subadditive if S(x + y) ≤ S(x) + S(y) for all independent x and y.
3. S is superadditive if S(x + y) ≥ S(x) + S(y) for all independent x and y.

The mean E(x) and the variance Var(x) are the most important and popular location and squared dispersion measures. It is well known that Var(x + y) = Var(x) + Var(y) for independent x and y, and that E(x + y) = E(x) + E(y) is true even for dependent x and y. These additivity properties are highly important in certain applications and in fact characterize the mean and variance among continuous measures as follows.

Theorem 2.1.
1. Let a location measure T be additive and continuous at N(0, 1), that is, z_n →_d z ∼ N(0, 1) implies that T(z_n) → T(z) = 0. Then T(x) = E(x) for all x with finite second moments.
2. Let a squared dispersion measure S be additive and continuous at N(0, 1), that is, z_n →_d z ∼ N(0, 1) implies that S(z_n) → S(z) > 0. Then S(x) = S(z) Var(x) for all x with finite second moments.

Comparison of two different location measures T_1 and T_2 and two dispersion measures S_1 and S_2 provides measures of skewness and kurtosis as

Sk(x) = (T_2(x) − T_1(x)) / S_1(x)   and   Ku(x) = S_2(x) / S_1(x).

Classical measures of skewness and kurtosis proposed in the literature can be written in this way.
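For illustration, the following minimal sketch evaluates such ratio-based skewness and kurtosis measures on a sample; the particular choices T_1 = median, T_2 = mean, S_1 = standard deviation and S_2 = fourth root of the fourth central moment are illustrative only and are not prescribed by the theory above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)        # a right-skewed test distribution

# Illustrative choices: T1 = median, T2 = mean (locations);
# S1 = standard deviation, S2 = fourth root of the fourth central moment (dispersions).
def sk(v):
    return (np.mean(v) - np.median(v)) / np.std(v)

def ku(v):
    return np.mean((v - v.mean()) ** 4) ** 0.25 / np.std(v)

y = -3.0 * x + 7.0                                   # an affine transformation
print(sk(x), sk(y))   # equal in absolute value, opposite sign: Sk(ax+b) = sgn(a) Sk(x)
print(ku(x), ku(y))   # equal: Ku(ax+b) = Ku(x)
```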
Note that both measures are affine invariant in the sense that

Sk(ax + b) = sgn(a) Sk(x)   and   Ku(ax + b) = Ku(x).

If x has a symmetric distribution, then Sk(x) = 0. In the literature, kurtosis measures are thought to measure the peakedness and/or the heaviness of the tails of the density of x but, as we will see in Section 2.3, Ku(x) as defined here may be a global measure of deviation from normality and has also been used as an affine invariant information measure for some special choices of the dispersion measures S_1 and S_2.

The moment and cumulant generating functions, defined as

E[e^{tx}] = Σ_{k=0}^∞ µ_k t^k / k!   and   log E[e^{tx}] = Σ_{k=0}^∞ κ_k t^k / k!,

respectively, generate classical measures, i.e., the moments E(x) = µ_1(x) and Var(x) = µ_2(x − µ_1(x)) and the cumulants κ_3(x_st) and κ_4(x_st), where x_st = (x − E(x))/√Var(x). The cumulants κ_k, k = 1, 2, ..., are additive as log E[e^{tx}] is additive, and κ_k^{2/k}(x − E(x)), k = 2, 3, ..., are subadditive squared dispersion measures, which follows from the Minkowski inequality, see [10]. Another class of measures is given by the quantiles q_u = F^{-1}(u), 0 < u < 1, with corresponding measures such as

q_{1/2},   q_{1−u} − q_u,   (q_u + q_{1−u} − 2 q_{1/2}) / (q_{1−u} − q_u),   and   (q_{1−u} − q_u) / (q_{1−v} − q_v),   0 < u < v < 1/2.

These quantile-based measures provide robust alternatives to the moment-based measures. To our knowledge, they however lack the additivity properties stated in Definition 2.2, which makes them unsuitable for use in independent component analysis.
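A small sketch of the quantile-based measures computed from a sample; the values of u and v below are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # a right-skewed sample

u, v = 0.05, 0.25                      # illustrative choices with 0 < u < v < 1/2
q = lambda p: np.quantile(x, p)

loc   = q(0.5)                                              # q_{1/2}
scale = q(1 - u) - q(u)                                     # q_{1-u} - q_u
skew  = (q(u) + q(1 - u) - 2 * q(0.5)) / (q(1 - u) - q(u))  # quantile skewness
kurt  = (q(1 - u) - q(u)) / (q(1 - v) - q(v))               # quantile kurtosis

print(loc, scale, skew, kurt)
```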
An alternative strategy to consider the properties of distributions is to define partial orderings for location, dispersion, skewness and kurtosis. For continuous x and y with cumulative distribution functions F and G, write Δ(x) = G^{-1}(F(x)) − x. The function Δ(x) is called a shift function, as x + Δ(x) has the distribution of y. The transformation x ↦ x + Δ(x) is also known as the (univariate) Monge-Kantorovich optimal transport map. Using the function Δ we can naturally define the following partial orderings [3, 4, 36, 25].

1. Location ordering: Δ is positive.
2. Dispersion ordering: Δ is increasing.
3. Skewness ordering: Δ is convex.
4. Kurtosis ordering: Δ is concave-convex.

[3, 4, 25] then stated that, in addition to the affine equivariance and invariance properties, the measures of location, dispersion, skewness and kurtosis should be monotone with respect to the corresponding orderings. For finding monotone measures in the dispersion case, for example, Δ is increasing if and only if

E[C(x − E(x))] ≤ E[C(y − E(y))]   for all convex C,

which is also called the dilation order. It implies for example that the measures (E[|x − E(x)|^k])^{1/k}, k ≥ 1, are monotone dispersion measures.

Consider a discrete random variable with k possible values ('alphabets') with probabilities listed in p = (p_1, ..., p_k). Write p_(1) ≤ ... ≤ p_(k) for the ordered probabilities. It is sometimes presumed that a distribution p is informative if it can provide 'surprises' with very small p_i's. On the other hand, one often claims that p is informative if the result of the experiment is known with a high probability, that is, if only one or a few values have high p_i's. These somewhat naive characterizations suggest the following well-known partial ordering for discrete distributions [19].

Definition 2.3.
Majorization: p ≺ q if

Σ_{i=1}^j p_(i) ≥ Σ_{i=1}^j q_(i),   j = 1, ..., k,

and then p is said to be majorized by q.

Majorization is nothing but a dispersion ordering (and a dilation order) for the discrete distributions with k equiprobable values p_1, ..., p_k in [0, 1] with mean 1/k. Then, according to [27],

p ≺ q ⇔ p = qL with some doubly stochastic matrix L ⇔ Σ_{i=1}^k C(p_i) ≤ Σ_{i=1}^k C(q_i) for all continuous convex C.

A doubly stochastic matrix L is a matrix with non-negative elements such that all row sums and all column sums are one. The doubly stochastic operator L is in fact a convex combination of permutations; p is obtained from q by this 'smoothing' and is therefore less informative. Further, for all p, (1/k, ..., 1/k) ≺ p ≺ (0, ..., 0, 1), and, for simple mixtures,

p ≺ q ⇒ p ≺ λp + (1 − λ)q ≺ q,   0 ≤ λ ≤ 1.
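These equivalent characterizations can be checked numerically; the sketch below builds a doubly stochastic matrix as a convex combination of a few arbitrarily chosen permutation matrices and verifies both the partial-sum and the convex-function criteria.

```python
import numpy as np

rng = np.random.default_rng(2)

q = np.array([0.6, 0.25, 0.1, 0.05])                 # an 'informative' probability vector
k = len(q)

# A doubly stochastic matrix as a convex combination of permutation matrices.
perm_mats = [np.eye(k)[list(perm)] for perm in ([0, 1, 2, 3], [1, 0, 3, 2], [3, 2, 1, 0])]
w = rng.dirichlet(np.ones(len(perm_mats)))
L = sum(wi * P for wi, P in zip(w, perm_mats))

p = q @ L                                            # smoothing: p = qL should satisfy p ≺ q

# Check p ≺ q via partial sums of the increasingly ordered probabilities ...
print(np.all(np.cumsum(np.sort(p)) >= np.cumsum(np.sort(q)) - 1e-12))   # True

# ... and via a continuous convex function, e.g. C(t) = t^2.
print(np.sum(p**2) <= np.sum(q**2) + 1e-12)                             # True
```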
We can now give the following.

Definition 2.4.
Let p = (p_1, ..., p_k) list the probabilities of the k possible values of a discrete random variable, that is, p_1, ..., p_k ∈ [0, 1] and Σ_{i=1}^k p_i = 1. A measure M(p) is an information measure if it is monotone with respect to majorization.

Note that, as (p_1, ..., p_k) ≺ (p_(1), ..., p_(k)) ≺ (p_1, ..., p_k), the definition implies that information measures are invariant under permutations of the probabilities in (p_1, ..., p_k). The equivalent conditions for majorization then suggest quantities such as

H(p) = −Σ_{i=1}^k log(p_i) p_i,   H*(p) = Σ_{i=1}^k p_i²   and   H**(p) = p_(k),

and −H, H* and H** are monotone information measures that easily extend to continuous and multivariate cases. Shannon's entropy [30], −Σ_{i=1}^k log(p_i) p_i, is often seen as a measure of the ability to compress the data (e.g., a lower bound for the expected number of bits needed to store the data).
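A minimal numerical illustration of the monotonicity of these quantities along a majorization chain; the probability vectors are illustrative examples.

```python
import numpy as np

def H(p):        # Shannon entropy (decreasing in the majorization order)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def H_star(p):   # sum of squared probabilities (increasing)
    return np.sum(p**2)

def H_2star(p):  # largest probability (increasing)
    return np.max(p)

uniform  = np.full(4, 0.25)
middling = np.array([0.4, 0.3, 0.2, 0.1])
point    = np.array([1.0, 0.0, 0.0, 0.0])

# uniform ≺ middling ≺ point, so -H, H* and H** should all increase along this chain.
for p in (uniform, middling, point):
    print(-H(p), H_star(p), H_2star(p))
```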
Consider next a continuous random variable x with a continuously differentiable probability density function f and finite variance Var(x). The three measures from the discrete case extend straightforwardly in the continuous case to

H(x) = −E[log f(x)] = −∫_{−∞}^{∞} f(x) log f(x) dx,
H*(x) = E[f(x)] = ∫_{−∞}^{∞} f(x)² dx,   and
H**(x) = sup_x f(x) = f(x_mode), if the mode x_mode exists.

The Fisher information in the location model f(· − µ) at µ = 0, given by

J(x) = ∫_{−∞}^{∞} f(x) [f′(x)/f(x)]² dx,

is also often used as an information measure [16].

The measure H(x) is popular in the literature and known as the differential entropy. Under certain restrictions, the measure has the following maximizers [7]. For distributions on ℝ with a fixed variance, H(x) is maximized if x has a normal distribution. For distributions on ℝ₊ with a fixed mean, H(x) is maximized at the exponential distribution. For distributions on a finite interval, H(x) is maximized at the uniform distribution on that interval. Note that, in Bayesian analysis, these three distributions are often used as priors that reflect 'total ignorance'.

We next show that the three straightforward extensions H, H* and H** as well as the Fisher information J provide squared dispersion measures as in Definition 2.1 but with an interesting additional invariance property. First note that the measures are invariant under location shifts of the distribution but not under rescaling of the variable. Recall that information as stated for discrete distributions is invariant under permutations of the probabilities in (p_1, ..., p_k). All permutations consist of successive pairwise exchanges of two probabilities. In the continuous case, similar elemental probability density transformations may be constructed as follows. For all a < a + Δ < b < b + Δ and a density function f, write

f_{a,b,Δ}(x) = f(x) for x ∈ ℝ \ ([a, a + Δ] ∪ [b, b + Δ]),
f_{a,b,Δ}(x) = f(b + (x − a)) for x ∈ [a, a + Δ],
f_{a,b,Δ}(x) = f(a + (x − b)) for x ∈ [b, b + Δ].

The transformation allows the manipulation of the properties of the distribution in many ways. It can for example be used to move some probability mass from the centre of the distribution to the tails and in this way to manipulate the variance and the kurtosis of the distribution. As far as we know, this transformation has not been discussed in the literature. It is surprising that the information measures H, H*, H** and J provide dispersion measures which are invariant under these transformations.
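The invariance can be checked numerically; the sketch below implements the swap on a grid for the standard normal density (the grid size and the intervals are illustrative choices) and shows that H is unchanged while the variance is not.

```python
import numpy as np

# Standard normal density on a fine grid; all integrals below are Riemann sums.
dx = 1e-4
x = np.arange(-10.0, 10.0, dx)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Swap the density values of f on [a, a + D] with those on [b, b + D].
a, b, D = 0.0, 2.5, 0.5
m = int(round(D / dx))
ia = slice(np.searchsorted(x, a), np.searchsorted(x, a) + m)
ib = slice(np.searchsorted(x, b), np.searchsorted(x, b) + m)
g = f.copy()
g[ia], g[ib] = f[ib], f[ia]

def entropy(dens):                     # H = -∫ dens log(dens) dx
    mask = dens > 0
    return -np.sum(dens[mask] * np.log(dens[mask])) * dx

def variance(dens):
    mean = np.sum(x * dens) * dx
    return np.sum((x - mean) ** 2 * dens) * dx

print(entropy(f), entropy(g))          # equal: H is invariant under the swap
print(variance(f), variance(g))        # different: mass has moved towards the tail
```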
Theorem 2.2. The entropy power e^{2H(x)} and the measures [H*(x)]^{−2}, [H**(x)]^{−2} and [J(x)]^{−1} are squared dispersion measures that are invariant under the transformations f → f_{a,b,Δ}. The measures e^{2H(x)} and [J(x)]^{−1} are superadditive.
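The superadditivity of the entropy power (the entropy power inequality) can be illustrated with two independent uniform variables, whose sum is triangular; the closed-form entropies below are taken from scipy.stats and e^{2H} is used as in Theorem 2.2.

```python
import numpy as np
from scipy import stats

# x ~ U(0,1) and y ~ U(0,1) independent; x + y has a symmetric triangular
# distribution on (0, 2). Differential entropies are available in closed form.
h_x = stats.uniform(loc=0, scale=1).entropy()          # = 0
h_y = stats.uniform(loc=0, scale=1).entropy()          # = 0
h_sum = stats.triang(c=0.5, loc=0, scale=2).entropy()  # = 0.5

N = lambda h: np.exp(2 * h)                            # entropy power e^{2H}
print(N(h_sum), ">=", N(h_x) + N(h_y))                 # ≈ 2.718 >= 2.0
```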
We now further discuss the properties of the dispersion measures in Theorem 2.2 and, to find affine invariant information measures, consider the ratios of the variance to these squared dispersion measures. The ratio of the variance to the entropy power, that is, Var(x) e^{−2H(x)}, is minimized at the normal distribution [7]. In a neighbourhood of a normal distribution the negative entropy −H(x) has an interesting approximation using third and fourth cumulants. [13] showed that the negative differential entropy for the density f(x) = ϕ(x)(1 + ε(x)), where ϕ is the density of N(0, 1) and ε is a well-behaved "small" function that satisfies E[ε(z) z^k] = 0, z ∼ N(0, 1), k = 0, 1, 2, can be approximated, up to an additive constant, by

(1/2) ∫ ϕ(x) ε(x)² dx ≈ (κ_3²(x) + (1/4) κ_4²(x)) / 12.

Next, [H*(x)]^{−2} is a squared dispersion measure, and therefore [H*(x)]² Var(x) provides an affine invariant information measure. For symmetric distributions, it preserves the concave-convex kurtosis ordering of van Zwet, and 12 [H*(x)]² Var(x) is in fact the efficiency of the Wilcoxon rank test with respect to the t-test. Also, for symmetric distributions, 4 [H**(x)]² Var(x) is a kurtosis measure in the van Zwet sense and simultaneously the efficiency of the sign test with respect to the t-test. We also mention that, if Q(x) = (E[f(F^{-1}(u))/ϕ(Φ^{-1}(u))])² with u ∼ U(0, 1), then [Q(x)]^{−1} is a squared dispersion measure and Q(x) Var(x) is the efficiency of the van der Waerden test with respect to the t-test in the symmetric case. By the Chernoff-Savage theorem, it attains its minimum 1 at the normal distribution. See [6, 9].
Finally, the information measure Var(x) J(x) ≥ 1 is minimized at the normal distribution. In the location estimation problem in the symmetric case, Var(x) J(x) is also the asymptotic relative efficiency of the maximum likelihood estimate of the symmetry centre with respect to the sample mean [29].

We next outline how to construct partial orderings for information in the univariate continuous case. Let first x be a continuous random variable with density f on (0, 1). If m(y) = µ{u : f(u) > y}, where µ is the Lebesgue measure, then the function f↓(u) = sup{y : m(y) > u}, u ∈ (0, 1), provides the decreasing rearrangement of f. Note that any density function on (0, 1) can be approximated by a simple density function f(x) = Σ_{i=1}^k α_i χ_{A_i}(x), where α_1 < α_2 < ··· < α_k and A_1, ..., A_k are disjoint Lebesgue-measurable sets on (0, 1), and χ_A is the characteristic function of the set A. Then

m(y) = Σ_{i=1}^k β_i χ_{B_i}(y)   and   f↓(u) = Σ_{i=1}^k α_i χ_{[β_{i+1}, β_i)}(u),

where β_i = Σ_{j=i}^k µ(A_j), B_i = [α_{i−1}, α_i) for i = 1, 2, ..., k, and α_0 = β_{k+1} = 0. For a better insight, see Figure 1. For more details and examples, see e.g. [15].

Figure 1: Simple function f (left), its distribution function m (middle) and decreasing rearrangement f↓ (right).

Using the decreasing rearrangement we can give the following definitions.

Definition 2.5.
Let f and g be density functions on the interval (0, 1). Then g has more information than f, written f ≺ g, if

∫_0^u f↓(v) dv ≤ ∫_0^u g↓(v) dv   for all u ∈ (0, 1).

Definition 2.6.
Let F_(0,1) be the set of density functions f on the interval (0, 1). Then M_(0,1) : F_(0,1) → ℝ is an information measure if it is monotone with respect to the partial ordering in Definition 2.5.

The distribution with minimum information is the uniform distribution on (0, 1). Information measures are easily found, see [28], as f ≺ g if and only if

∫_0^1 C(f(u)) du ≤ ∫_0^1 C(g(u)) du   for all continuous convex functions C.

[28] also discusses how to construct linear operators L for which f = Lg ≺ g when f ≺ g.
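A grid-based sketch of the decreasing rearrangement and of the ordering in Definition 2.5, using the uniform density and an arbitrarily chosen U-shaped density as the comparison pair.

```python
import numpy as np

# Densities on (0, 1) represented by their values on a midpoint grid.
n = 10_000
u = (np.arange(n) + 0.5) / n
du = 1.0 / n

f = np.ones(n)                        # the uniform density on (0, 1)
g = 12 * (u - 0.5) ** 2               # a U-shaped density, intuitively more 'informative'

def dec_rearrange(dens):
    return np.sort(dens)[::-1]        # decreasing rearrangement on the grid

# f ≺ g: the cumulative integrals of the decreasing rearrangements are ordered.
Ff = np.cumsum(dec_rearrange(f)) * du
Gg = np.cumsum(dec_rearrange(g)) * du
print(np.all(Ff <= Gg + 1e-12))       # True: the uniform density has minimum information

# A monotone information measure via the convex C(t) = t^2 (the analogue of H*).
print(np.sum(f**2) * du, np.sum(g**2) * du)   # 1.0 and 1.8
```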
Consider next a continuous random variable x on ℝ with pdf f. To find a location and scale-free version of the density, [31] proposed the transformation

f(x), x ∈ ℝ   →   f*(u) = f(F^{-1}(u)) / H*(x),   u ∈ (0, 1).

Then f*, called the probability density quantile (pdQ), is a probability density function on (0, 1) which is invariant under linear transformations of the original variable x [31]. It is also true that, for a given f*, the original f is known up to location and scale. Using this density transformation, the definition of an invariant information measure for densities on ℝ can be given as follows.

Definition 2.7.
Let F_ℝ be a set of density functions f on ℝ and let M_(0,1) : F_(0,1) → ℝ be an information measure for distributions on (0, 1). Then M_ℝ : f → M_(0,1)(f*) is an information measure in the set F_ℝ.

Note that M_ℝ is not an extension of M_(0,1), meaning that f ∈ F_(0,1) does not imply that M_ℝ(f) = M_(0,1)(f); M_ℝ is invariant under rescaling of f while M_(0,1) is not. Applying Definition 2.7 and choosing the convex functions C(u) = −log(u) and C(u) = log(u) u, we get location and scale invariant information measures for f such as

exp{−∫ log(f*(u)) du} = e^{H(x)} H*(x)   and   exp{∫ log(f*(u)) f*(u) du} = e^{−H(f²/H*(x))/2} [H*(x)]^{−1/2},

which attain their minimum at the uniform distribution and are invariant under the transformations f → f_{a,b,Δ}. For more details see e.g. [32].
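The first of these measures, e^{H(x)} H*(x) = exp{−∫ log f*(u) du}, can be approximated by quadrature directly from the pdQ; the sketch below (using scipy.stats distributions) also illustrates the scale invariance and the minimum at the uniform distribution.

```python
import numpy as np
from scipy import stats

def pdQ_measure(dist, n=200_000):
    """exp{-∫ log f*(u) du} = e^{H(x)} H*(x), via a midpoint rule on u in (0, 1)."""
    u = (np.arange(n) + 0.5) / n           # midpoint grid on (0, 1)
    fq = dist.pdf(dist.ppf(u))             # f(F^{-1}(u))
    f_star = fq / fq.mean()                # pdQ: divide by H*(x) = ∫ f(F^{-1}(u)) du
    return np.exp(-np.log(f_star).mean())  # e^{H(x)} H*(x)

print(pdQ_measure(stats.lognorm(s=1.0)))              # > 1
print(pdQ_measure(stats.lognorm(s=1.0, scale=10.0)))  # same value: scale invariance
print(pdQ_measure(stats.uniform()))                   # 1.0, the minimum
```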
To replace the transformation f → f* by a transformation to densities on (0, 1) for which the minimum information is attained at any chosen density g, one can use the following adjustment.

Theorem 2.3. Let x and y be random variables on ℝ with probability density functions f and g and cumulative distribution functions F and G, respectively. Then

(f : g)(u) = f(G^{-1}(u)) / g(G^{-1}(u))

is a density function on (0, 1), and its negative differential entropy −H(f : g) ≥ 0 is the Kullback-Leibler (KL) divergence between the distributions of x and y.

Let again x have a density f and let ϕ and Φ be the pdf and the cdf of the normal distribution with mean E(x) and variance Var(x). Then one can show, using similar arguments as in [31], that

(f : ϕ)(u) = f(Φ^{-1}(u)) / ϕ(Φ^{-1}(u)),   u ∈ (0, 1),

is a location and scale-free density, and information measures in Definition 2.6 applied to the set of densities f̃ = f : ϕ attain their minimums when f has a normal distribution. A collection of information measures is given by ∫ C(f̃(u)) du with continuous and convex functions C, and then we get for example again

exp{2 ∫ log(f̃(u)) f̃(u) du} = (2πe) e^{−2H(x)} Var(x).

We next provide examples of the probability density functions f, f* and f̃ when f is the density of the Gaussian, Laplace, Lognormal and Uniform distributions. Also a mixture of two Gaussian distributions, denoted by GMM(µ_1, µ_2, σ_1, σ_2, w), is considered, with density w ϕ_{µ_1,σ_1}(x) + (1 − w) ϕ_{µ_2,σ_2}(x), 0 ≤ w ≤ 1. Figure 2 then shows the impact of the transformations f → f* and f → f̃ in these cases.

Distribution        e^{2H(f)}  e^{2H(f*)}  e^{2H(f̃)}  [H*(f)]^{-2}  [H*(f*)]^{-2}  [H*(f̃)]^{-2}
N(0,1)                 17.079      0.824      1.000        12.566         0.750         1.000
Laplace(1)             29.556      0.680      0.887        16.000         0.719         0.783
Lognormal(0,1)         17.079      0.642      0.308         7.622         0.537         0.186
U(0,1)                  1.000      1.000      0.703         1.000         1.000         0.567
GMM(·)                100.000      0.862      0.855        78.000         0.792         0.756

Table 1: The power entropy and the [H*]^{-2} measure for some continuous distributions and their transformations.
Table 1 provides for the same distributions the power entropies e^{2H(·)} and the measures [H*(·)]^{-2} for f, f* and f̃. Note that the information measures applied to f are not invariant under rescaling of x, as opposed to those applied to f* and f̃. For example, for the settings used in Table 1, the normal and lognormal densities have the same power entropy just by accident, and the equality is not generally true.

Figure 2: Comparison of f, f* and f̃ for five distributions (Normal, Uniform, Laplace, Lognormal, Gaussian mixture).
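Entries of this kind can be computed by quadrature; the sketch below, which uses the matched normal density for f̃ as described above, approximately recovers the first and fourth columns of the N(0,1) and U(0,1) rows and the corresponding transformed values.

```python
import numpy as np
from scipy import stats

def table_row(dist, n=400_000):
    """e^{2H} and [H*]^{-2} for f, f* and f~ of a scipy distribution (quadrature sketch)."""
    u = (np.arange(n) + 0.5) / n                    # midpoint grid on (0, 1)
    fx = dist.pdf(dist.ppf(u))                      # f(F^{-1}(u)); note du = f(x) dx

    def measures_01(g):                             # e^{2H} and [H*]^{-2} for g on (0, 1)
        g_log_g = np.where(g > 0, g * np.log(np.where(g > 0, g, 1.0)), 0.0)
        return np.exp(-2 * np.mean(g_log_g)), np.mean(g**2) ** -2

    H_f, H_star_f = -np.mean(np.log(fx)), np.mean(fx)    # H(f) = -E[log f(x)], H*(f) = E[f(x)]
    f_star = fx / H_star_f                                # the pdQ

    phi = stats.norm(loc=dist.mean(), scale=dist.std())  # matched normal for f~ = f : phi
    y = phi.ppf(u)
    f_tilde = dist.pdf(y) / phi.pdf(y)

    return (np.exp(2 * H_f), H_star_f ** -2), measures_01(f_star), measures_01(f_tilde)

print(table_row(stats.norm()))     # ≈ (17.08, 12.57), (0.82, 0.75), (1.00, 1.00)
print(table_row(stats.uniform()))  # ≈ (1.00, 1.00), (1.00, 1.00), (0.70, 0.57)
```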
For a better understanding of the measures, we illustrate the behaviour of e^{2H(·)} and [H*(·)]^{-2} in the GMM model with four fixed and one varying parameter, each in turn. In Figure 3 both information measure curves are plotted in the same figures to compare the shapes of the curves as well as the locations of their extreme values. The curves for f̃ with varying location and scale seem natural, as minimum information is attained when the GMM gets "closer" to the normal distribution. The results for f* and varying location seem strange in the sense that one would expect decreasing behaviour of both measures as the distance between the means increases, as is the case for f̃, while the result for f in all three cases could simply be explained by a decrease in information resulting from an increase in the overall variance of the mixture. Moreover, e^{2H(·)} and [H*(·)]^{-2} seem to behave almost proportionally in all cases. In the cases of f* and f̃, where majorization is well defined, such behaviour is indeed expected, as the reciprocals of both e^{2H(·)} and [H*(·)]^{-2} are information measures for both f* and f̃. However, further investigations into this matter will be conducted in the future.

In this section we consider multivariate random variables. For a p-variate random vector x with finite second moments, the mean vector and covariance matrix are E(x) ∈ ℝ^p and Cov(x) ∈ ℝ^{p×p}, respectively. Let Cov(x) = UDU′ be the eigenvector-eigenvalue decomposition of the covariance matrix.
Then Cov(x)^{−1/2} := UD^{−1/2}U′, and x_st = Cov(x)^{−1/2}(x − E(x)) standardizes x, that is, E(x_st) = 0 and Cov(x_st) = I_p. The set of p × r, r ≤ p, matrices with orthonormal columns is denoted by O^{p×r}. Thus U ∈ O^{p×r} implies U′U = I_r. The set of p × p diagonal matrices with positive diagonal elements is denoted by D^{p×p}. If U ∈ O^{p×p} and D ∈ D^{p×p}, then x → Ux and x → Dx, x ∈ ℝ^p, are a rotation operator and a componentwise rescaling operator, respectively. Let A ∈ ℝ^{p×q} be a matrix with rank r ≤ min{p, q}. Then the linear operator A may be written as (singular value decomposition, SVD) A = UDV′ = Σ_{i=1}^r d_i u_i v_i′, where U = (u_1, ..., u_r) ∈ O^{p×r}, V = (v_1, ..., v_r) ∈ O^{q×r}, and D ∈ D^{r×r}.

Let x be a p-variate vector with a full-rank covariance matrix Cov(x). We say that x has a spherical distribution if there exists a µ such that (x − µ) ∼ U(x − µ) for all orthogonal U.
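A minimal sketch of this standardization for a data matrix, using the eigendecomposition of the sample covariance matrix; the sample and the mixing matrix are an arbitrary toy example.

```python
import numpy as np

rng = np.random.default_rng(3)

# A sample from a correlated trivariate distribution (rows are observations).
A_true = rng.normal(size=(3, 3))
X = rng.exponential(size=(5_000, 3)) @ A_true.T

# Standardization x_st = Cov(x)^{-1/2} (x - E(x)) with Cov(x)^{-1/2} = U D^{-1/2} U'.
mean = X.mean(axis=0)
evals, U = np.linalg.eigh(np.cov(X, rowvar=False))
cov_inv_sqrt = U @ np.diag(evals ** -0.5) @ U.T
X_st = (X - mean) @ cov_inv_sqrt          # cov_inv_sqrt is symmetric

print(np.round(X_st.mean(axis=0), 3))            # ≈ 0
print(np.round(np.cov(X_st, rowvar=False), 3))   # ≈ identity matrix
```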
In the following we first define the elliptical and independent components distributions (see for example [23, 24] for more details).

Definition 3.1. Let x ∈ ℝ^p be a p-variate random vector.
1. x has an elliptical distribution if there exists a nonsingular A ∈ ℝ^{p×p} such that Ax has a spherical distribution.
2. x has an independent components distribution if there exists a nonsingular A ∈ ℝ^{p×p} such that Ax has independent components.

We next provide some results on how the matrix A can be found in the different cases.

Theorem 3.1.
Let x be a p-variate random vector with a full-rank covariance matrix Cov(x) = UDU′. Then we have the following.
1. [VD^{−1/2}U′]x has uncorrelated components for all orthogonal V.
2. If x has an elliptical distribution, [VD^{−1/2}U′]x has a spherical distribution for all orthogonal V.
3. If x has an independent components distribution, [VD^{−1/2}U′]x has independent components for some choice(s) of orthogonal V.
4. If x has both an elliptical distribution and an independent components distribution, then [VD^{−1/2}U′]x has independent Gaussian components for all orthogonal V, that is, x has a multivariate Gaussian distribution.

Let x have an independent components distribution such that z = Ax + b is standardized (E(z) = 0 and Cov(z) = I_p) and has independent components. Theorem 3.1 then implies that A = V′Cov(x)^{−1/2}, where the rotation matrix V can be chosen as V = (V_1, V_2), separating non-Gaussian independent components in V_1′Cov(x)^{−1/2}x and Gaussian independent components in V_2′Cov(x)^{−1/2}x. Note that V is only unique up to right multiplication by an orthogonal matrix.
A generally accepted strategy is to find V_1 = (v_1, ..., v_q) ∈ O^{p×q} such that the components of V_1′x_st are as 'non-Gaussian as possible'. The Gaussian part V_2′Cov(x)^{−1/2}x is thought to be just the noise part and, for the other components, it is argued that the sum of independent random variables is 'more Gaussian' than the original variables. The noise interpretation of the Gaussian part may be motivated by the following. A random vector has a multivariate normal distribution if and only if all linear combinations of the marginal variables have univariate normal distributions, that is, there are no 'interesting' directions. The normal distribution is the only distribution for which all third and higher cumulants are zero. As seen before, a Gaussian distribution is the distribution with the poorest information among distributions with the same variance (highest entropy, smallest Fisher information). For a thorough discussion of Gaussian distributions, see [14].

Figure 3: Power entropy and [H*]^{-2} for different GMMs when one parameter varies at a time: (a) a location parameter µ varies, (b) a scale parameter σ varies, (c) the mixing weight w varies. The left vertical axis corresponds to the power entropy and the right axis to [H*]^{-2}. The left panel gives the measures for f, the middle for f* and the right for f̃.
Let D(x) then be the projection index, i.e., the functional that is used to measure non-Gaussianity. In the one-by-one projection pursuit approach the first direction v_1 (v_1′v_1 = 1) maximizes D(v_1′x_st), the second direction v_2 is orthogonal to v_1 (v_2′v_2 = 1, v_2′v_1 = 0) and maximizes D(v_2′x_st), and so on. After finding v_1, ..., v_{j−1}, we optimize the Lagrangian function

L(v; λ_{j1}, ..., λ_{jj}) = D(v′x_st) − λ_{jj}(v′v − 1) − Σ_{i=1}^{j−1} λ_{ji} v′v_i.

Then v_j solves the (estimating) equation (I_p − Σ_{i=1}^{j−1} v_i v_i′) T(v) = (T(v)′v) v, where T(v) = ∂D(v′x_st)/∂v. From the computational point of view, this suggests a fixed-point algorithm. The estimating equation also provides a way to find the limiting distribution of the estimate, since the estimate is obtained when the theoretical multivariate distribution is replaced by the empirical one. See for example [20, 21, 22] and references therein for more details.
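As an illustration of the one-by-one approach, the following sketch runs a kurtosis-based fixed-point iteration with deflation on standardized data; this is a FastICA-type variant using the fourth cumulant as projection index, not the exact estimators studied in [20, 21, 22], and the toy sources and mixing below are arbitrary.

```python
import numpy as np

def deflation_ica(z, q, n_iter=200, seed=0):
    """One-by-one projection pursuit on standardized data z (n x p) with a
    kurtosis-based fixed-point update and Gram-Schmidt deflation."""
    rng = np.random.default_rng(seed)
    n, p = z.shape
    V = np.zeros((p, q))
    for j in range(q):
        v = rng.normal(size=p)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            s = z @ v
            v_new = (z * (s**3)[:, None]).mean(axis=0) - 3 * v   # E[z (v'z)^3] - 3v
            v_new -= V[:, :j] @ (V[:, :j].T @ v_new)             # deflate against found rows
            v_new /= np.linalg.norm(v_new)
            converged = np.abs(v_new @ v) > 1 - 1e-10
            v = v_new
            if converged:
                break
        V[:, j] = v
    return V

# Toy example: mix two non-Gaussian sources, standardize, then recover the rotation.
rng = np.random.default_rng(1)
s = np.column_stack([rng.uniform(-1, 1, 20_000), rng.exponential(1.0, 20_000)])
x = s @ rng.normal(size=(2, 2)).T                   # mixed observations
x_c = x - x.mean(axis=0)
evals, U = np.linalg.eigh(np.cov(x_c, rowvar=False))
z = x_c @ U @ np.diag(evals ** -0.5) @ U.T          # standardized data x_st
V = deflation_ica(z, q=2)
# Each recovered component should correlate (up to sign) with exactly one source.
print(np.round(np.corrcoef((z @ V).T, s.T), 2))
```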
The following questions naturally arise. How should one choose the projection index D(x) to find the independent components? Are the independent components provided by the most informative directions, as has often been stated in the literature? These questions are partially answered by the following.

Theorem 3.2. Let z = Ax + b = (z_1, ..., z_p)′ be the vector of standardized independent components.
1. Let D(x) be a subadditive squared dispersion measure. Then D(v′x_st) ≤ max_j D(z_j) for all v with v′v = 1.
2. Let D(x) be a superadditive squared dispersion measure. Then D(v′x_st) ≥ min_j D(z_j) for all v with v′v = 1.

Based on Theorem 3.2 and the discussion above, we can now end the paper with the following conclusions. If D(x) is subadditive, then it can be used as a projection index. For example, the cumulants κ_{2k+1}^{2/(2k+1)}(x) and κ_{2k+2}^{2/(2k+2)}(x), k = 1, 2, ..., when calculated for standardized distributions, provide squared dispersion measures that are subadditive. Therefore they can be used as projection indices. For superadditive D(x), the functional (D(x))^{−1} is a valid projection index as (D(v′x_st))^{−1} ≤ max_j (D(z_j))^{−1}. As seen before, the entropy power e^{2H(x)} and the inverse of the Fisher information, J^{−1}(x), are superadditive squared dispersion measures. Note that in both cases D(v′x_st) is in fact a ratio of two squared dispersion functions, and the projection index measures deviation from Gaussianity using a skewness, kurtosis or information measure. As mentioned in Section 3.3, (κ_3²(x) + (1/4) κ_4²(x))/12 provides an approximation of the negative differential entropy in a neighbourhood of the Gaussian distribution and is a valid projection index as well. For further discussion, see [10]. Note also that one of the most popular ICA procedures in the engineering community, the so-called fastICA, uses a projection index of the form D(x) = |E[C(x)]|, where C is a function such that E[C(z)] = 0 if z ∼ N(0, 1). Examples of valid choices of C are C(z) = z³ and C(z) = z⁴ − 3, providing again the third and fourth cumulants, respectively.

The usage of various information criteria is popular in independent component analysis. The connections between notions of information and statistical independence and the special role of the Gaussian distribution were discussed in detail in this paper. We also introduced new ideas and partial orderings for information which utilize transformed location and scale-free probability density functions. In independent component analysis with unknown marginal densities, the estimation of the value of the adapted information measure in a given direction is highly challenging, and it has to be done repeatedly when applying the fixed-point algorithm to find the correct direction. Substantial research is therefore still needed for these tools to be of practical value.
Proof of Theorem 2.1.
Let x_1, ..., x_n be a random sample from the distribution of x with mean E(x) and variance Var(x). By the central limit theorem,

z_n = (1/√n) Σ_{i=1}^n (x_i − E(x))/√Var(x) →_d z ∼ N(0, 1).

Therefore, by additivity and affine equivariance,

T(z_n) = √(n/Var(x)) (T(x) − E(x)) → 0   and   S(z_n) = S(x)/Var(x) → S(z),
and the result follows. For similar results in the multivariate case, see [34].
Proof of Theorem 2.2.
The invariances of the measures H(x), H*(x), H**(x) and J(x) under location shifts f(x) → f(x + b) and sign change f(x) → f(−x), as well as under f → f_{a,b,Δ}, follow easily from their definitions and from the definition of the Riemann integral. We therefore only have to consider the rescaling f(x) → (1/a) f(x/a) with a > 0. Then

H(ax) = −∫ (1/a) f(x/a) log((1/a) f(x/a)) dx = −∫ f(x) log((1/a) f(x)) dx = H(x) + log(a),

and therefore e^{2H(ax)} = a² e^{2H(x)}. In a similar way one can show that [H*(ax)]^{−2} = a² [H*(x)]^{−2}. Also easily [H**(ax)]^{−2} = a² [H**(x)]^{−2}. As f′(x) → (1/a²) f′(x/a), one also easily shows that [J(ax)]^{−1} = a² [J(x)]^{−1}. Thus all four measures are scale equivariant and therefore squared dispersion measures.

Proof of Theorem 2.3. f : g is indeed a density function since it is trivially nonnegative and ∫ (f : g)(u) du = ∫ f(G^{-1}(u))/g(G^{-1}(u)) du = ∫_{−∞}^{∞} f(x) dx = 1 with the substitution x = G^{-1}(u). Similarly,

−H(f : g) = ∫ (f : g)(u) log((f : g)(u)) du = ∫_{−∞}^{∞} f(x) log(f(x)/g(x)) dx = D(f || g).

Proof of Theorem 3.1. (1) Let V be orthogonal. As Cov([VD^{−1/2}U′]x) = VD^{−1/2}U′ Cov(x) UD^{−1/2}V′ = VV′ = I_p, the components of [VD^{−1/2}U′]x are uncorrelated. (2) Assume that Ax is spherical with A = VCW′ rescaled so that
Cov(Ax) = I_p. As A Cov(x) A′ = I_p, Cov(x) = (A′A)^{−1} and WC^{−2}W′ = UDU′. Therefore W = U and C = D^{−1/2}, and we can conclude that [VD^{−1/2}U′]x is spherical for any orthogonal V. (If x is spherical then Vx is spherical for all orthogonal V.) (3) Let Ax with A = VCW′ have independent and standardized components so that
Cov(Ax) = I_p. As in (2), A must be VD^{−1/2}U′, but now for some V only. (It is not true that if x has independent standardized components then Vx has independent components for any choice of V.) (4) Based on (2) and (3), there exists an A = VD^{−1/2}U′ such that Ax has a spherical distribution with independent components. Then, by the Maxwell-Herschel theorem, Ax has a multivariate normal distribution. For a proof of the Maxwell-Herschel theorem, see e.g. Proposition 4.11 in [5].

Proof of Theorem 3.2.
Let z = Ax + b = (z_1, ..., z_p)′ be a vector of standardized independent components. By Theorem 3.1, z = Vx_st with some orthogonal V. If u′u = 1 then also (Vu)′(Vu) = 1, and therefore

D(u′x_st) = D((Vu)′z) ≤ Σ_i (Vu)_i² D(z_i) ≤ max_j D(z_j)

for a subadditive squared dispersion measure D, and

D(u′x_st) = D((Vu)′z) ≥ Σ_i (Vu)_i² D(z_i) ≥ min_j D(z_j)

for a superadditive squared dispersion measure D.

The work of KN has been supported by the Austrian Science Fund (FWF) Grant number P31881-N32.
References

[1] A. R. Barron: Entropy and the central limit theorem. Ann. Probab. (1986), 336–342.
[2] A. J. Bell, T. J. Sejnowski: An information-maximization approach to blind separation and blind deconvolution. Neural Comput. (1995), 1129–1159.
[3] P. J. Bickel, E. L. Lehmann: Descriptive statistics for nonparametric models II: Location. Ann. Stat. (1975), 1045–1069.
[4] P. J. Bickel, E. L. Lehmann: Descriptive statistics for nonparametric models III: Dispersion. Ann. Stat. (1976), 1139–1158.
[5] M. Bilodeau, D. Brenner: Theory of multivariate statistics. Springer Texts in Statistics. New York: Springer (1999).
[6] H. Chernoff, I. R. Savage: Asymptotic normality and efficiency of certain nonparametric test statistics. Ann. Math. Stat. (1958), 972–994.
[7] T. Cover, J. Thomas: Elements of information theory. New York: John Wiley & Sons (1991).
[8] L. Faivishevsky, J. Goldberger: ICA based on a smooth estimation of the differential entropy. Advances in Neural Information Processing Systems (2008), 433–440.
[9] J. L. Hodges, E. L. Lehmann: The efficiency of some nonparametric competitors of the t-test. Ann. Math. Stat. (1956), 324–335.
[10] P. J. Huber: Projection pursuit. Ann. Stat. (1985), 435–475.
[11] A. Hyvärinen: New approximations of differential entropy for independent component analysis and projection pursuit. Advances in Neural Information Processing Systems (1998), 273–279.
[12] A. Hyvärinen, J. Karhunen, E. Oja: Independent component analysis. John Wiley & Sons, New York (2001).
[13] M. C. Jones, R. Sibson: What is projection pursuit? J. R. Stat. Soc., Ser. A 150 (1987), 1–36.
[14] K. Kim, G. Shevlyakov: Why Gaussianity? IEEE Signal Process. Mag. (2008), 102–113.
[15] E. Kristiansson: Decreasing Rearrangement and Lorentz L(p,q) Spaces (Thesis). Department of Mathematics of the Lulea University of Technology (2002). Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.111.1244&rep=rep1&type=pdf.
[16] S. Kullback: Information theory and statistics. John Wiley and Sons, Inc., New York; Chapman and Hall, Ltd., London (1959).
[17] E. G. Learned-Miller, J. W. Fisher III: ICA using spacings estimates of entropy. J. Mach. Learn. Res. (2004), 1271–1295.
[18] B. G. Lindsay, W. Yao: Fisher information matrix: A tool for dimension reduction, projection pursuit, independent component analysis, and more. Can. J. Statistics (2012), 712–730.
[19] A. W. Marshall, I. Olkin: Inequalities: Theory of majorization and its applications. Mathematics in Science and Engineering, Vol. 143. Academic Press, New York (1979).
[20] J. Miettinen, K. Nordhausen, H. Oja, S. Taskinen: Deflation-based FastICA with adaptive choices of nonlinearities. IEEE Trans. Signal Process. (2014), 5716–5724.
[21] J. Miettinen, K. Nordhausen, H. Oja, S. Taskinen: Fourth moments and independent component analysis. Stat. Sci. (2015), 372–390.
[22] J. Miettinen, K. Nordhausen, H. Oja, S. Taskinen, J. Virta: The squared symmetric fastICA estimator. Signal Process. (2017), 402–411.
[23] K. Nordhausen, H. Oja: Independent component analysis: a statistical perspective. Wiley Interdiscip. Rev. Comput. Stat. (2018), e1440.
[24] K. Nordhausen, H. Oja: Robust nonparametric inference. Annu. Rev. Stat. Appl. (2018), 473–500.
[25] H. Oja: On location, scale, skewness and kurtosis of univariate distributions. Scand. J. Stat. (1981), 154–68.
[26] E. Parzen: Quantile probability and statistical data modeling. Statist. Sci. (2004), 652–662.
[27] J. E. Pečarić, F. Proschan, Y. L. Tong: Convex functions, partial orderings, and statistical applications. Mathematics in Science and Engineering, 187. Academic Press, Boston (1992).
[28] J. V. Ryff: On the representation of doubly stochastic operators. Pacific J. Math. (1963), 1379–1386.
[29] R. Serfling: Asymptotic relative efficiency in estimation. International Encyclopedia of Statistical Science. Springer (2011), 68–72.
[30] C. E. Shannon: A mathematical theory of communication. The Bell System Technical Journal (1948), 379–423.
[31] R. G. Staudte: The shapes of things to come: probability density quantiles. Statistics (2017), 782–800.
[32] R. G. Staudte, A. Xia: Divergence from, and convergence to, uniformity of probability density quantiles. Entropy (2018), Paper No. 317, 10.
[33] V. Vigneron, C. Jutten: Fisher information in source separation problems. Lecture Notes in Computer Science (2004), 168–176.
[34] J. Virta: On characterizations of the covariance matrix (2018). Preprint available as arXiv:1810.01147.
[35] J. Virta, K. Nordhausen: On the optimal nonlinearities for Gaussian mixtures in FastICA. Latent Variable Analysis and Signal Separation. 13th International Conference, LVA/ICA 2017, Grenoble, France, February 21-23, 2017, Proceedings, 427–437.
[36] W. R. van Zwet: Convex transformations of random variables. Mathematical Centre Tracts, Mathematisch Centrum, Amsterdam (1964).