[PDF] WICA: nonlinear weighted ICA

Abstract

Independent Component Analysis (ICA) aims to find a coordinate system in which the components of the data are independent. In this paper we construct a new nonlinear ICA model, called WICA, which obtains better and more stable results than other algorithms. A crucial tool is given by a new efficient method of verifying nonlinear dependence with the use of computation of correlation coefficients for normally weighted data. In addition, authors propose a new baseline nonlinear mixing to perform comparable experiments, and a~reliable measure which allows fair comparison of nonlinear models. Our code for WICA is available on Github this https URL.

Full PDF

WWICA: nonlinear weighted ICA

Andrzej Bedychaj and Przemysław Spurek and

Aleksandra Nowak and

Jacek Tabor

Abstract.

Independent Component Analysis (ICA) aims to ﬁnd acoordinate system in which the components of the data are indepen-dent. In this paper we construct a new nonlinear ICA model, calledWICA, which obtains better and more stable results than other algo-rithms. A crucial tool is given by a new efﬁcient method of verifyingnonlinear dependence with the use of computation of correlation co-efﬁcients for normally weighted data. In addition, authors propose anew baseline nonlinear mixing to perform comparable experiments,and a reliable measure which allows fair comparison of nonlinearmodels. Our code for WICA is available on Github . The goal of Linear Independent Component Analysis (ICA) is toﬁnd such a unmixing function of the given data that the result-ing representation has statistically independent components. Com-mon tools solving this problem are based on maximizing some mea-sure of nongaussianity, e.g. kurtosis (Hyvärinen, 1999; Bell and Se-jnowski, 1995) or skewness (Spurek et al., 2017). Clearly, an ob-vious limitation of those approaches is the assumption of linear-ity, as the real world data usually contains complicated and non-linear dependencies (see for instance (Larson, 1998; Ziehe et al.,2000)). Designing an efﬁcient and easily implementable nonlinearanalogue of ICA is a much more complex problem than its lin-ear counterpart. A crucial complication is that without any limita-tions imposed on the space of the mixing functions the problem ofnonlinear-ICA is ill-posed, as there are inﬁnitely many valid solu-tions (Hyvärinen and Pajunen, 1999).As an alternative to the fully unsupervised setting of the nonlin-ear ICA one can assume some prior knowledge about the distributionof the sources, which allows to obtain identiﬁability (Hyvarinen andMorioka, 2016; Hyvärinen et al., 2019). Several algorithms exploit-ing this property have been recently proposed, either assuming ac-cess to segment labels of the sources (Hyvarinen and Morioka, 2016),temporal dependency of the sources (Hyvärinen and Morioka, 2017)or, generally, that the sources are conditionally independent, and theconditional variable is observed along with the mixes (Hyvärinenet al., 2019; Khemakhem et al., 2019). However it may be some-times hard to generalize those approaches in fully unsupervised set-ting where some prior knowledge is unavailable or the qualities ofthe data itself preserve unknown for the researcher.An additional complication in devising nonliear-ICA algorithmslies in proposing an efﬁcient measure of independence, which opti-mization would encourage the model to disentangle the components.One of the most common nonlinear method is MISEP (Almeida,2003) which, similar to the popular INFOMAX algorithm (Bell and Jagiellonian University, Poland, email: [email protected] https://github.com/gmum/wica Sejnowski, 1995), uses the mutual information criterion. In conse-quence, the procedure involves the calculation of the Jacobian of themodeled nonlinear transformation, which often causes a computa-tional overhead when both the input and output dimensions are large.Another approach is applied in NICE (Nonlinear IndependentComponent Estimation) (Dinh et al., 2014). Authors propose a fullyinvertible neural network architecture where the Jacobian is triviallyobtained. The independent components are then estimated using themaximum likelihood criterion. The drawback of both MISEP andNICE is that they require choosing the prior distribution family ofthe unknown independent components. An alternative approach isgiven by ANICA (Adversarial nonlinear ICA) (Brakel and Bengio,2017), where the independence measure is directly learned in eachtask with the use of GAN-like adversarial method combined with anan autoencoder architecture. However, the introduction of a GAN-based independence measure results in an often unstable adversarialtraining.In this paper we present a competitive approach to nonlinear in-dependent components analysis – WICA (Nonlinear Weighted ICA).Crucial role in our approach is played by the conclusion from (Bedy-chaj et al., 2019), which proves that to verify nonlinear indepen-dence it is sufﬁcient to check the linear independence of the nor-mally weighted dataset, see Fig. 1. Based on this result we intro-duce weighted indepedence index ( wii ) which relies on computingweighted covariance and can be applied to the veriﬁcation of the non-linear independence, see Section 2. Consequently, the constructedWICA algorithm is based on simple operations on matrices, andtherefore is ideal for GPU calculation and parallel processing. Weconstruct it by incorporating the introduced cost function in a com-monly used in ICA problems auto-encoder framework (Brakel andBengio, 2017; Le et al., 2011), where the role of the decoder is tolimit the unmixing function so that the learned by the encoder inde-pendent components contained the information needed to reconstructthe inputs, see Section 3.We veriﬁed our algorithm in the case of a source signal separa-tion problem. In Section 6, we presented the results of WICA fornonlinear mixes of images and for the decomposition of electroen-cephalogram signals. It occurs that WICA outperforms other meth-ods of nonlinear ICA, both with respect to unmixing quality and thestability of the results, see Fig. 10.To fairly evaluate various nonlinear ICA methods in the caseof higher dimensional datasets, we introduce a measure index calledOTS based on Spearman’s rank correlation coefﬁcient. In the deﬁni-tion of OTS, similarly to the clustering accuracy (ACC) (Cai et al.,2005, 2010), we used optimal transport to obtain the minimal mis-match cost. This approach has its merit here, since the correspon-dence between the input coordinates and the reconstructed compo-nents in a higher dimensional space is nontrivial.Another important ingredient of this paper is the introduction a r X i v : . [ c s . L G ] D ec f a new and fully invertible nonlinear mixing function. In the caseof linear ICA, one can easily construct many experiment settings thatcan be used in order to evaluate and compare different methods. Suchstandards are unfortunately not present in the case of nonlinear ICA.Therefore it is not clear what kind of nonlinear mixing should beused in the benchmark experiments. In most cases the authors usuallyuse mixing functions, which correspond with the models architec-ture (Almeida, 2003; Brakel and Bengio, 2017). In contrast to suchmethodology, we propose a new iterative nonlinear mixing functionbased on the ﬂow models (Dinh et al., 2014; Kingma and Dhariwal,2018). This method does not relates to internal design of our networkarchitecture, is invertible and allows for chaining the task complex-ity by varying the number of iterations, making it a useful tool inveriﬁcation of the nonlinear ICA models. Let us consider a random vector X in R d with density f . Then X hasindependent components iff f factors as f ( x , x , . . . , x d ) = f ( x ) · f ( x ) · . . . · f d ( x d ) , for some densities f i , where i ∈ { , , . . . , d } . Those functions arecalled marginal densities of f . A related, but much weaker notion, isthe uncorrelatedness. We say that X has uncorrelated components,if the covariance of X is diagonal. Contrary to the independence,correlation has fast and easy to compute estimators. Components in-dependence implies uncorrelatedness, but the opposite is not valid,see Fig. 1. Figure 1.

Sample from a random vector which Pearson’s correlation isequal to zero (left), but the components are not independent. Since the compo-nents are not independent, one can choose Gaussian weights so that Pearson’scorrelation of weighted dataset is not zero (right).

Let us mention that there exist several measures which verifythe independence. One of the most well-known measures of indepen-dence of random vectors is the distance correlation (dCor) (Székelyet al., 2007), which is applied in (Matteson and Tsay, 2017) to solvethe linear ICA problem. Unfortunately, to verify the independence ofcomponents of the samples, dCor needs d N comparisons, where d is the dimension of the sample and N is the sample size. Moreover,even a simpliﬁed version of dCor which checks only pairwise inde-pendence has high complexity and does not obtain very good results(which can be seen in experiments from Section 6). This motivatesthe research into fast, stable and efﬁcient measures of indepedence,which are adapted to GPU processing. In this subsection we ﬁll this gap and introduce a method of verifyingindependence which is based on the covariance of the weighted data. The covariance scales well with respect to the sample size and datadimension, therefore the proposed covariance-based index inheritssimilar properties.To proceed further, let us introduce weighted random vectors.

Deﬁnition 2.1.

Let w : R d → R + be a bounded weighting function.By X w we denote a weighted random vector with a density f w ( x ) = w ( x ) f ( x ) (cid:82) w ( z ) f ( z ) dz . Observation 2.1.

Let X be a random vector which has independentcomponents, and let w be an arbitrary weighting function. Then X w has independent components as well. One of the main results of (Bedychaj et al., 2019) is that the strongversion of the inverse of the above theorem holds. Given m ∈ R d ,we consider the weighting of X by the standard normal gaussian withcenter at m ( N( m, I ) ): X [ m ] = X N( m, I ) . We quote the following result which follows directly from the proofof Theorem 2 from (Bedychaj et al., 2019):

Theorem 2.1.

Let X be a random vector, let p ∈ R d and r > be arbitrary. If X [ q ] has linearly independent components for every q ∈ B ( p, r ) , where B ( p, r ) is a ball with center in p and radius r ,then X has the independent components. Given sample X = ( x i ) ⊂ R d , vector p ∈ R d , and weights w i = N( p, I )( x i ) , we deﬁne the weighted sample as: X [ p ] = ( x i , w i ) . Then the mean and covariance for the weighted sample X [ p ] = ( x i , w i ) is given by: mean X w = 1 (cid:80) i w i (cid:88) i w i x i and cov X w = 1 (cid:80) i w i (cid:88) i w i ( x i − mean X w ) T ( x i − mean X w ) . The informal conclusion from the above theorem can be statedas follows: if cov X [ p ] is (approximately) diagonal for a sufﬁcientlylarge set of p , then the sample X was generated from a distributionwith independent components. Let us now deﬁne an index which will measure the distancefrom being independent. We deﬁne the weighted independence in-dex ( wii ( X, p )) as wii ( X, p ) = 2 d ( d − (cid:88) i

In the experiment, we sampled twenty points from

N(0 , I ) (x-axis). Then, we calculate weights of the points respec-tively to N(0 , I ) and N (cid:0) , d I (cid:1) . We present values of those weights(sorted decreasingly) in the case when the center is chosen according to N(0 , I ) vs. N (cid:0) , d I (cid:1) . One can see that weights derived from N (cid:0) , d I (cid:1) actually balance more data points, in contrary to N(0 , I ) which focus onsmaller amount of data ( N(0 , I ) converges to 0 earlier). Consider the case when the data come from the standard normaldistribution. For given weights w and density f we deﬁne measure P ( w, f ) as: P ( w, f ) = (cid:0)(cid:82) w ( x ) f ( x ) dx (cid:1) (cid:82) w ( x ) f ( x ) dx . (1)Observe that if w is constant on a subset U of some space S (forwhich functions w and f are well-deﬁned) and zero otherwise, thenthe above reduces to µ ( U ) , where µ is counting measure. Intuitively, P ( w, f ) returns the percentage of the population which has nontriv-ial weights. Let us consider the case when µ is given by the standard normaldensity w ( x ) = N( p, I )( x ) and our dataset is normalized as stated above. Then, directly from (1),one obtains: P ( w, f ) = (cid:0)(cid:82) N( p, I )( x )N(0 , I )( x ) dx (cid:1) (cid:82) N( p, I ) ( x )N(0 , I )( x ) dx Applying the formula for the product of two normal densities: N( m , Σ )( x ) · N( m , Σ )( x ) = c c N( m c , Σ c )( x ) , where c c = N( m − m , Σ + Σ )(0) , Σ c = (Σ − + Σ − ) − , and m c = Σ c (Σ − m + Σ − m ) , we get: (cid:90) N( p, I )( x )N(0 , I )( x ) dx = N( p, I )(0) , for the numerator, and (cid:90) N ( p, I ) ( x )N(0 , I )( x ) dx = N(0 , I )(0)N (cid:0) p, I (cid:1) (0) . for the denominator. The equation for the denominator follows fromthe simple fact that: N( p, I ) ( x ) = N(0 , I )(0) · N (cid:0) p, I (cid:1) ( x ) , Summarizing, we obtain that P (N( p, I ) , N(0 , I )) = N( p, I ) (0)N(0 , I )(0)N (cid:0) p, I (cid:1) (0)= (cid:0) (cid:1) D/ exp (cid:0) − (cid:107) p (cid:107) (cid:1) . (2)Normalizing (2) by its maximum obtained at , we get exp (cid:0) − (cid:107) p (cid:107) (cid:1) . Clearly if p would be chosen from the standard normal distribu-tion, the value of (cid:107) p (cid:107) for large dimensions equals approximately d ,and consequently the weights for the randomly chosen points willbecome concentrated at a single point (see Fig 2). To obtain the quo-tient approximately constant, we should choose p so that its norm isapproximately one. Hence, it leads to the choice of p from the distri-bution N (cid:0) , d I (cid:1) .One can observe, that if X ∼ N(0 , I ) , then we can sample from N (cid:0) , d I (cid:1) by taking the mean of d randomly chosen vectors from X .This leads to the following deﬁnition: eﬁnition 2.2. For the dataset X ⊂ R d , we deﬁne wii ( X ) = E { wii ( Y, p ) : p a mean of random d elements of Y } , where Y is a componentwise normalization of X and E stands forexpected value.Let us summarize why centering the weights at the mean of d ele-ments from the dataset has good properties: • if the data is restricted to some subspace S of the space, then meanalso belongs to S ; • if the data comes from normal distribution N( m, Σ) , then meanof d elements comes from N (cid:0) m, d Σ (cid:1) , • if the data has heavy tails (i.e. comes from Cauchy distribution),then the distribution of mean for d elements set can be close to theoriginal dataset mean. In this section we propose the WICA algorithm for nonlinear ICAdecomposition which exploits the wii ( X ) index in practice.Following (Brakel and Bengio, 2017), we use an auto-encoder (AE) architecture, which consists of an encoder function E : R d → Z and a complementary decoder function D : Z → R d .The role of the encoder is to learn a transformation of the data thatunmixes the latent components, utilizing some measure of indepen-dence (we use the wii ( X ) index). The decoder is responsible for lim-iting the encoder, so that the learned representation does not lose anyinformation about the input. In practice, this is implemented by si-multaneously minimizing the reconstruction error: rec _ error( X ; E , D ) = d (cid:88) i =1 (cid:107) x i − D ( E x i ) (cid:107) . Reducing the difference between the input and the output is crucialto recover unmixing mapping close to inverse of the mixing one.Thus our ﬁnal cost function is given by cost( X ; E , D ) = rec _ error( X ; E , D ) + β wii ( E X ) . (3)where β is a hyperparameter which aims to weight the role of recon-struction with that of independence (analogous to β -VAE (Higginset al., 2017)). The training procedure follows the steps: Algorithm 1

WICA1. Take mini-batch X (cid:48) from the dataset X .2. Normalize componentwise E X (cid:48) , to obtain Y

3. Compute p , . . . , p d , where p i is the mean of randomly cho-sen d elements from Y ,4. Minimize: rec _ error( X (cid:48) ; E , D ) + β wii ( Y ; p , . . . , p d ) . Let us start with a discussion of possible deﬁnitions of the nonlin-ear mixing function used for benchmarking the ICA methods. Inthe beginning we shortly explain some approaches used in the lin-ear ICA, and then move forward to propose a mixing which beneﬁtsfrom properties desired in the comparison of the results obtained bynonlinear ICA algorithms. In the case of linear ICA the experiments are usually conveyed onan artiﬁcial dataset, which is obtained by mixing two or more of in-dependent source signals. This allows for the comparison of the re-sults returned by the analyzed methods with the original indepen-dent components. In the real-world applications such a procedure isof course infeasible, but in experimental setting it provides a goodbasis for benchmarking different models. In classical ICA setup, cre-ating an artiﬁcial mixing function is equivalent to selecting a randominvertible matrix A , such that X = A · S , where S are the true sourcesand X are the observations, which are then passed to the evaluatedmethods. Such mixing is used by (Bedychaj et al., 2019; Hyvärinen,1999; Spurek et al., 2017).Unfortunately, there do not exist any mixing standards for the non-linear ICA problem. A common setup of the comparable environ-ments needed to test the nonlinear models of ICA is to interlacelinear mixes of signals with nonlinear functions (Almeida, 2003;Brakel and Bengio, 2017). During our experiments we found thatthe proposed methods of nonlinear mixes are ineffective in large di-mensions. The aforementioned approaches usually apply only a shal-low stack of linear projections followed by a nonlinearity. In conse-quence, the obtained observations are either close to the linear mix-ing (and therefore not hard enough to be properly challenging for thelinear models) or become degenerate (i.e. all points cluster towardszero). Results of such mixing techniques are presented on Fig. 3. (a) PNL - Itera-tion 1 (b) PNL - Itera-tion 3 (c) MLP - Itera-tion 1 (d) MLP - Itera-tion 3 Figure 3.

Results of the nonlinear mixing techniques proposed in (Brakeland Bengio, 2017) on a normalized synthetic lattice data. Post nonlinear mix-ing model (PNL) introduced only slight nonlinearities, which are not hardenough to solve even for the linear algorithms. On the other hand, the multi-layer perceptron mixing (MLP) technique collapses after just couple of itera-tions.(a) Iteration 0 (b) Iteration 10 (c) Iteration 20 (d) Iteration 30(e) Iteration 40 (f) Iteration 50 (g) Iteration 60 (h) Iteration 70

Figure 4.

Results of our proposition of mixing over normalized syntheticlattice data. One may observe that after multiple iterations of the proposedmixing, results become highly nonlinear but not degenerate into any obscuresolutions known from previous setup.

Because of aforementioned disadvantages we propose our ownixing, inspired by (Kingma and Dhariwal, 2018; Dinh et al., 2014)network architecture. Let S be a sample of vectors with indepen-dent components. We apply a random isometry on S , by taking X = (cid:0) UV T (cid:1) S , where UV T comes from the Singular Value De-composition on a random matrix A ij ∼ N(0 , . Next we split X ∈ R d into half ( x i , x j ) → ( x i , x j + φ ( x i )) , similarly as it was done in (Kingma and Dhariwal, 2018). Function φ is a randomly initialized neural network with two hidden layers and tanh activations after each of them. This approach can be iteratedover multiple times to achieve the desired level of nonlinear mixing.Mixing procedure can be described in an algorithmic way: Algorithm 2

Nonlinear mixingTake dataset S .1. Take random isometry:(a) Take A from N(0 , I ) , such that a ij ∼ N(0 , (b) Take SVD of A , such that A = U Σ V T (c) Return UV T

2. Take X = (cid:0) UV T (cid:1) S

3. Split X ∈ R D in half: ( x i , x j ) → ( x i , x j + φ ( x i )) where φ is a randomly initialized neural network and x i , x j comefrom the split of X into half.4. Return X = AX One can easily increase the number of mixesand interlude splits of X in reverse order sothat ( x i , x j ) → ( x i + φ ( x j ) , x j ) for even and ( x i , x j ) → ( x i , x j + φ ( x i )) for odd iterate. The effects ofapplying the proposed mixing to two-dimensional data are presentedin Fig. 4.Our mixing procedure scales well in higher dimensions by iter-ating over the splits in R d . Additionally, it is also easily invertible,therefore there is a guarantee that the source components may be re-trieved. For the benchmark experiments we want to be able to measurethe similarity between the obtained results Z and the originalsources S . In the case of linear mixing the common choice is themaximum absolute correlation over all possible permutations of thesignals (denoted hereafter as max _ corr (Hyvarinen and Morioka,2016; Hyvärinen and Morioka, 2017; Hyvärinen et al., 2019; Spureket al., 2020; Zheng et al., 2007; Bengio et al., 2013; Hyvärinen,1999)).However, this measure is based on the Pearson’s correlation coef-ﬁcient and therefore is not able to catch any high order dependen-cies. To address this problem we introduce a new measure basedon the nonlinear Spearman’s rank correlation coefﬁcient and optimaltransport.Let the Z denote the signal retrieved by an ICA algorithm andlet the r s (cid:0) z j , s k (cid:1) be the Spearman’s rank correlation coefﬁcient be-tween the j -th component of Z and k -th component of S . We deﬁne the Spearman’s distance matrix M ( Z, S ) as M ( Z, S ) = (cid:104) − (cid:12)(cid:12)(cid:12) r s ( z j , s k ) (cid:12)(cid:12)(cid:12)(cid:105) j,k =1 , ,...d , where the zero entries indicate a monotonic relationship betweenthe corresponding features.This matrix is then used as the transportation cost of the compo-nents. Formally, we compute the value of the optimal transport prob-lem formulated in terms of integer linear programming:OTS = 1 − I s ( Z, S ) ,I s ( Z, S ) = min γ D (cid:88) j,k γ j,k M ( Z, S ) j,k , subject to: d (cid:88) k γ j,k = A j for all j ∈ { , , . . . , d } , d (cid:88) j γ j,k = A k for all k ∈ { , , . . . , d } ,γ j,k ∈ { , } for all j, k ∈ { , , . . . , d } , where A j = A k = 1 .As a result of the last constraint, the obtained transport plan γ de-ﬁnes a one-to-one map from the retrieved signals to the originalsources. In addition, the proposed Spearman-based measure (OTS)is sensitive to monotonic nonlinear dependencies and also relativelyeasy to compute with the use of existing tools for integer program-ming.Another difference between OTS and max _ corr is that the lat-ter favors stronger disentanglement of few components, while OTSgives lower results for outcomes that decompose the observationmore equally. In other words consider an experiment in which n sig-nals were mixed. Further, assume that some (nonlinear) ICA algo-rithm failed to unmix all but one component (i.e. only one unmixedcomponent matches exactly one source signal, while the rest is stillhighly unrecognizable). In such situation the max _ corr value willbe signiﬁcantly higher than OTS, although only the small portion ofthe base dataset was recovered.In order to empirically demonstrate this property, we artiﬁciallymixed a multidimensional grid using the mixing function from Sec-tion 4. Next, we randomly swapped one of the mixed signals withthe original signal from the base dataset. We compared this mixed-and-swapped data to the source signals using max _ corr and OTS.The results over different mixing iterations are presented on Fig. 5.One may observe that max _ corr values are always above the OTSones, suggesting that max _ corr measure prefers such a recoverymore than OTS. Naturally, in the case when all signals are far differ-ent from the true sources, values for max _ corr and OTS are almostexactly the same (see Fig. 6).In consequence, the max _ corr measure can help to asses themaximum of informativeness from the retrieved signal. This can bedesired in situations that favor well decomposition of few compo-nents at the cost of lower correlatedness of the remaining ones (whichmay happen, for instance, in denoising problems). In the case whereapproximately equal recovery of all the signals is requested, the OTSmeasure would be a better choice. dimensions 4 dimensions6 dimensions 8 dimensions Figure 5.

Results of the experiment where in n –dimensional mixed observation one component was swapped with a randomly chosen source signal. One mayobserve that max _ corr almost always prefers such situation, while OTS seems to be more rigorous. Figure 6.

Results for the OTS and max _ corr values for fully mixed dataset. One may observe that both measures in this case give similar outcomes. In this section we show several simulated experiments to validate theWICA algorithm empirically. Because there is no clear benchmarkdeﬁnition for the nonlinear ICA evaluation, we have selected most ﬁgurative and easily interpretable setup which we present in the fol-lowing subsections. In addition, we performed the analysis of elec-troencephalographic (EEG) signal according to procedure presentedin (isha SunLISHA SUN et al., 2005; Onton and Makeig, 2006), to r i g i n a l M i x e d F a s t I C AAN I C A P N L M I S E P d C o r W I C A Figure 7.

Two dimensional example of the problem of unmixing naturalimages. One can easily spot that WICA has the smallest amount of artifactsremained after retrieving the signals. All of the scatter plots were normalizedand are presented in the same scale. It is valuable to also look at the attachedmarginal histograms, where some of the similarities between the original sig-nal and its retrieved counterpart may be observed. validate our method in more natural setting, that is, without artiﬁ-cially generated mixing and access to true source components.

We start from the simulated example of the ICA application in thecase of images separation problem. We use this regime because theresults can be understand with the naked eye of a reader.To construct this experiment one needs to apply some artiﬁcialmixing function (i.e. linear transformation or mixing function fromSection 4) on the independent source signals. Such mixture is thenpassed to the ICA model in question to perform the unmixing task.In order to compare the WICA algorithm to other nonlinear ICAapproaches we evaluated the models performance in the case ofseparation of artiﬁcially mixed images. As an initial setup for thisblind source separation task, we randomly sampled two ﬂattened im-ages from the Berkeley Segmentation Dataset (Martin et al., 2001) and mixed them using the function deﬁned in Section 4. We com-pared the proposed our method with dCor (Spurek et al., 2020),PNLMISEP (Zheng et al., 2007), ANICA (Bengio et al., 2013) andlinear FastICA. Results of this toy example are presented on Fig. 7. available at Besides the retrieved images and their scatter plots, we alsodemonstrated projection of marginal densities. The desired goal is toachieve similar images and marginal densities as in the source (orig-inal) pictures.One can easily spot that FastICA and dCor seem to only rotatethe mixed signals. The ANICA, on the other hand, transformed theobservations to a high extent, but the recovered signals are visu-ally worse than the original pictures. Similarly to previous algorithm,PNLMISEP and WICA also performed some nontrivial shift on themarginal densities, but in this case the retrieved densities resemblethe original ones more naturally.This experiment was fully qualitative and the outcome is subjectto one’s individual perception. We demonstrated the images purelyas a visualization of the different ICA models performance in simplenonlinear setup. We report quantitative results in the next subsection.

Figure 8.

The mean rank results for different mixes measured by max _ corr (top) and OTS (bottom). The lower the better. From the preliminary results reported in previous subsection, wemoved to a more complex scenario in which we quantitative eval-uated the ICA methods in a higher dimensional setup.We uniformly sampled d ﬂattened images from the Berkeley Seg-mentation Dataset (Martin et al., 2001) to form the source compo-nents. We used ﬁve different source dimensions d ∈ { , , , , } .The observations were then obtained by using the function describedin Section 4, applied iteratively i ∈ { , , , , } times. Foreach dimension d we randomly picked different sets of source im-ages. Every method was evaluated times on each set of sources,dimensions and mixes.We ﬁt each nonlinear algorithm using the grid search overthe learning rate. For the auto-encoder based models we also per-formed a grid search over the scaling of the independence measure.Adjustment of these hyper-parameters was done on randomly sam-pled observations from the set of all obtained mixtures. Examplesused to tune the architectures, were then excluded from the dataseton which we performed the actual evaluation. It is worth to men-tion that we had to ﬁx batch size to , because any bigger valuecaused instabilities in the ANICA results. To be fair in comparisons,we set the same neural net architecture for WICA, ANICA and dCor.Both the encoder and the decoder were composed of 3 hidden lay-ers with neurons each. In the case of MISEP we used the PNLversion from (Zheng et al., 2007). The outcomes from each methodwere measured both by max _ corr and OTS against the true sourcecomponents. a) Original EEG signals (b) Retrieved by WICA (c) Retrieved by FastICA Figure 9.

Results of analysis done on the EEG signals. After the deletion of a suspicious signals selected by an expert from the decomposition, one can easilyspot that the reconstructed components are more homogeneous, and do not have as much artifacts as the original EEG data. In both methods the same amount ofsignals was cleared. The results are satisfying in either of the cases. Additionally WICA persist scale of the retrieved signals, which is helpful property in furthercleansing of the EEG data.

Figure 10.

Comparison between standard ICA methods (PNLMISEP, dCor, ANICA, FastICA) and our approach by using OTS (left) and max _ corr (right)measures in the setup where mixing iterations were performed. In the experiment we train ﬁve models and present the mean and standard deviation of eachof the used measures (the higher the better). One can observe that WICA consistently obtains good results for all of the dimensions and outperforms the othermethods in higher dimensions. Moreover, it has the lowest standard deviation across all the nonlinear algorithms. More numerical results of the experiment arepresented in Table 1. Performance across different dimensions.

We plotted the resultsof this experiment on mixes with respect to the data dimension d in Fig. 10. The outcomes demonstrated that the WICA methodoutperformed any other nonlinear algorithm in the proposed task byachieving high and stable results regardless of the considered datadimension. In the case of the results stability, WICA losses only tothe linear method – FastICA – which, unfortunately, cannot satis-factorily factorize nonlinear data. This experiment demonstrated thatWICA is a strong competitor to other models in a fully unsupervisedenvironment for nonlinear ICA.It is also worth to mention the difference between the results mea-sured by OTS and max _ corr for the ANICA and FastICA modelsapplied in high dimension. We hypothesize that this may indicatethat those algorithms were able to retrieve very well only small sub-set of the components, while the remaining variables were still highly We considered the setting with mixing iterations as the hardest one. mixed, leading to a similar effect as the one described in Section 5. Performance across different mixes.

For every model we evalu-ated the mean OTS and max _ corr score on a given dimension d andnumber of mixing iterations i . Then, for each pair ( d, i ) we rankedthe tested models based on their performance. We report the meanrank of models for each mixing iteration i in Fig. 8 (the lower thebetter).One may observed that for tasks relatively similar to the linearcase, where number of mixes is equal to , the PNLMISEP methodperforms the best both on max _ corr and OTS. However, as the num-ber of mixes increases, the WICA algorithm usually outperforms allthe other methods in both measures, achieving the lowest mean rank.As a complement to the above discussion we also provide the com-plete numerical results for all mixtures on all tested dimensions inTable 1. easure Dim Mixes WICA FastICA ANICA dCor PNLMISEP max _ corr ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 1.

Comparison between nonlinear ICA methods (PNLMISEP, dCor, ANICA, WICA) and the classical linear ICA approach (FastICA) on imagesseparation problem (with different dimensions) by using max _ corr and OTS measures. In the experiment we tuned and trained four models (excludingFastICA, which is a linear model) and present mean and standard deviation in the tabular form. Finally we want to show usability of the WICA method on real lifedata. An example of a task that can be tackle by the ICA algorithmsis electroencephalogram (EEG) decomposition.An EEG signal is a test used to evaluate the electrical activity in the brain. The brain cells communicate via electrical impulses andare active all the time. In the original scalp channel data, each row ofthe data recording matrix represents the time course of summed volt-age differences between source projections to one data channel andone or more reference channels. We followed a common experimentframework proposed in (isha SunLISHA SUN et al., 2005; Onton andakeig, 2006), to detect artifacts in unmixed signals representationwhich can suggest a blinks or an eye movement during the test.The setup for this decomposition is different than in previous sec-tions. An original EEG mixture took for this experiment, consistedof scalp electrode signals. Those signals were selected as an in-put for the WICA model. Retrieved data were analysed by an expert,who selected signs of a blinking on recovered components. Manuallyselected subset of suspicious components, were then nulliﬁed. Un-mixed signal with masked (by nulliﬁcation) components were thenfeed back to the decoder which came from the training of the WICAmodel.As a researcher we are not aware how deeply EEG signals aremixed or dependent. The crucial functionality that ICA serves in thissetting is normalizing and cleansing of the dataset. From that point,time series produced from recovered signals have to be analysed byan expert. In this experiment we want to prove that high dimensionof the input data and the unknown entanglement of the componentsis not a limitation for the WICA. Visual results of this experimentare presented on Fig 9. For a comparison we used results from otherstandard ICA algorithm used for this kind of a task – linear FastICA.The details of "remixing" process for this method are descirbed in(isha SunLISHA SUN et al., 2005). This experiment showed thatWICA is able to handle multidimensional data highly above the vol-ume tested for other nonlinear models. Moreover, results our methodfor this task works well enough to be used as a preliminary step ofcleaning the data. In this paper we presented a new approach to the nonlinear ICA task.In addition to the investigation of WICA method, which proves tobe matching the results of all other tested nonlinear algorithms, weproposed a new mixing function for validating nonlinear tasks in astructurized manner. Our mixing scales to higher dimensions and iseasily invertible.Lastly, we deﬁned OTS, a measure that can catch nonlinear depen-dence and is easy to compute. The OTS measure and the proposedmixing have the potential to become benchmarking tools for all fu-ture work in this ﬁeld.

The work of P. Spurek was supported by the National Centre of Sci-ence (Poland) Grant No. 2019/33/B/ST6/00894. The work of J. Ta-bor was supported by the National Centre of Science (Poland) GrantNo. 2017/25/B/ST6/01271. A. Nowak carried out this work withinthe research project "Bio-inspired artiﬁcial neural networks" (grantno. POIR.04.04.00-00-14DE/18-00) within the Team-Net programof the Foundation for Polish Science co-ﬁnanced by the EuropeanUnion under the European Regional Development Fund.

REFERENCES

Almeida, L. B. (2003), ‘Misep–linear and nonlinear ica basedon mutual information’,

Journal of Machine Learning Research (Dec), 1297–1318.Bedychaj, A., Spurek, P., Struski, Ł. and Tabor, J. (2019), ‘Indepen-dent component analysis based on multiple data-weighting’, arXivpreprint arXiv:1906.00028 .Bell, A. J. and Sejnowski, T. J. (1995), ‘An information-maximization approach to blind separation and blind deconvolu-tion’, Neural computation (6), 1129–1159. Bengio, Y., Courville, A. and Vincent, P. (2013), ‘Representationlearning: A review and new perspectives’, IEEE transactions onpattern analysis and machine intelligence (8), 1798–1828.Brakel, P. and Bengio, Y. (2017), ‘Learning independent fea-tures with adversarial nets for non-linear ica’, arXiv preprintarXiv:1710.05050 .Cai, D., He, X. and Han, J. (2005), ‘Document clustering using lo-cality preserving indexing’, IEEE Transactions on Knowledge andData Engineering (12), 1624–1637.Cai, D., He, X. and Han, J. (2010), ‘Locally consistent concept fac-torization for document clustering’, IEEE Transactions on Knowl-edge and Data Engineering (6), 902–913.Dinh, L., Krueger, D. and Bengio, Y. (2014), ‘Nice: Non-linear inde-pendent components estimation’, arXiv preprint arXiv:1410.8516 .Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick,M. M., Mohamed, S. and Lerchner, A. (2017), beta-vae: Learningbasic visual concepts with a constrained variational framework, in ‘ICLR’.Hyvärinen, A. (1999), ‘Fast and robust ﬁxed-point algorithms for in-dependent component analysis’, Neural Networks, IEEE Transac-tions on (3), 626–634.Hyvarinen, A. and Morioka, H. (2016), Unsupervised feature extrac-tion by time-contrastive learning and nonlinear ica, in ‘Advancesin Neural Information Processing Systems’, pp. 3765–3773.Hyvärinen, A. and Morioka, H. (2017), Nonlinear ica of tempo-rally dependent stationary sources, in ‘International Conferenceon Artiﬁcial Intelligence and Statistics’, Microtome Publishing,pp. 460–469.Hyvärinen, A. and Pajunen, P. (1999), ‘Nonlinear independent com-ponent analysis: Existence and uniqueness results’, Neural Net-works (3), 429–439.Hyvärinen, A., Sasaki, H. and Turner, R. E. (2019), Nonlinear icausing auxiliary variables and generalized contrastive learning, in ‘The 22nd International Conference on Artiﬁcial Intelligence andStatistics’, Journal of Machine Learning Research, pp. 859–868.Khemakhem, I., Kingma, D. P. and Hyvärinen, A. (2019), ‘Vari-ational autoencoders and nonlinear ica: A unifying framework’, arXiv preprint arXiv:1907.04809 .Kingma, D. P. and Dhariwal, P. (2018), Glow: Generative ﬂow withinvertible 1x1 convolutions, in ‘Advances in Neural InformationProcessing Systems’, pp. 10215–10224.Larson, L. E. (1998), ‘Radio frequency integrated circuit technologyfor low-power wireless communications’, IEEE Personal Commu-nications (3), 11–19.Le, Q. V., Karpenko, A., Ngiam, J. and Ng, A. Y. (2011), Ica withreconstruction cost for efﬁcient overcomplete feature learning, in ‘Advances in Neural Information Processing Systems’, pp. 1017–1025.Lisha Sun, Ying Liu and Beadle, P. J. (2005), Independent compo-nent analysis of eeg signals, in ‘Proceedings of 2005 IEEE Interna-tional Workshop on VLSI Design and Video Technology, 2005.’,pp. 219–222.Martin, D., Fowlkes, C., Tal, D. and Malik, J. (2001), A database ofhuman segmented natural images and its application to evaluat-ing segmentation algorithms and measuring ecological statistics, in ‘Proc. 8th Int’l Conf. Computer Vision’, Vol. 2, pp. 416–423.Matteson, D. S. and Tsay, R. S. (2017), ‘Independent componentanalysis via distance covariance’, Journal of the American Sta-tistical Association pp. 1–16.Onton, J. and Makeig, S. (2006), Information-based modeling ofvent-related brain dynamics, in ‘Progress in Brain Research’, El-sevier, pp. 99–120.Spurek, P., Nowak, A., Tabor, J., Maziarka, Ł. and Jastrz˛ebski, S.(2020), Non-linear ica based on cramer-wold metric, in ‘Interna-tional Conference on Neural Information Processing’, Springer,pp. 294–305.Spurek, P., Tabor, J., Rola, P. and Ociepka, M. (2017), ‘Ica based onasymmetry’, Pattern Recognition , 230–244.Székely, G. J., Rizzo, M. L., Bakirov, N. K. et al. (2007), ‘Measuringand testing dependence by correlation of distances’, The annals ofstatistics (6), 2769–2794.Zheng, C.-H., Huang, D.-S., Li, K., Irwin, G. and Sun, Z.-L. (2007),‘Misep method for postnonlinear blind source separation’, Neuralcomputation , 2557–78.Ziehe, A., Muller, K.-R., Nolte, G., Mackert, B.-M. and Curio,G. (2000), ‘Artifact reduction in magnetoneurography based ontime-delayed second-order correlations’, IEEE Transactions onbiomedical Engineering47