Independent Component Analysis Over Galois Fields
Arie Yeredor, Senior Member, IEEE

The author is with the Department of Electrical Engineering - Systems, Tel-Aviv University, Tel-Aviv, Israel. Some part of this work was presented at ICA'07 [1].
Abstract
We consider the framework of Independent Component Analysis (ICA) for the case where the independent sources and their linear mixtures all reside in a Galois field of prime order P. Similarities and differences from the classical ICA framework (over the Real field) are explored. We show that a necessary and sufficient identifiability condition is that none of the sources should have a Uniform distribution. We also show that pairwise independence of the mixtures implies their full mutual independence (namely a non-mixing condition) in the binary (P = 2) and ternary (P = 3) cases, but not necessarily in higher order (P > 3) cases. We propose two different iterative separation (or identification) algorithms: One is based on sequential identification of the smallest-entropy linear combinations of the mixtures, and is shown to be equivariant with respect to the mixing matrix; The other is based on sequential minimization of the pairwise mutual information measures. We provide some basic performance analysis for the binary (P = 2) case, supplemented by simulation results for higher orders, demonstrating advantages and disadvantages of the proposed separation approaches.

I. INTRODUCTION
Independent Component Analysis (ICA, see, e.g., [2], [3], [4] for some of the fundamental principles) addresses the recovery of unobserved, statistically independent source signals from their observed linear (and invertible) mixtures, without prior knowledge of the mixing matrix or of the sources' statistics. Classically, the ICA framework assumes that the sources and the mixing (hence, also the observations) are defined over the field of real-valued numbers R, with some exceptions (e.g., [5]) that assume the field of complex-valued numbers C. It might be interesting, though, at least from a theoretical point of view, to explore the applicability of ICA principles in other algebraic fields.

In this work we consider ICA over Galois fields of prime order P, denoted GF(P), such that the sources and the mixing-matrix' elements can all take only a finite number of values, defined by the set {0, 1, ..., P − 1} (or by some offset, isomorphic version thereof), and where addition and multiplication are applied modulo P, thereby returning values in the same set.

For example, in the field GF(2) of binary numbers {0, 1}, addition is obviously equivalent to the "Exclusive Or" (XOR) operation, denoted z = x ⊕ y (where z equals 0 if x = y and equals 1 otherwise). Multiplication (either by 0 or by 1) is defined in the "usual" way in this case. In the field GF(3) of ternary numbers {0, 1, 2}, where addition and multiplication are defined modulo 3 (addition similarly denoted z = x ⊕ y), it is sometimes more convenient to consider the offset group {0, 1, −1}. In this group, multiplication can still be defined in the "usual" way, since ordinary multiplication of any two numbers in this group returns a number in the group. Obviously, the two sets {0, 1, 2} and {0, 1, −1} are isomorphic in GF(3), and will be used interchangeably in the sequel.

A fundamental difference, at least in the context of ICA, between random variables over R and over GF(P) is the following: Let u and v be two statistically independent, non-degenerate (namely, non-deterministic) random variables, and consider the random variable w, given by any non-trivial linear combination of u and v. In R, v and w cannot be statistically independent (they are obviously correlated), no matter how u and v are distributed. However, as we shall show in Section III, in GF(P) v and w may indeed be statistically independent, and this happens if and only if the distribution of u is uniform (taking each of the P values with equal probabilities).

In a sense, this property tags the uniform distribution as the "problematic" distribution in ICA over GF(P), reminiscent of the role taken by the Gaussian distribution in ICA over R. Note that these two distributions share additional related properties in their respective fields: They are both (under mild regularity conditions) limit-distributions of an infinite sum of independent random variables; and they are both "maximum entropy" distributions (subject to a variance constraint for the Gaussian distribution in R). So, loosely stated, in the same way that a linear combination of independent random variables over R tends to be "more Gaussian", a linear combination of independent random variables over GF(P) tends to be "more uniform".

Nevertheless, there still remain some essential differences between the roles of these distributions in the respective contexts.
For example, in GF(P), if (at least) one of the random variables in a linear combination of independent variables is uniform, the resulting distribution would be exactly uniform as well, no matter how the other random variables are distributed. Evidently, this property does not hold for Gaussian distributions over R.

Therefore, as we shall show, these properties lead to an identifiability condition for ICA over GF(P), which is reminiscent of, but certainly not equivalent to, a well-known identifiability condition over R. More specifically, the identifiability condition for ICA over R requires that not more than one of the sources be Gaussian. Our identifiability condition for ICA over GF(P) requires that none of the sources be uniform. The key to this identifiability condition is the property that the entropy of any linear combination of statistically independent random variables over GF(P) is larger than the entropy of the largest-entropy component, as long as this component is not uniform. Therefore, if none of the sources is uniform, then, at least conceptually, a possible separation approach is to look for the (inverse) linear transformation which minimizes the empirical marginal entropies of the resulting linear combinations. However, since an exhaustive search for this transformation would often be prohibitively computationally expensive, we shall propose an alternative, computationally cheaper method for entropy-based identification.

Another possible, somewhat different separation approach is the following. One of the key observations in ICA over R is that, under the identifiability condition and due to the Darmois-Skitovitch theorem (e.g., [6], p. 218), pairwise independence of the mixtures implies their full mutual independence, which in turn implies a non-mixing condition (namely, separation). Interestingly, we shall show that our general identifiability condition is necessary and sufficient to guarantee a similar property for ICA over GF(2) and GF(3), but is generally insufficient for this property to hold in GF(P) for P > 3. Thus, another possible identification approach (in GF(2) and in GF(3) only) is to look for an invertible linear transformation of the observations which makes the resulting signals "as empirically pairwise-independent as possible" - a property which is easier to quantify and measure than full independence (being quadratic, rather than exponential, in the number of sources K). Again, since an exhaustive search is often not feasible, we shall propose a different, sequential method for this approach.

A common assumption in the design and analysis of classical ICA methods over R is that each of the sources has an independent, identically distributed (iid) time-structure. Our discussion in this paper would be similarly restricted along the same line. We note, however, that in equivalence to methods which exploit possibly different temporal structures (e.g., spectral diversity [7], non-stationarity [8], etc.) over R, similar extensions of our results would be possible in similar cases over GF(P). However, we defer the exploration of such cases to future work.

The paper is structured as follows. In the next section we review some fundamental properties of random variables and random vectors in GF(P), which will be useful in subsequent derivations. In Section III we outline the problem formulation and present our general identifiability condition.
In Section IV we explore the relation between pairwise independence and full independence, showing that in an invertible linear mixture, the former implies the latter in GF(2) and in GF(3), but not necessarily in Galois fields of higher orders. We then proceed to propose two different separation algorithms in Section V. A rudimentary performance analysis for the simple binary case (P = 2) is provided in Section VI, supplemented with supporting simulation results which extend to larger-scale scenarios. Our work is summarized with concluding remarks in Section VII.

We shall denote addition, subtraction and multiplication over GF(P) (namely, modulo P) by ⊕, ⊖ and ⊗, respectively, with multiplication preceding addition and subtraction in the order of operations. Vector multiplication will be denoted by ∘, such that if a = [a_1 ··· a_K]^T and x = [x_1 ··· x_K]^T, then
a^T ∘ x = a_1 ⊗ x_1 ⊕ a_2 ⊗ x_2 ⊕ ··· ⊕ a_K ⊗ x_K. (1)
Similarly, if A is an L × K matrix in GF(P), its product with x is denoted A ∘ x, an L × 1 vector whose elements are the products of the respective rows of A with x.

II. CHARACTERIZATION OF RANDOM VARIABLES AND RANDOM VECTORS IN GF(P)

We begin by briefly outlining some of the basic essential properties of, and definitions of our notations for, random variables and random vectors in GF(P), which we shall use in the sequel.

A random variable u in GF(P) is characterized by a discrete probability distribution, fully described by a vector p_u = [p_u(0) p_u(1) ··· p_u(P − 1)]^T ∈ R^P, whose elements p_u(m) are Pr{u = m}, the probabilities of u taking the values m ∈ {0, ..., P − 1}. Evidently, all the elements of p_u are non-negative and their sum equals 1. We shall refer to p_u as the probability vector of u. The entropy of u is given by
H(u) = −Σ_{m=0}^{P−1} p_u(m) log p_u(m). (2)
By maximizing with respect to p_u, it is easy to show that among all random variables in GF(P), the uniform random variable (taking all values in GF(P) with equal probability 1/P) has the largest entropy, given by log P. Note that it is convenient to use a base-P logarithm log_P (rather than the more commonly-used log_2) in this context, such that the entropies of all (scalar) random variables in GF(P) are confined to [0, 1]. Note, in addition, that since multiplication by a nonzero constant over GF(P) is bijective, the entropy of a random variable in GF(P) is invariant under such multiplication (which merely re-arranges the terms in the sum (2)).

The characteristic vector of u is denoted p̃_u = [p̃_u(0) p̃_u(1) ··· p̃_u(P − 1)]^T ∈ C^P, and its elements are given by the discrete Fourier transform (DFT) of the elements of p_u:
p̃_u(n) = E[W_P^{nu}] = Σ_{m=0}^{P−1} p_u(m) W_P^{mn}, n = 0, ..., P − 1, (3)
where the "twiddle factor" W_P is defined as W_P = e^{−j2π/P} (note that the modulo-P operation is inherently present in the exponential part, so W_P^{mn} is equivalent to W_P^{m⊗n}).
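As a concrete illustration of these definitions, the following Python sketch (ours, not part of the paper; the function names are our own) implements the modulo-P vector product (1), the base-P entropy (2), and the characteristic vector (3) via a DFT.

```python
import numpy as np

def gf_dot(a, x, P):
    """Vector product a^T ∘ x over GF(P), cf. (1)."""
    return int(np.dot(a, x) % P)

def entropy_baseP(p, P):
    """Entropy (2) with a base-P logarithm, so that H ∈ [0, 1]."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p) / np.log(P)))

def char_vector(p):
    """Characteristic vector (3): the DFT of the probability vector.
    numpy's FFT uses W_P = exp(-2j*pi/P), matching the paper's convention."""
    return np.fft.fft(p)

# Example: a non-uniform random variable over GF(5)
p_u = np.array([0.32, 0.10, 0.24, 0.24, 0.10])
print(entropy_baseP(p_u, 5))          # < 1, since u is not uniform
print(np.round(char_vector(p_u), 3))  # p̃_u(0) = 1, conjugate-symmetric
```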
Like the probability vector p_u, the characteristic vector p̃_u provides a full statistical characterization of the random variable u, since p_u can be directly obtained from p̃_u using the inverse DFT. The following basic properties of p̃_u can be easily obtained:
P1) p̃_u(0) = 1;
P2) Since p_u is real-valued, p̃_u(n) = p̃*_u(P − n) (where the superscript * denotes the complex conjugate);
P3) u is uniform (namely, p_u(m) = 1/P ∀m) ⇔ p̃_u(n) = 0 ∀n ≠ 0;
P4) u is degenerate (namely, p_u(M) = 1 for some M) ⇔ p̃_u(n) = W_P^{nM} ∀n;
P5) |p̃_u(n)| ≤ 1 ∀n, where for n ≠ 0 equality holds if and only if (iff) u is degenerate.
Note that in the particular cases of GF(2) and GF(3) we have the following simplifications:
• In GF(2), the only free parameter in p̃_u ∈ R² is p̃_u(1), to which we shall refer as
θ_u := p̃_u(1) = p_u(0) − p_u(1) = 1 − 2p_u(1). (4)
Thus p̃_u = [1 θ_u]^T;
• In GF(3), there is also a single (yet complex-valued) free parameter in p̃_u ∈ C³, to which we shall refer as
ξ_u := p̃_u(1) = p_u(0) + p_u(1)W_3 + p_u(2)W_3² = 1 − (3/2)(p_u(1) + p_u(2)) + j(√3/2)(p_u(2) − p_u(1)). (5)
Thus p̃_u = [1 ξ_u ξ*_u]^T.
Note also that θ_u = E[W_2^u] = E[(−1)^u] and ξ_u = E[W_3^u].

For two random variables u and v in GF(P), the joint statistics are completely described by the joint probabilities matrix P_{u,v} ∈ R^{P×P}, whose elements are P_{u,v}(m,n) = Pr{u = m, v = n}, m, n ∈ {0, ..., P − 1}. The joint entropy of u and v is given by
H(u,v) = −Σ_{m,n=0}^{P−1} P_{u,v}(m,n) log P_{u,v}(m,n). (6)
The random variables u and v are said to be statistically independent iff P_{u,v} = p_u p_v^T. By Jensen's inequality, H(u,v) satisfies H(u,v) ≤ H(u) + H(v), with equality iff u and v are statistically independent. The mutual information between u and v is the difference I(u,v) := H(u) + H(v) − H(u,v), which is also the (non-negative) Kullback-Leibler divergence between their joint distribution and the product of their marginal distributions. The smaller their mutual information, the "more statistically independent" u and v are; I(u,v) vanishes if and only if u and v are statistically independent.

The conditional distribution of u given v is given by P_{u|v} ∈ R^{P×P} with elements P_{u|v}(m,n) = P_{u,v}(m,n)/p_v(n) = Pr{u = m | v = n}, m, n = 0, ..., P − 1. The conditional entropy is defined as
H(u|v) = −Σ_{n=0}^{P−1} p_v(n) Σ_{m=0}^{P−1} P_{u|v}(m,n) log P_{u|v}(m,n), (7)
which can be easily shown to satisfy H(u|v) = H(u,v) − H(v).

The joint characteristic matrix of u and v, denoted P̃_{u,v} ∈ C^{P×P}, is given by the two-dimensional DFT (2DFT) of P_{u,v},
P̃_{u,v}(m,n) = E[W_P^{mu+nv}] = Σ_{k,ℓ=0}^{P−1} P_{u,v}(k,ℓ) W_P^{mk+nℓ}, (8)
and provides an alternative full statistical characterization of u and v. In particular, it is straightforward to show that P̃_{u,v} satisfies P̃_{u,v} = p̃_u p̃_v^T iff u and v are statistically independent.

For a K × 1 random vector u whose elements u_1, ..., u_K are random variables in GF(P), the joint statistics are fully characterized by the K-way probabilities tensor P_u ∈ R^{P×P×···×P} (K dimensions), whose elements are the probabilities P_u(m_1, ..., m_K) = Pr{u_1 = m_1, ..., u_K = m_K}, m_1, ..., m_K ∈ {0, ..., P − 1}. Using vector-index notations, where m = [m_1, ..., m_K]^T, we may also express this relation more compactly as P_u(m) = Pr{u = m}.
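The mutual information I(u,v) is the workhorse of the pairwise approach developed later in the paper; the small sketch below (ours, for illustration) computes it directly from the joint probabilities matrix.

```python
import numpy as np

def mutual_information(Puv, P):
    """I(u,v) = H(u) + H(v) - H(u,v), in base-P units; Puv is the P x P
    joint probabilities matrix with Puv[m, n] = Pr{u = m, v = n}."""
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p) / np.log(P))
    pu = Puv.sum(axis=1)   # marginal distribution of u
    pv = Puv.sum(axis=0)   # marginal distribution of v
    return float(H(pu) + H(pv) - H(Puv.ravel()))

# An independent pair has I(u,v) = 0
pu = np.array([0.7, 0.3]); pv = np.array([0.4, 0.6])
print(mutual_information(np.outer(pu, pv), 2))  # ~0.0
```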
The characteristic tensor P̃_u ∈ C^{P×P×···×P} is given by the K-dimensional DFT of P_u, which, using a similar index-vector notation, is given by
P̃_u(n) = E[W_P^{n^T∘u}] = Σ_m P_u(m) W_P^{n^T m}, (9)
where the summation extends over all possible P^K index combinations in m.

III. PROBLEM FORMULATION AND IDENTIFIABILITY
We are now ready to formulate the mixture model over GF(P). Assume that there are K statistically independent random source signals denoted s[t] = [s_1[t] s_2[t] ··· s_K[t]]^T, each with an iid time-structure, such that at each time-instant t, s_k[t] is an independent realization of a random variable in GF(P), characterized by the (unknown) distribution vector p_k.

Let these sources be mixed (over GF(P)) by an unknown, square (K × K) mixing matrix A (with elements in GF(P)),
x[t] = A ∘ s[t]. (10)
We further assume that A is invertible over the field, namely that it has a unique inverse over GF(P), denoted B := A^{−1}, satisfying B ∘ A = A ∘ B = I, where I denotes the K × K identity matrix. Like in "classical" linear algebra (over R), A is non-singular (invertible) iff its determinant is non-zero.¹ Equivalently, A is singular iff there exists (in GF(P)) a nonzero vector u, such that A ∘ u = 0 (an all-zeros vector).

¹The determinant over GF(P) can be calculated in a similar way to calculating the determinant over R, using the field's addition/subtraction and multiplication operations.

We are interested in the identifiability, possibly up to some tolerable ambiguities, of A (or, equivalently, of its inverse B) from the set of observations x[t], t = 1, 2, ..., T under asymptotic conditions, namely as T → ∞. Due to the assumption of iid samples for each source (implying ergodicity), the joint statistics of the observations can be fully and consistently estimated from the available data. Therefore, the assumption of asymptotic conditions implies full and exact knowledge of the joint probability distribution tensor P_x of the observation vector x (we dropped the time-index t here, due to the stationarity). The remaining question is, therefore, whether, and if so, under what conditions, A can be identified (up to tolerable ambiguities) from exact, full knowledge of P_x.

To answer this question, we first explore some basic statistical properties of linear combinations of random variables over GF(P). The characteristic vectors are particularly useful for this analysis. Let u and v denote two statistically independent random variables in GF(P) with probability vectors p_u and p_v and characteristic vectors p̃_u and p̃_v, respectively. If w = u ⊕ v, then the probability vector p_w of w is given by the cyclic convolution between p_u and p_v, and the characteristic vector p̃_w is therefore given by the element-wise product of p̃_u and p̃_v:
p_w(n) = Σ_{m=0}^{P−1} Pr{u = m, v = n ⊖ m} = Σ_{m=0}^{P−1} p_u(m) p_v(n ⊖ m) ⇔ p̃_w(n) = p̃_u(n) p̃_v(n) ∀n. (11)
Two intuitively appealing (nearly trivial) properties follow from this relation. First, combined with Property P4 (in Section II), this relation implies that the sum (over GF(P)) of two independent random variables is a degenerate random variable iff both are degenerate. Likewise, combined with Property P3, this relation implies that the sum is uniform if at least one of the variables is uniform. The converse, however, is perhaps somewhat less trivial, since it involves a distinction between GF(2) and GF(3) on one hand, and GF(P) with P > 3 on the other hand, as suggested by the following lemma:
Lemma 1: Let u and v be two statistically independent random variables in GF(P), and let w := u ⊕ v. If both u and v are non-uniform, then:
1) If P = 2 or P = 3, w is also non-uniform;
2) If P > 3, w may or may not be uniform.

Proof:
By Property P3, w would be uniform iff for each n ≠ 0, either p̃_u(n) = 0 or p̃_v(n) = 0 (or both). In GF(2) this can only happen if either θ_u = 0 or θ_v = 0 (or both), which implies that at least one of the two variables is uniform. Likewise, in GF(3) this can only happen if either ξ_u = 0 or ξ_v = 0 (or both), leading to a similar conclusion.

However, for P > 3 there are sufficiently many degrees of freedom in the characteristic vectors of u and v to allow both non-zero and zero elements in both p̃_u and p̃_v, as long as at each n ≠ 0 either one is zero. For example, consider P = 5 with p̃_u = [1 0 0.3 0.3 0]^T and p̃_v = [1 0.4 0 0 0.4]^T. This corresponds to p_u ≈ [0.32 0.10 0.24 0.24 0.10]^T and p_v ≈ [0.36 0.25 0.07 0.07 0.25]^T, which are clearly non-uniform. However, if these u and v are independent, their sum (over GF(5)) is a uniform random variable.

Note, in addition, that since multiplication by a nonzero constant in GF(P) is bijective, uniform or degenerate random variables cannot become non-uniform or non-degenerate (nor vice-versa) by multiplication with a constant. Consequently, the above conclusions and Lemma 1 hold not only for the sum of two random variables, but also for any linear combination (over GF(P)) thereof.
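The GF(5) example is easy to verify numerically; the sketch below (ours) uses relation (11): the characteristic vector of u ⊕ v is the element-wise product of the individual characteristic vectors, which here vanishes at every n ≠ 0, so the sum is uniform by Property P3.

```python
import numpy as np

P = 5
pu_t = np.array([1, 0, 0.3, 0.3, 0])   # characteristic vector of u
pv_t = np.array([1, 0.4, 0, 0, 0.4])   # characteristic vector of v

# Recover the probability vectors via the inverse DFT
p_u = np.real(np.fft.ifft(pu_t))
p_v = np.real(np.fft.ifft(pv_t))
print(np.round(p_u, 2))  # [0.32 0.1  0.24 0.24 0.1 ]  -- non-uniform
print(np.round(p_v, 2))  # [0.36 0.25 0.07 0.07 0.25]  -- non-uniform

# Characteristic vector of w = u ⊕ v, per (11): element-wise product
pw_t = pu_t * pv_t
print(pw_t)                        # [1. 0. 0. 0. 0.]
print(np.real(np.fft.ifft(pw_t)))  # [0.2 0.2 0.2 0.2 0.2] -- uniform
```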
We now add the following lemma.

Lemma 2:
Let u and v be two statistically independent, non-degenerate random variables in GF(P), and let w := u ⊕ v. Then v and w are statistically independent iff u is uniform.

Proof:
The joint probability distribution of v and w is given by
P_{v,w}(m,n) = Pr{v = m, w = n} = Pr{v = m, u = n ⊖ m} = p_v(m) p_u(n ⊖ m). (12)
Now, w and v are independent iff this probability equals p_v(m) p_w(n) for all m, n, namely iff p_u(n ⊖ m) = p_w(n) for all n and for all m with which p_v(m) ≠ 0. Since v is non-degenerate, there are at least two such values of m. Denoting these values as m_1 and m_2, this condition translates into
p_u(n ⊖ m_1) = p_u(n ⊖ m_2) = p_w(n) ∀n. (13)
We therefore also have p_u(n) = p_u(n ⊕ m_1 ⊖ m_2) ∀n, which can be recursively generalized into
p_u(n) = p_u(n ⊕ k ⊗ (m_1 ⊖ m_2)) ∀n, k ∈ GF(P). (14)
Since P is prime, each element in GF(P) can be represented (given n, m_1 and m_2) as n ⊕ k ⊗ (m_1 ⊖ m_2) with some k, therefore this condition is satisfied iff p_u(n) is constant, namely iff u is uniform.

To establish our identifiability condition we need one additional lemma, which characterizes the entropy of a linear combination of random variables in GF(P).

Lemma 3:
Let u and v be two statistically independent, non-degenerate random variables in GF(P), and let w := u ⊕ v. Then H(w) ≥ H(u), where equality holds iff u is uniform.

Proof:
As already mentioned in Section II, H(w,v) ≤ H(w) + H(v), with equality iff w and v are statistically independent. In addition, H(w|v) = H(w,v) − H(v). Therefore, H(w|v) ≤ H(w), with equality iff w and v are statistically independent. Next, from (12) we have P_{w|v}(n,m) = p_u(n ⊖ m), and therefore, as could be intuitively expected,
H(w|v) = −Σ_{m=0}^{P−1} p_v(m) Σ_{n=0}^{P−1} p_u(n ⊖ m) log p_u(n ⊖ m) = Σ_{m=0}^{P−1} p_v(m) H(u) = H(u), (15)
and we therefore conclude that H(u) ≤ H(w), with equality iff w and v are statistically independent. Now, according to Lemma 2, w and v are statistically independent iff u is uniform, which completes the proof.

Obviously, a similar result (namely H(w) ≥ H(v)) can be obtained by switching roles between u and v in the proof. Note an essential difference from a similar result over R: In R the entropy (or differential entropy) of a sum of two independent, non-degenerate random variables is always strictly larger than their individual entropies, no matter how they are distributed. In GF(P), however, equality is attained if one of the variables is uniform. In fact, this equality is inevitable, simply because the entropy of any random variable in GF(P) is upper-bounded by the uniform variable's entropy (of log P).

We are now ready to state our identifiability condition:

Theorem 1:
Let s be a K × 1 random vector whose elements are statistically-independent, non-degenerate random variables in GF(P). Let A be a K × K non-singular matrix in GF(P), and let the random vector x be defined as x = A ∘ s. Assume that the probability distribution of x is fully known (specified by the probabilities tensor P_x). Then A can be identified, up to possible permutation and scaling of its columns, from P_x alone, iff none of the elements of s is a uniform random variable.

Proof:
The necessity of this condition is obvious by Lemma 2. Even in the simplest 2 × 2 case, if one of the sources, say s_1, is uniform, then by Lemma 2 any linear combination of s_1 with the other source s_2 is still statistically independent of s_2. Therefore, if the mixed signals are x_1 = s_1 ⊕ s_2 and x_2 = s_2, then x_1 and x_2 are statistically independent - so this situation is indistinguishable from a non-mixing observation of two independent sources with the same marginal distributions as x_1 and x_2 (which are also the marginal distributions of s_1 and s_2 (resp.) in this case).

To observe the sufficiency of the condition, note first that since A is invertible over GF(P), any invertible linear mixture of the original sources s can be obtained by applying some invertible linear mixing to the observations x. Therefore, by applying all (finite number of) invertible linear transformations to x, one can implicitly obtain all the invertible linear transformations of s. Indeed, let B̂ denote an arbitrary invertible matrix in GF(P), and denote
y := B̂ ∘ x = (B̂ ∘ A) ∘ s. (16)
Since both B̂ and A are non-singular, so is B̂ ∘ A, which therefore:
1) Has at least one non-zero element in each row; and
2) Has at least one non-zero element in each column, which means that each element of s is a component of (namely, participates with nonzero weight in) at least one element of y.
Now define the respective sums of (marginal) entropies, H_mar(y) := Σ_{k=1}^K H(y_k) and H_mar(s) := Σ_{k=1}^K H(s_k). Consequently, by Lemma 3, H_mar(y) cannot be made smaller than H_mar(s). Moreover, if none of the elements of s is uniform, then
H_mar(y) = H_mar(s) ⇔ B̂ ∘ A = Π ∘ Λ, (17)
where Π denotes a K × K permutation matrix and Λ denotes a K × K diagonal, nonsingular matrix in GF(P). Any other form of B̂ ∘ A would imply that at least one of the elements of y is a linear combination of at least two elements of s, and as such has higher entropy than both, and since at least one of these two elements is also present in at least one other element of y, H_mar(y) must be larger than H_mar(s).

It is therefore possible, at least conceptually, to apply each K × K nonsingular matrix B̂ in GF(P) to x, and select one of the minimizers of H_mar(y). The inverse of this minimizer is guaranteed to be equivalent to A up to permutation and scaling,
B̂ ∘ A = Π ∘ Λ ⇔ B̂^{−1} = A ∘ Λ^{−1} ∘ Π^T (18)
(where all the inverses are obviously taken over GF(P)).

Note that in GF(2) the scaling ambiguity is meaningless, because the only possible scalar multiplication is by 1, therefore only the permutation ambiguity remains. In GF(3) the possible scaling ambiguity entails multiplication by either 1 or 2, or, if the "offset group" {0, 1, −1} is used, this ambiguity merely translates into a sign-ambiguity.

Although the number of K × K nonsingular matrices in GF(P) is finite, this number is of the order of P^{K²}, which clearly becomes prohibitively large even with relatively small values of P and K. Therefore, our identifiability proof, which is based on an exhaustive search, can hardly be translated into a practical separation scheme. Nevertheless, in Section V below we shall propose and discuss two practical separation approaches, which require a significantly reduced computational effort. First, however, we need to address one more theoretical aspect of our model - which is: whether (and if so under what conditions) pairwise independence of linear mixtures implies their full mutual independence.
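For very small K and P the exhaustive search in the proof can actually be run; the sketch below (our illustration, not one of the paper's proposed algorithms) enumerates all invertible B̂ over GF(P) and picks a minimizer of the sum of marginal entropies H_mar(y).

```python
import itertools
import numpy as np

def marginal_entropy_sum(B, Px, P, K):
    """H_mar(y) for y = B ∘ x, computed from the probabilities tensor Px."""
    H = 0.0
    for k in range(K):
        p = np.zeros(P)
        for m in itertools.product(range(P), repeat=K):  # all index vectors
            p[np.dot(B[k], m) % P] += Px[m]
        p = p[p > 0]
        H += -np.sum(p * np.log(p) / np.log(P))
    return H

def brute_force_unmix(Px, P, K):
    """Return an invertible B minimizing H_mar(B ∘ x), cf. proof of Theorem 1."""
    best, best_H = None, np.inf
    for entries in itertools.product(range(P), repeat=K * K):
        B = np.array(entries).reshape(K, K)
        if round(np.linalg.det(B)) % P == 0:  # singular over GF(P): skip
            continue
        H = marginal_entropy_sum(B, Px, P, K)
        if H < best_H:
            best, best_H = B, H
    return best
```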
IV. PAIRWISE INDEPENDENCE IMPLYING FULL INDEPENDENCE
One of the basic, key concepts in ICA over R is the Darmois-Skitovich Theorem (e.g., [6] p. 218), which is used, either explicitly or implicitly, in many ICA methods ([2]). This theorem states that if two linear combinations (over R) of statistically independent random variables are statistically independent, then all the random variables which participate (with non-zero coefficients) in both combinations must be Gaussian. Consequently (see, e.g., [2]), under the classical identifiability condition (for ICA over R) of not more than one Gaussian source, pairwise statistical independence of linear mixtures of the sources always implies their full mutual statistical independence (namely, a non-mixing condition).

As we shall show in this section, this property does not carry over to our GF(P) scenario by mere substitution of the Gaussian distribution with the uniform. As it turns out, under our identifiability condition (for ICA over GF(P)) of no uniform sources, pairwise independence implies full independence in GF(2) and in GF(3), but not in GF(P) with P > 3. The reason for this distinction is the distinction made in Lemma 1 above, regarding the possibility that a linear combination of non-uniform, independent random variables be uniform in GF(P) with P > 3 (but not in GF(2) or in GF(3)).

Indeed, consider three independent random variables s_1, s_2 and s_3 in GF(5), with probability vectors p_1 = p_2 and p_3 (resp.) following the example given in the proof of Lemma 1. Namely, let the respective characteristic vectors be given by p̃_1 = p̃_2 = [1 0 0.3 0.3 0]^T and p̃_3 = [1 0.4 0 0 0.4]^T. This implies p_1 = p_2 ≈ [0.32 0.10 0.24 0.24 0.10]^T and p_3 ≈ [0.36 0.25 0.07 0.07 0.25]^T. Clearly, our identifiability condition is satisfied here, since none of these random variables is uniform. However, s_1 ⊕ s_3, as well as s_2 ⊕ s_3, are uniform. Thus, consider the mixing-matrix A = [1 0 0; 0 1 0; 1 1 1], which yields
x_1 = s_1, x_2 = s_2, x_3 = s_1 ⊕ s_2 ⊕ s_3. (19)
Now, x_1 and x_2 are obviously statistically independent. Moreover, since s_2 ⊕ s_3 is uniform and independent of s_1, we deduce, by Lemma 2, that x_1 and x_3 are also statistically independent. Similarly, by switching roles between s_1 and s_2, we further deduce that x_2 and x_3 are statistically independent as well. Therefore, x_1, x_2 and x_3 are pairwise independent, but are clearly not fully mutually independent.

Obviously, such a counter-example cannot be constructed in GF(2) or in GF(3), since in these fields a linear combination of non-uniform, statistically independent random variables cannot be uniform. Furthermore, our following theorem asserts that, under our identifiability conditions, pairwise statistical independence of the mixtures indeed implies their full statistical independence in GF(2) and in GF(3).
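The counter-example can be confirmed numerically; the sketch below (ours) builds the joint tensor of (x_1, x_2, x_3) over GF(5) and checks the pairwise factorizations against the full one.

```python
import itertools
import numpy as np

P = 5
p1 = np.real(np.fft.ifft([1, 0, 0.3, 0.3, 0]))   # = p2
p3 = np.real(np.fft.ifft([1, 0.4, 0, 0, 0.4]))

# Joint tensor of x1 = s1, x2 = s2, x3 = s1 ⊕ s2 ⊕ s3 (mod 5)
Px = np.zeros((P, P, P))
for a, b, c in itertools.product(range(P), repeat=3):
    Px[a, b, (a + b + c) % P] += p1[a] * p1[b] * p3[c]

# Pairwise joints factor into products of marginals...
P12, P13, P23 = Px.sum(2), Px.sum(1), Px.sum(0)
m1, m2, m3 = Px.sum((1, 2)), Px.sum((0, 2)), Px.sum((0, 1))
print(np.allclose(P12, np.outer(m1, m2)))  # True
print(np.allclose(P13, np.outer(m1, m3)))  # True
print(np.allclose(P23, np.outer(m2, m3)))  # True
# ...but the full joint distribution does not:
full = m1[:, None, None] * m2[None, :, None] * m3[None, None, :]
print(np.allclose(Px, full))               # False
```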
Theorem 2: Let s be a K × 1 random vector whose elements are statistically-independent, non-degenerate and non-uniform random variables in GF(2) or in GF(3). Let y = D ∘ s denote a K × 1 vector of non-trivial linear combinations of the elements of s over the field, prescribed by the elements of the K × K matrix D.

If the elements of y are all pairwise statistically independent (namely, if y_k is statistically independent of y_ℓ for all k ≠ ℓ, k, ℓ ∈ {1, ..., K}), then D = Π ∘ Λ, where Π is a K × K permutation matrix and Λ is a K × K non-singular diagonal matrix in the field. In other words, the elements of y are merely a permutation of the (possibly scaled) elements of s, and are therefore not only pairwise, but also fully statistically independent.

Obviously, in GF(2) Λ must be I (no scaling ambiguity), and in GF(3) (assuming the group {0, 1, −1}), Λ has only ±1-s along its diagonal (the scaling ambiguity is just a sign ambiguity). A proof for each of the two cases, GF(2) and GF(3), is provided in Appendix A. We now proceed to propose practical separation approaches.

V. PRACTICAL SEPARATION APPROACHES
In this section we propose two possible practical separation approaches, based on the properties developed above.

Note that any approach which exploits the full statistical description of the joint probability distribution of x would require collection (estimation) and some manipulation of the probabilities tensor P_x, which is P^K large; therefore, a computational load of at least O(P^K) seems inevitable. Still, this is significantly smaller (and often realistically far more affordable) than roughly O(K² P^{K²}) (as required by a brute-force search for the unmixing matrix), even for relatively small values of P and K. Note further, that in order to obtain reasonable estimates of P_x in practice, the number of available observation vectors T has to be significantly larger than P^K (the size of P_x). The estimate of P_x can be obtained by the following simple collection process:
1) Initialize P̂_x as an all-zeros tensor;
2) For t = 1, 2, ..., T, set P̂_x(x[t]) ← P̂_x(x[t]) + 1;
3) Set P̂_x ← (1/T) · P̂_x.
Fortunately, however, a single collection of the observations' statistics for obtaining P̂_x is generally sufficient, since, in order to obtain the empirical statistical characterization P̂_y of any linear transformation y = G ∘ x of the observations (where G is an arbitrary L × K matrix with elements in GF(P)), it is not necessary to actually apply the transformation to the T available observation vectors and then recollect the probabilities. The same result can be obtained directly (without re-involving the observations), simply by applying a similar accumulation procedure to the K-way tensor P̂_x in constructing the L-way tensor P̂_y:
1) Initialize P̂_y as an all-zeros tensor;
2) Running over all P^K index-vectors i (from [0 ··· 0]^T to [P−1 ··· P−1]^T), set
P̂_y(G ∘ i) ← P̂_y(G ∘ i) + P̂_x(i). (20)
Note that when G is a square invertible matrix, P̂_y is simply a permutation of P̂_x.
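Both steps map directly to code; the following sketch (ours) implements the collection process and the tensor-accumulation procedure (20).

```python
import itertools
import numpy as np

def estimate_Px(X, P):
    """Collect the empirical K-way probabilities tensor from the T columns of X."""
    K, T = X.shape
    Px = np.zeros((P,) * K)
    for t in range(T):
        Px[tuple(X[:, t])] += 1.0
    return Px / T

def transform_tensor(Px, G, P):
    """Empirical tensor of y = G ∘ x via the accumulation procedure (20)."""
    L, K = G.shape
    Py = np.zeros((P,) * L)
    for i in itertools.product(range(P), repeat=K):
        Py[tuple(np.dot(G, i) % P)] += Px[i]
    return Py
```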
A. Ascending Minimization of EntRopies for ICA (AMERICA)

Our first approach is based on minimizing the individual entropies of the recovered sources. Conceptually, such an approach can consist of going over all possible P^K − 1 nontrivial linear combinations of the observations, and computing their respective entropies. Then, given these entropies, we need to select the K linear combinations with the smallest entropies, such that their respective linear-combination coefficients vectors (rows of the implied unmixing matrix) are linearly independent (in GF(P)).

Let us first consider the computation of the entropies of all possible (nontrivial) P^K − 1 linear combinations, prescribed by the coefficients vectors i_n (for n = 1, ..., P^K − 1). Each requires the computation of the respective probabilities vector p_{y_n} of y_n = i_n^T ∘ x, by applying the above-mentioned tensor-accumulation procedure with G = i_n^T to the tensor P̂_x. Thus, the number of required multiplications is roughly O(K · (P^K)²) = O(K · P^{2K}), which is much smaller than O(K² · P^{K²}) (the brute-force search cost), but may still be quite large. Fortunately, it is possible to compute the required probabilities vectors more conveniently, via the estimated characteristic tensor P̃̂_x, which can be obtained using a multidimensional Fast Fourier Transform (FFT).

The proposed computation proceeds as follows. First, given the estimated probabilities tensor P̂_x, we obtain the estimated characteristic tensor P̃̂_x using a K-dimensional FFT, by successively applying 1-dimensional radix-P DFTs along each of the K dimensions. Thus, for each dimension we compute P^{K−1} P-long DFTs, at the cost of O(P^{K−1} · (P log P)) = O(P^K log P). The total cost for obtaining P̃̂_x is therefore O(K · P^K log P) = O(P^K log(P^K)), rather than O((P^K)²), as would be required by direct calculation.

Now, in order to obtain the characteristic vector p̃_{y_n} of y_n = i_n^T ∘ x, we can exploit the following relation:
p̃_{y_n}(m) = E[W_P^{m y_n}] = E[W_P^{m i_n^T ∘ x}] = P̃_x(m ⊗ i_n), m = 0, ..., P − 1, (21)
which means that for each i_n, each (m-th) element of the characteristic vector of y_n can be extracted from the respective element (m ⊗ i_n) of P̃_x. Note further, that the first (m = 0) element of each characteristic vector is 1; and that the conjugate-symmetry of the characteristic vectors can be exploited, such that only the "first half" (m = 1, ..., ⌊P/2⌋) needs to be extracted from P̃_x. Naturally, in the absence of the true P̃_x, we would use the empirical P̃̂_x, obtained from the empirical probabilities tensor P̂_x, as described above.

The extraction of the characteristic vectors p̃_{y_n} for all i_n requires O(P^K · PK) additional operations. Once these vectors are obtained, they are each converted, using an inverse FFT, into probabilities vectors p_{y_n}, from which the entropies are readily obtained. This requires additional O(P^K · (P log P + P)) operations (excluding the computation of P · P^K logarithms).

Given the entropies of all possible linear combinations (ignoring the trivial i_0 = 0), the one with the smallest entropy corresponds to the first extracted source. Once the smallest-entropy source is identified, a "natural" choice is to proceed to the linear combination yielding the second-smallest entropy (and so forth), but special care has to be taken, so that each selected coefficients vector should not be linearly dependent (in GF(P)) on the previous ones.
One possible way to assure this is to take a "deflation" approach (also sometimes taken in classical ICA - see, e.g., [9] or [1]), in which each extracted source is first eliminated from the mixture, and then the lowest-entropy combination of the remaining ("deflated") mixtures is taken as the "next" extracted source. However, such an approach requires finding the coefficients needed for elimination of the extracted source from each mixture element, as well as recalculation of all the entropies after each deflation stage, which seems computationally expensive. A possible alternative is to use a greedy sequential extraction, such that the k-th chosen coefficients vector is the one associated with the smallest entropy while being linearly independent of the previously selected k − 1 coefficients vectors. Checking whether a K × 1 vector b̂_k is linearly independent of the K × 1 vectors b̂_1, b̂_2, ..., b̂_{k−1} amounts to checking whether there exists a nonzero k × 1 vector α, such that [b̂_1 ··· b̂_k] ∘ α = 0, which can be checked by an exhaustive search among all possible nonzero k × 1 vectors in GF(P). This roughly adds (in the "worst", last stage, with k = K) O(K² · P^K) multiplications.

The total computational cost is therefore approximately O(P^K · (K² + KP + K log P + P log P + P)). The proposed algorithm, which was given the acronym "AMERICA" (Ascending Minimization of EntRopies for ICA), is summarized in Table 1.

Algorithm 1: AMERICA
Input: P̂_x - the mixtures' K-way P × P × ··· × P estimated (empirical) probabilities tensor;
Output: B̂ - the K × K estimated separation matrix;
Notations:
We denote by the K × 1 P-nary vector i_n the n-th index vector (for n = 0, ..., P^K − 1), such that n = Σ_{k=1}^K i_n(k) P^{k−1}, where i_n = [i_n(1) ··· i_n(K)]^T. All indices in the description below run from 0.

Algorithm:
1) Compute P̃̂_x, the observations' empirical characteristic tensor, by applying a K-dimensional radix-P FFT to P̂_x.
2) For n = 1, ..., P^K − 1, compute h_n, the (empirical) entropy of the random variable y_n := i_n^T ∘ x, as follows:
   a) Obtain the P × 1 empirical characteristic vector of y_n, denoted p̃_n, as follows:
      i) Set p̃_n(0) := 1;
      ii) Set p̃_n(1) := P̃̂_x(i_n);
      iii) If P ≥ 3, set p̃_n(P − 1) := P̃̂*_x(i_n);
      iv) If P > 3, then for m = 2, ..., (P − 1)/2, set p̃_n(m) := P̃̂_x(m ⊗ i_n) and p̃_n(P − m) := P̃̂*_x(m ⊗ i_n);
   b) Obtain the P × 1 empirical probabilities vector of y_n, denoted p_n, by applying an inverse FFT to the vector p̃_n;
   c) Obtain h_n = −Σ_{m=0}^{P−1} p_n(m) log p_n(m);
3) Find the smallest entropy among h_1, ..., h_{P^K−1} and denote the minimizing index n_1 (i.e., h_{n_1} = min_{n≥1} h_n);
4) Set B̂ := i_{n_1}^T and mark h_{n_1} as "used";
5) Repeat for k = 2, ..., K:
   a) Find the smallest among all "unused" entropies; denote the minimizing index n_k;
   b) Construct the test-matrix B̄ := [B̂^T i_{n_k}];
   c) Go over all nonzero length-k index vectors j_n (n = 1, ..., P^k − 1), checking whether B̄ ∘ j_n = 0 for some n. If such a j_n is found, mark h_{n_k} as "used" and find the next smallest entropy (i.e., go to step 5a);
   d) Set B̂ := B̄^T.
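A compact NumPy rendering of Algorithm 1's logic follows (our sketch; it uses the direct accumulation (20) instead of the FFT shortcut of steps 1-2, and omits the tie-randomization discussed in Section VI, so it is illustrative rather than an exact transcription):

```python
import itertools
import numpy as np

def america(Px, P, K):
    """AMERICA (Algorithm 1): greedy smallest-entropy extraction."""
    combos = list(itertools.product(range(P), repeat=K))
    # Entropy of y_n = i_n^T ∘ x for every nonzero coefficients vector i_n
    h = {}
    for i_n in combos[1:]:
        p = np.zeros(P)
        for m in combos:
            p[np.dot(i_n, m) % P] += Px[m]
        q = p[p > 0]
        h[i_n] = -np.sum(q * np.log(q))
    rows = []
    for i_n, _ in sorted(h.items(), key=lambda kv: kv[1]):
        cand = rows + [i_n]
        M = np.array(cand).T
        # keep i_n only if no nonzero alpha gives [b_1 ... b_k] ∘ alpha = 0
        if all(np.any(np.dot(M, a) % P) for a in
               itertools.product(range(P), repeat=len(cand)) if any(a)):
            rows = cand
        if len(rows) == K:
            break
    return np.array(rows)  # estimated separation matrix B̂
```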
B. Minimizing Entropies by eXchanging In COuples (MEXICO)

An alternative separation approach, which avoids prior calculation of the entropies of all possible linear combinations, is to try to find the separating transformation by successively minimizing the entropies in couples (going over all couples' combinations in each "sweep"). More specifically, let x_1 and x_2 denote the first two elements of the mixtures vector, and let P_{1,2} denote their P × P joint probability matrix, which can be obtained from the tensor P_x by summing along all other dimensions:
P_{1,2}(m,n) = Σ_{i_3,...,i_K=0}^{P−1} P_x(m, n, i_3, ..., i_K), m, n ∈ [0, P − 1]. (22)
Consider a random variable of the form
x̄_1 = x_1 ⊕ c ⊗ x_2, (23)
where c ∈ [1, P − 1] is some constant. Let p_{x̄_1}(c) denote the probabilities vector of x̄_1. The m-th element of this vector is given (depending on c) by
p_{x̄_1}(m; c) = Pr{x_1 ⊕ c ⊗ x_2 = m} = Σ_{n=0}^{P−1} Pr{x_1 = n, c ⊗ x_2 = m ⊖ n} = Σ_{n=0}^{P−1} P_{1,2}(n, c^{−1} ⊗ (m ⊖ n)), (24)
where c^{−1} denotes the reciprocal of c in GF(P), such that c ⊗ c^{−1} = 1. The entropy of x̄_1 is then given by
H(x̄_1; c) = −Σ_{m=0}^{P−1} p_{x̄_1}(m; c) log p_{x̄_1}(m; c). (25)
Consider the value c° of c which minimizes H(x̄_1; c). If the resulting entropy is smaller than the entropy of x_1, then substitution of x_1 with x̄_1 = x_1 ⊕ c° ⊗ x_2 in x would be an invertible linear transformation which reduces the sum of entropies of the elements of x.

Note, in addition, that following this transformation the mutual information I(x̄_1, x_2) = H(x̄_1) + H(x_2) − H(x̄_1, x_2) will be smaller than I(x_1, x_2), because the joint entropies H(x_1, x_2) and H(x̄_1, x_2) are the same (since the transformation is invertible). Therefore, this transformation also makes these two elements "more independent".

Thus, based on this basic operation, a separation approach can be taken as follows. Let y denote the random vector of "demixed" sources to be constructed by successive linear transformations of x, and initialize y = x, along with its probabilities tensor P_y = P_x. Proceed sequentially through all couples y_k, y_ℓ in y: For each couple, compute the joint probabilities matrix P_{k,ℓ}, and then look for the value of c which minimizes the entropy of ȳ_k = y_k ⊕ c ⊗ y_ℓ. If this entropy is smaller than that of y_k, replace y_k with ȳ_k, recording the implied linear transformation as ȳ = V(k,ℓ; c) ∘ y, where
V(k,ℓ; c) := I + c · E_{k,ℓ}, (26)
E_{k,ℓ} denoting a K × K all-zeros matrix with a 1 at the (k,ℓ)-th position. If the minimal entropy of ȳ_k is larger than that of y_k, no update takes place, and the next couple is addressed.

Upon an update, ȳ serves as the new y. The probabilities tensor P_y is updated accordingly (this update is merely a permutation, attainable using (20) with G = V(k,ℓ; c)). The procedure is repeated for each indices-couple (k,ℓ) (with k ≠ ℓ), and we term a "sweep" as a sequential pass over all possible K(K − 1) combinations (note that there is no symmetry here, namely, the couple (ℓ,k) is essentially different from (k,ℓ)). Sweeps are repeated sequentially, until a full sweep without a single update occurs, which terminates the process.

In practice, the algorithm is applied starting with the empirical observations' probabilities tensor P̂_x, and the accumulated sequential left-product of the V(k,ℓ; c) matrices yields the estimated separating matrix.
Since the sum of marginal entropies of the elements of y is bounded below and is guaranteed not to increase (usually to decrease) in each sweep, and since the algorithm stops upon encountering the first sweep without such a decrease, such a stop is guaranteed to occur within a finite number of sweeps.

Note, however, that in general there is no guarantee for consistent separation using this algorithm, i.e., even if the true probabilities tensor P_x of the observations is known (and used), the stopping point is generally not guaranteed to imply separation. The rationale behind this algorithm is the hope that such a "pairwise separation" scheme would ultimately yield pairwise independence, which, at least for P = 2 and P = 3, would in turn imply full independence (hence separation), per Theorem 2 above. Strictly speaking, however, this algorithm is not even guaranteed to yield pairwise separation. For example, consider the P = 2 case, with a mixing matrix
A = [1 1 0; 0 1 1; 1 0 1], (27)
when all the sources have equal p(1) (probability of taking the value 1). In this particular case, the number of 1-s in a linear combination of any two lines is greater than or equal to the number of 1-s in each of the two lines. Therefore, there is no pairwise linear combination which reduces the entropy of any of the mixtures in this case. Therefore, the algorithm may stop short of full separation when such a condition is encountered.

Nevertheless, such conditions are relatively rare, and, as we show in simulation results in the following section, this algorithm is quite successful. Its leading advantage over AMERICA is in its reduced computational complexity when the unmixing matrix B is sparse and K ≫ P.

Indeed, the computational complexity of this iterative algorithm naturally depends on the number of required sweeps and on the number of updates in each sweep - which in turn depend strongly on the true mixing matrix A (and, to some extent, also on the sources' realizations). Testing each couple (k,ℓ) requires computation of the joint probabilities matrix P_{k,ℓ} - which requires O(P^K) additions (no multiplications are needed). Then, looking for the optimal c requires P − 1 computations of the probabilities vector of the respective ȳ_k - a total of additional O(P³) additions (again, no multiplications are needed for this) and O(P²) log operations. If an update takes place, recalculation of P̂_y is also needed, which is O(P^K) (but, as mentioned above, this is merely a permutation of the tensor).

Therefore, the first sweep requires roughly O(K²(P^K + P³)) operations and O(K²P²) log operations, plus O(P^K) for each update within the sweep. Naturally, a couple tested in one sweep does not have to be tested in a subsequent sweep if no substitution involving any of its members had occurred in the former. Therefore, for subsequent sweeps the number of operations can be significantly smaller, depending on the number of updates occurring along the way - which is obviously data-dependent. The number of required sweeps is also data-dependent.

Thus, the computational complexity of this algorithm, assuming K > 2, can be roughly estimated at O(P^K · N_d KP), where N_d denotes a data-dependent constant, which can be very small (of the order of 10^{−1}) when the true demixing matrix B is very sparse (only a few sweeps with few updates are needed), but can be considerably large when B is rather "rich".
Compared to the computational complexity of AMERICA, we observe that, assuming K ≫ P, this algorithm is preferable if N_d P < K. The algorithm, which was given the acronym "MEXICO" (Minimizing Entropies by eXchanging In COuples), is summarized in Table 2.

Algorithm 2: MEXICO
Input: P̂_x - the mixtures' K-way P × P × ··· × P estimated (empirical) probabilities tensor;
Output: B̂ - the K × K estimated separation matrix;
Algorithm:
1) Initialize: B̂ := I. Conceptually, we denote the "demixed" random vector y := B̂ ∘ x, so set P̂_y := P̂_x;
2) Initialize: h = [h_1 ··· h_K]^T with the empirical entropies of the K respective elements y_k (each computed from the empirical probabilities vector, which is obtained by summation over all other (≠ k) dimensions in P̂_y);
3) Initialize F as a K × K all-ones flags matrix: F(k,ℓ) = 1 means that the (k,ℓ)-th couple needs to be (re)tested;
4) Run a "sweep": Repeat for k = 1, ..., K, for ℓ = 1, ..., K, ℓ ≠ k: If F(k,ℓ) = 1, do the following:
   a) Compute P_{k,ℓ}, the empirical joint probabilities matrix of y_k and y_ℓ, by summation over all other dimensions (≠ k, ℓ) in P̂_y;
   b) For c = 1, ..., P − 1, compute the elements of p_{ȳ_k}(c), the probabilities vector of ȳ_k = y_k ⊕ c ⊗ y_ℓ, in a way similar to (24), yielding its entropy H(ȳ_k; c);
   c) Denote the minimum entropy as H° = H(ȳ_k; c°) (with c° denoting the minimizing c);
   d) If H° < h_k, apply a substitution:
      i) Set V = I + c° · E_{k,ℓ};
      ii) Update B̂ := V ∘ B̂;
      iii) Update the probabilities tensor using (20) with G = V;
      iv) Mark all couples involving k as "need to be retested": F(k,:) := 1, F(:,k) := 1;
      v) Update h_k := H°;
      vi) (Conceptually: y := V ∘ y);
   e) Mark the (k,ℓ)-th element as "tested": F(k,ℓ) := 0, and proceed;
5) If F ≠ 0 (there are still couples to be (re)tested), run another sweep; else stop.
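A direct NumPy sketch of Algorithm 2 follows (ours; the tensor update in step 4d-iii implements (20) inline):

```python
import itertools
import numpy as np

def mexico(Px, P, K):
    """MEXICO (Algorithm 2): pairwise entropy-reducing sweeps."""
    def H(p):
        q = p[p > 0]
        return -np.sum(q * np.log(q))
    axes = lambda keep: tuple(i for i in range(K) if i not in keep)
    B = np.eye(K, dtype=int)
    Py = Px.copy()
    h = [H(Py.sum(axis=axes({k}))) for k in range(K)]
    F = np.ones((K, K), dtype=bool)
    np.fill_diagonal(F, False)
    while F.any():                       # run sweeps until nothing is flagged
        for k in range(K):
            for l in range(K):
                if k == l or not F[k, l]:
                    continue
                Pkl = Py.sum(axis=axes({k, l}))
                if l < k:                # ensure Pkl[m, n] = Pr{y_k=m, y_l=n}
                    Pkl = Pkl.T
                best_H, best_c = h[k], 0
                for c in range(1, P):    # entropy of y_k ⊕ c⊗y_l, per (24)
                    cinv = pow(c, P - 2, P)   # c^{-1} in GF(P) (Fermat)
                    p = np.zeros(P)
                    for m in range(P):
                        for n in range(P):
                            p[m] += Pkl[n, (cinv * (m - n)) % P]
                    if H(p) < best_H:
                        best_H, best_c = H(p), c
                if best_c:               # apply the substitution (26)
                    V = np.eye(K, dtype=int)
                    V[k, l] = best_c
                    B = np.dot(V, B) % P
                    newPy = np.zeros_like(Py)   # tensor update via (20)
                    for idx in itertools.product(range(P), repeat=K):
                        newPy[tuple(np.dot(V, idx) % P)] += Py[idx]
                    Py = newPy
                    F[k, :] = True
                    F[:, k] = True
                    h[k] = best_H
                F[k, l] = False
    return B  # estimated separation matrix B̂
```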
VI. RUDIMENTARY PERFORMANCE ANALYSIS AND SIMULATION RESULTS
In this section we present a rudimentary analysis of the expected performance of the proposed algorithms, in order to obtain an estimate of the expected rate of success in separating the sources, at least in some simple cases.

Let us first establish the concept of equivariance. In classical ICA, an algorithm is called equivariant (see, e.g., [10]) with respect to the mixing matrix A, if its performance does not depend on A (as long as it is invertible), but only on the realization of the sources. This appealing property is shared by many (but certainly not by all) classical ICA algorithms (in the context of noiseless classical ICA). We shall now show that, with some slight modification, the AMERICA algorithm is equivariant.

Recall that AMERICA is based on computation of all the empirical probabilities vectors p_{y_n} of the random variables y_n = i_n^T ∘ x for all possible index-combinations i_n, followed by sequential extraction of the index-vectors i_n corresponding to the smallest entropies (while maintaining sequential mutual linear independence). Although not directly calculated in this way in the algorithm, the ℓ-th element of p_{y_n} is evidently given by
p_{y_n}(ℓ) = P̂r{i_n^T ∘ x = ℓ} = (1/T) Σ_{t=1}^T I{i_n^T ∘ x[t] = ℓ}, (28)
where P̂r{·} denotes the empirical probability, and where I{·} denotes the indicator function. But since x[t] = A ∘ s[t], we obviously have
I{i_n^T ∘ x[t] = ℓ} = I{(A^T ∘ i_n)^T ∘ s[t] = ℓ}, (29)
which means that with any given realization s[1], ..., s[T] of the sources, the empirical probabilities vector p_{y_n} of y_n = i_n^T ∘ x obtained when the mixing matrix is A, is equal to some empirical probabilities vector p_{y_m} of y_m = i_m^T ∘ x obtained when the mixing matrix is I (i.e., when there is no mixing), such that i_m = A^T ∘ i_n. Since A is invertible, this relation is bijective, which implies that the P^K − 1 empirical probabilities vectors obtained with any (invertible) mixing are merely a permutation of the same set of P^K − 1 vectors that would be obtained when the sources are not mixed. Consequently, if, based on the empirical entropies of these empirical vectors, the matrix B̂ = [i_{n_1} i_{n_2} ··· i_{n_K}]^T is formed by the algorithm when the mixing-matrix is A, this implies that the matrix
B̂' := [i_{m_1} i_{m_2} ··· i_{m_K}]^T = (A^T ∘ [i_{n_1} i_{n_2} ··· i_{n_K}])^T = B̂ ∘ A (30)
would be formed by the algorithm when the sources are unmixed. Consequently, the overall mixing-unmixing matrix² B̂ ∘ A in the mixed case would equal the overall mixing-unmixing matrix B̂' ∘ I = B̂ ∘ A in the unmixed case. This means that, no matter what the (invertible) mixing matrix is, the overall mixing-unmixing matrix would be the same as would be obtained by the AMERICA algorithm in the unmixed case - implying the desired equivariance property.

²This matrix is sometimes also called the "contamination" matrix, describing the residual mixing (if any).

There is, however, one small caveat that has to be considered. The reasoning above assumes that the sequential progress of the algorithm through the sorted empirical entropies for selecting, testing (for linear dependence) and using the index-vectors is uniquely determined by the calculated entropy values, and is independent of the values of the index-vectors.
This is generally true, with one possibleexception: If the set of empirical entropies happens to contain a subset with equal entropies, the(arbitrary) order in which the index-vectors within such a subset are sorted is usually lexicographic- which introduces dependence on the actual index values, and such dependence is not permutation-invariant - thereby potentially introducing dependence on the mixing matrix in turn. In order to avoidthis condition, any sub-group with equal empirical entropies should be somehow inner-sorted in away which is independent of corresponding index-vectors values - e.g., by randomization. Note thatthe occurrence of such a subset (with empirical entropies that are exactly equal) becomes very rarewhen the number of observations T is large, but may certainly happen when T is relatively small.Note further, that with such randomization the attained separation for a given realization dependsnot only on the sources’ realization, but also on this random sorting within subsets (but not on themixing matrix), and therefore only statistical measures of the performance (e.g., the probability ofperfect separation) can be considered equivariant.Having established the equivariance, we now proceed to analyze the probability of perfect separationin the most simple case: P = 2 , K = 2 . Thanks to the equivariance property we may assume, withoutloss of generality, that the mixing matrix is the identity matrix, A = I . Let p s (1) = ρ (resp., p s (1) = ρ ) denote the probability with which the first (resp., second) source takes the value . Dueto the assumed non-mixing conditions ( A = I ), these are also the probabilities of the ”mixtures” x and x . To characterize the empirical probabilities tensor b P x , let us denote by N , N , N and N the number of occurrences of x [ t ] = [0 0] T , x [ t ] = [0 1] T , x [ t ] = [1 0] T and x [ t ] = [1 1] T (resp.)within the observed sequence of length T . Thus, the elements of the × empirical probabilitiestensor (matrix in this case) are b P x ( m , m ) = N m ,m /T , for m , m ∈ { , } .The empirical probability ˆ p x (1) of x taking the value is given by b P x (1 ,
0) + b P x (1 ,
1) =( N + N ) /T . The empirical probability ˆ p x ⊕ x (1) of the random variable x ⊕ x taking the value is given by b P x (1 , b P x (0 ,
1) = ( N + N ) /T . An identification error would occur if the entropyassociated with the latter be smaller than that associate with the former (because then the (wrong)linear combination vector i T = [1 1] would be preferred by the algorithm over the (correct) linearcombination vector i T = [1 0] as a row in ˆ B ).In the P = 2 case, the entropy is monotonically decreasing in the distance of p (1) (or p (0) ) from . Assuming that T is “sufficiently large”, the empirical ˆ p x (1) would be close to its true value ρ , and the empirical ˆ p x ⊕ x (1) would be close to its true value ρ (1 − ρ )+ ρ (1 − ρ ) = ρ + ρ − ρ ρ .Assuming that ρ , ρ < , both ρ and ρ + ρ − ρ ρ are smaller than , and we can thereforeassume that so are the empirical ˆ p x (1) and ˆ p x ⊕ x (1) . Thus, the empirical entropy associated withthe linear combination x ⊕ x would be smaller than that associated with x if ˆ p x ⊕ x (1) < ˆ p x (1) ⇔ T ( N + N ) < T ( N + N ) ⇔ N < N . (31)We are therefore interested in the probability of the event Ξ1 : N < N . Let us denote by N △ = N + N the number of occurrences of x [ t ] = 1 in [1 , T ] . The probability of Ξ1 can thenbe expressed as follows: Pr { Ξ1 } = Pr { N < N } = Pr { N > N } = T X M =1 Pr { N = M ∩ N > M } = T X M =1 Pr { N = M } Pr { N > M | N = M } . (32)Due to the statistical independence between the sources (and therefore between x and x ), giventhat N = M , the random variable N is simply the number of occurrences of x [ t ] = 1 among M independent trials - a Binomial random variable with M trials and probability ρ , which we shalldenote as N ,M ∼ B ( M, ρ ) . Thus, Pr { Ξ1 } = Pr { N < N } = T X M =1 Pr { N = M } Pr { N ,M > M } = T X M =1 (cid:18) TM (cid:19) ρ M (1 − ρ ) T − M · M X N = (cid:22) M (cid:23) +1 (cid:18) MN (cid:19) ρ N (1 − ρ ) M − N . (33)The inner sum is the complementary cumulative distribution function of the binomial distribution,which can also be expressed using the normalized incomplete beta function , Pr { N ,M > M } = 1 − Pr { N ,M ≤ M } =1 − I − ρ ( M − (cid:4) M (cid:5) , (cid:4) M (cid:5) + 1) = I ρ ( (cid:4) M (cid:5) + 1 , (cid:6) M (cid:7) ) , (34)with I p ( n, m ) △ = n (cid:18) n + m − m − (cid:19) p Z t n − (1 − t ) m − dt = 1 − I − p ( m, n ) . (35)Note further (from (33)), that the probability of Ξ1 can be expressed as Pr { Ξ1 } = E (cid:2) I ρ ( (cid:4) N (cid:5) + 1 , (cid:6) N (cid:7) ) (cid:3) (36) See, e.g., Binomial Distribution from Wikipedia [online], available: http://en.wikipedia.org/wiki/Binomial_distribution (where the expectation is taken with respect to N ). When ρ · T is ”sufficiently large” this probabilitymay be approximated by substituting N with its mean, E [ N ] = ρ · T , Pr { Ξ1 } ≈ I ρ ( (cid:4) ρ T (cid:5) + 1 , (cid:6) ρ T (cid:7) ) . (37)The event Ξ1 is just one possible component of an error event in which the algorithm wouldprefer the (wrong) linear combination vector i T = [1 1] over the (correct) linear combination vector i T = [1 0] . Such an error may also happen when the empirical entropies of x and of x ⊕ x areequal, namely when N = N : assuming that the algorithm makes a random decision in such cases(to ensure mean equivariance, as discussed above), the probability of an error being caused by thisevent (denoted Ξ2 ) would be Pr { Ξ2 } . Evidently, Pr { Ξ2 } = Pr { N = N } = T X M =0 Pr { N = M } Pr { N ,M = M } = ⌊ T/ ⌋ X M ′ =0 (cid:0) T M ′ (cid:1) ρ M ′ (1 − ρ ) T − M ′ · (cid:0) M ′ M ′ (cid:1) ρ M ′ (1 − ρ ) M ′ = ⌊ T/ ⌋ X M =0 T !(1 − ρ ) T ( T − M )!( M !) 
The event Ξ_1 is just one possible component of an error event in which the algorithm would prefer the (wrong) linear combination vector i^T = [1 1] over the (correct) linear combination vector i^T = [1 0]. Such an error may also happen when the empirical entropies of x_1 and of x_1 ⊕ x_2 are equal, namely when N_{01} = N_{11}: assuming that the algorithm makes a random decision in such cases (to ensure mean equivariance, as discussed above), the probability of an error being caused by this event (denoted Ξ_2) would be Pr{Ξ_2}/2. Evidently,

\Pr\{\Xi_2\} = \Pr\{N_{01} = N_{11}\} = \sum_{M=0}^{T} \Pr\{N=M\}\,\Pr\{N_{11,M} = M/2\} = \sum_{M'=0}^{\lfloor T/2\rfloor} \binom{T}{2M'}\rho_2^{2M'}(1-\rho_2)^{T-2M'}\binom{2M'}{M'}\rho_1^{M'}(1-\rho_1)^{M'} = \sum_{M=0}^{\lfloor T/2\rfloor} \frac{T!\,(1-\rho_2)^{T}}{(T-2M)!\,(M!)^2}\left(\frac{\rho_1\rho_2^{2}(1-\rho_1)}{(1-\rho_2)^{2}}\right)^{M}.   (38)

Note that since the event N_{11,M} = M/2 can only happen for even values of M, an approximation using the mean with respect to N (as used for Pr{Ξ_1} above) would be far less accurate, and will therefore not be pursued.

Summarizing this part of the error analysis, the probability that the algorithm would wrongly prefer i^T = [1 1] over i^T = [1 0] as a row in B̂ can be approximated as

\Pr\{\Xi_1\} + \tfrac{1}{2}\Pr\{\Xi_2\} \approx I_{\rho_1}\!\left(\lfloor \rho_2 T/2\rfloor + 1,\, \lceil \rho_2 T/2\rceil\right) + \tfrac{1}{2}\sum_{M=0}^{\lfloor T/2\rfloor} \frac{T!\,(1-\rho_2)^{T}}{(T-2M)!\,(M!)^2}\left(\frac{\rho_1\rho_2^{2}(1-\rho_1)}{(1-\rho_2)^{2}}\right)^{M}.   (39)

An error in the "opposite" direction occurs when the algorithm prefers i^T = [1 1] over i^T = [0 1] as a row in B̂. The probability of this kind of error is evidently given by the same expressions with the roles of ρ_1 and ρ_2 swapped. A failure of the algorithm is defined as the occurrence of either one of the two errors. Although they are certainly not mutually exclusive, we can still approximate (or at least provide an approximate upper bound for) the probability of occurrence of either one by the sum of the probabilities of occurrence of each. Assuming, for further simplicity of the exposition, that ρ_1 = ρ_2 = ρ, the approximate probability of failure is given by

\Pr\{\text{Failure}\} \approx 2\,I_{\rho}\!\left(\lfloor \rho T/2\rfloor + 1,\, \lceil \rho T/2\rceil\right) + \sum_{M=0}^{\lfloor T/2\rfloor} \frac{T!\,(1-\rho)^{T}}{(T-2M)!\,(M!)^2}\left(\frac{\rho^{3}}{1-\rho}\right)^{M}.   (40)

Recall that two assumptions are necessary for this approximation to hold: i) that ρ is sufficiently smaller than 0.5; and ii) that ρ · T is sufficiently large.
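A corresponding sketch of the approximate failure probability (40) (again in Python, for illustration only; the log-domain evaluation of the second term is our own addition, for numerical stability):

```python
import numpy as np
from scipy.special import betainc, gammaln


def prob_failure_k2(rho, T):
    """Approximate Pr{Failure} for K = 2, P = 2, per (40), assuming rho1 = rho2 = rho,
    with 0 < rho sufficiently below 0.5 and rho*T sufficiently large."""
    half = rho * T / 2.0
    beta_term = 2.0 * betainc(np.floor(half) + 1.0, np.ceil(half), rho)
    # "tie" term: sum_M  T!(1-rho)^T / ((T-2M)! (M!)^2) * (rho^3/(1-rho))^M
    tie_term = 0.0
    for M in range(T // 2 + 1):
        log_term = (gammaln(T + 1) - gammaln(T - 2 * M + 1) - 2.0 * gammaln(M + 1)
                    + T * np.log1p(-rho) + M * (3.0 * np.log(rho) - np.log1p(-rho)))
        tie_term += np.exp(log_term)
    return beta_term + tie_term
```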
Fig. 1. Empirical probability of failure ('o') and its analytic approximation (solid) vs. the probability ρ, for P = 2, K = 2 sources and T = 100. The empirical probabilities were obtained from repeated independent trials.

In order to test this approximation we simulated the mixing and separation of K = 2 independent binary (P = 2) sources, each taking the value 1 with probability ρ. In Fig. 1 we compare the analytic prediction (40) to the empirical probability of failure, obtained in repeated independent experiments (the sources and the mixing matrix were drawn independently in each trial), vs. ρ for T = 100. Failure of the separation is defined as the case in which B̂ ∘ A is not a permutation matrix. We used the AMERICA algorithm for separation (but for this (K = 2) case, similar results are obtained with MEXICO). The circles show the empirical probabilities, whereas the solid line shows the approximate analytic prediction (40). The good match is evident.

When K is larger than 2, an approximate error expression can be obtained by assuming that this type of error can occur independently for each of the K(K−1)/2 different couples. Under this approximate independence assumption, we get

\Pr\{\text{Failure}; K\} \approx 1 - \left(1 - \Pr\{\text{Failure}; K = 2\}\right)^{K(K-1)/2},   (41)

where Pr{Failure; K = 2} is given in (40) above. We assume here, for simplicity of the exposition, that all of the sources take the value 1 with similar probability ρ. Extension to the case of different probabilities can be readily obtained by using (39) for each (ordered) couple.
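Under the same assumptions, the extension (41) to K sources is a one-line wrapper around the earlier sketch of (40) (the helper name prob_failure_k2 is ours):

```python
def prob_failure_k(rho, T, K):
    """Approximate Pr{Failure; K} per (41): the K(K-1)/2 source couples are
    (approximately) assumed to fail independently, each with probability (40)."""
    p2 = prob_failure_k2(rho, T)
    return 1.0 - (1.0 - p2) ** (K * (K - 1) // 2)
```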
To illustrate, we compare this expression in Fig. 2 to the empirical probability of failure (obtained in repeated independent experiments) vs. T, for ρ = 0.35 and several values of K ranging from K = 2 to K = 6.

Fig. 2. Empirical probability of failure ('o') and its analytic approximation (solid) vs. the observation length T, for P = 2 and several values of K (from K = 2 to K = 6) with ρ = 0.35, using the AMERICA algorithm. The empirical probabilities were obtained from repeated independent trials.

Again, failure of the separation is defined as the case in which B̂ ∘ A is not a permutation matrix (namely, any result which does not provide perfect separation of all of the K sources is considered a "failure"). A good match is evident for the smaller values of K, with some departure for the higher values - as could be expected from the approximation induced by the error-independence assumption.

Next, we compare the empirical average running times of the two separation algorithms under asymptotic conditions. The "asymptotic" conditions are emulated by substituting the estimated (empirical) probabilities tensor P̂_x with the true probabilities tensor P_x as the input to the algorithms. We simulated two cases: a "full" mixing matrix and a "sparse" mixing matrix. The "full" K × K (non-singular) mixing matrices were randomly drawn in each trial as a product of a lower-triangular and an upper-triangular matrix. The lower-triangular matrix L was generated with random values independently and uniformly distributed in GF(P) on and below the main diagonal, substituting any 0-s along the main diagonal with 1-s; the upper-triangular matrix U was similarly generated by drawing all values above the main diagonal and setting the main diagonal to all-1-s. Then A = U ∘ L.
Fig. 3. Average running times (in seconds) for the AMERICA (dashed) and MEXICO (solid) algorithms, for full ('*') and sparse ('o') mixing matrices, vs. K, for P = 2, 3, 5 and 7. Note that the AMERICA plots for the full and the sparse mixing cases are nearly identical.

For generating the "sparse" matrices, the off-diagonal values of L and U were "sparsified" by randomly (and independently) zeroing out each element with a fixed probability. The elements of each of the sources' probabilities vectors p_{s_1}, ..., p_{s_K} were drawn uniformly in (0, 1) and then normalized by their sum. The average running times (using the Matlab® code [11] for both algorithms on a Pentium® PC) vs. P and K are shown in Fig. 3. Both algorithms were applied to the same data, and the running times were averaged over independent trials. As expected, the AMERICA algorithm is seen to be insensitive to the structure (full / sparse) of the mixing matrix; however, the MEXICO algorithm runs considerably faster when A is sparse. Therefore, in terms of running speed, MEXICO may be preferable when the mixing matrix is known to be sparse, especially for relatively high values of K.

Note, however, that this advantage is somewhat overcast by a degradation in the resulting separation performance. While perfect separation was obtained (thanks to the "asymptotic" conditions) in all of the timing experiments by the AMERICA algorithm, a few cases of imperfect separation by MEXICO were encountered, especially for the highest values of K with the "full" mixtures.
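A minimal sketch of this randomization procedure (ours, in Python; the experiments themselves used the Matlab® code [11], and the sparsification probability below is an assumed placeholder, since the exact value is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)


def random_mixing_matrix(K, P, sparse=False, zero_prob=0.5):
    """Draw a non-singular K-by-K mixing matrix over GF(P) as A = U o L (mod P):
    L is lower-triangular with any zero diagonal entries replaced by 1-s,
    U is upper-triangular with an all-1 diagonal."""
    L = np.tril(rng.integers(0, P, size=(K, K)))
    np.fill_diagonal(L, np.where(np.diag(L) == 0, 1, np.diag(L)))
    U = np.triu(rng.integers(0, P, size=(K, K)), k=1)
    np.fill_diagonal(U, 1)
    if sparse:  # independently zero out off-diagonal elements of L and U
        off_diag = ~np.eye(K, dtype=bool)
        L[(rng.random((K, K)) < zero_prob) & off_diag] = 0
        U[(rng.random((K, K)) < zero_prob) & off_diag] = 0
    return (U @ L) % P


def random_source_pmfs(K, P):
    """Draw each source's probability vector uniformly in (0,1), normalized to sum to 1."""
    pmfs = rng.uniform(size=(K, P))
    return pmfs / pmfs.sum(axis=1, keepdims=True)
```

Since the diagonals of L and U remain non-zero in GF(P), the product U ∘ L is guaranteed to be non-singular, with or without the sparsification step.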
Fig. 4. Empirical mean number of unseparated sources (out of the K = 5 sources) vs. the observation length T, for the AMERICA (dashed) and MEXICO (solid) algorithms, for full ('*') and sparse ('o') matrices, for P = 3, 5 and 7. Each point reflects the average over independent trials. Note that the AMERICA plots for the full and the sparse mixing cases are nearly identical.

To conclude this section, we provide (in Fig. 4) some empirical results showing the performance for P = 3, 5 and 7 with K = 5 sources, with random sources' probabilities vectors. The randomized elements of the probability vectors were independently drawn (for each source, at each trial) from a uniform distribution, and then normalized such that the sum of elements of each probability vector adds up to 1. The mixing matrix was randomized at each trial as described above, once for a "full A" and once for a "sparse A" version. In this experiment the performance is measured as the mean number of unseparated sources, which is defined (per trial) as the number of rows in the resulting "contamination matrix" B̂ ∘ A containing more than one non-zero element (since, by construction of B̂ in both MEXICO and AMERICA, B̂ ∘ A is always non-singular, this is exactly the number of sources which remain unseparated by the algorithm). Each result on the plot reflects the average over independent trials.

Evidently, the AMERICA algorithm seems significantly more successful than the MEXICO algorithm, especially for the higher values of P (interestingly, the performance of AMERICA seems to improve as P increases, whereas the performance of MEXICO exhibits an opposite trend). The advantage of MEXICO is confined to cases of small P and large K, where its potentially reduced computational load does not come at the expense of a severe degradation in performance.
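In code, this performance measure amounts to counting the rows of the contamination matrix with more than one non-zero entry; a minimal sketch (function name ours):

```python
import numpy as np


def num_unseparated_sources(B_hat, A, P):
    """Number of unseparated sources: rows of C = B_hat o A (matrix product over GF(P))
    containing more than one non-zero element. Since C is non-singular by construction,
    a zero count means perfect separation (up to permutation and scale)."""
    C = (B_hat @ A) % P
    return int(np.sum(np.count_nonzero(C, axis=1) > 1))
```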
VII. CONCLUSION

We provided a study of general properties, identifiability conditions and separation algorithms for ICA over Galois fields of prime order P. We have shown that a linear mixture of independent sources is identifiable (up to permutation and, for P > 2, up to scale) if and only if none of the sources is uniform. We have shown that pairwise independence of an invertible linear mixture of the sources implies their full independence (namely, implies that the mixture is a scaled permutation) for P = 2 and for P = 3, but not necessarily for P > 3.

We proposed two different iterative separation algorithms: The first algorithm, given the acronym AMERICA, is based on sequential identification of the smallest-entropy linear combinations of the mixtures. The second, given the acronym MEXICO, is based on sequential reduction of the pairwise mutual information measures. We provided a rudimentary performance analysis for P = 2, which applies to both algorithms with K = 2, demonstrating a good fit of the empirical results to the theoretical prediction. For higher values of K (still with P = 2), we demonstrated a reasonable fit of the approximate prediction for the AMERICA algorithm, with some departure for the larger values of K.

AMERICA is guaranteed to provide consistent separation (i.e., to recover all sources when the observation length T is infinite), and generally exhibits better performance (success rate) than MEXICO with finite data lengths. However, when the mixing matrix is known to be sparse, MEXICO can have some advantage over AMERICA in terms of computational efficiency, especially for larger values of K. Matlab® code for both algorithms is available online [11].

Extensions of our results to common variants of the classical ICA problem, such as ICA in the presence of additive noise, the under-determined case (more sources than mixtures), possible alternative sources of diversity (e.g., different temporal structures) of the sources, etc., are all possible. For example, just like in classical ICA, temporal or spectral diversity would enable relaxing the identifiability condition, so as to accommodate sources with uniform (marginal) distributions, which might be more commonly encountered. However, these extensions fall beyond the scope of our current work, whose main goal is to set the basis for migrating ICA from the real- (or complex-) valued algebraic field to another.

ACKNOWLEDGEMENT
The author would like to thank Jacob Goldberger for inspiring discussions on the use of the FFT for the AMERICA algorithm.

APPENDIX
A - A PROOF OF THE THEOREM FOR GF(2) AND GF(3)
Let s be a K × 1 random vector whose elements are statistically independent, non-degenerate and non-uniform random variables in either GF(2) or GF(3), and let y = D ∘ s denote a K × 1 vector of non-trivial linear combinations of the elements of s over the field, prescribed by the elements of the K × K matrix D (in either GF(2) or GF(3), resp.).

Assume that D is a general matrix, and consider any pair y_k and y_ℓ (k ≠ ℓ) in y. y_k and y_ℓ are linear combinations of respective groups of the sources, indexed by the non-zero elements in D_{k,:} and D_{ℓ,:}, the k-th and ℓ-th rows (resp.) of D. Let us consider the case of GF(2) first.

A. The GF(2) case

The two groups composing y_k and y_ℓ define, in turn, three other sub-groups (some of which may be empty):
1) Sub-group 1: Sources common to D_{k,:} and D_{ℓ,:}. Denote the sum of these sources as u;
2) Sub-group 2: Sources included in D_{k,:} but excluded from D_{ℓ,:}. Denote the sum of these sources as v_1;
3) Sub-group 3: Sources included in D_{ℓ,:} but excluded from D_{k,:}. Denote the sum of these sources as v_2.

For example, if (for K = 6) D_{k,:} = [1 1 1 1 1 0] and D_{ℓ,:} = [1 1 1 0 0 1], then u = s_1 ⊕ s_2 ⊕ s_3, v_1 = s_4 ⊕ s_5 and v_2 = s_6.

Note that by construction (and by independence of the elements of s), the random variables u, v_1 and v_2 are statistically independent. Their respective probabilities vectors and characteristic vectors are denoted

p_\nu = \begin{bmatrix} p_\nu(0) \\ p_\nu(1) \end{bmatrix}, \quad \tilde{p}_\nu = \begin{bmatrix} 1 \\ \theta_\nu \end{bmatrix}, \text{ with } \theta_\nu = 1 - 2\,p_\nu(1), \text{ for } \nu = u, v_1, v_2.   (42)

Obviously, y_k = u ⊕ v_1 and y_ℓ = u ⊕ v_2, so their characteristic vectors are given by

\tilde{p}_{y_k} = \tilde{p}_u \odot \tilde{p}_{v_1} = \begin{bmatrix} 1 \\ \theta_u\theta_{v_1} \end{bmatrix}, \quad \tilde{p}_{y_\ell} = \tilde{p}_u \odot \tilde{p}_{v_2} = \begin{bmatrix} 1 \\ \theta_u\theta_{v_2} \end{bmatrix},   (43)

where ⊙ denotes the Hadamard (element-wise) product. Define the random vector w ≜ [y_k  y_ℓ]^T, which can be expressed as the sum of three independent random vectors:

w = \begin{bmatrix} y_k \\ y_\ell \end{bmatrix} = \underbrace{\begin{bmatrix} v_1 \\ 0 \end{bmatrix}}_{\triangleq \bar{v}_1} \oplus \underbrace{\begin{bmatrix} u \\ u \end{bmatrix}}_{\triangleq \bar{u}} \oplus \underbrace{\begin{bmatrix} 0 \\ v_2 \end{bmatrix}}_{\triangleq \bar{v}_2}.   (44)

The probabilities matrices of the vectors v̄_1, v̄_2 and ū are evidently given by

P_{\bar{v}_1} = \begin{bmatrix} 1-p_{v_1}(1) & 0 \\ p_{v_1}(1) & 0 \end{bmatrix}, \quad P_{\bar{v}_2} = \begin{bmatrix} 1-p_{v_2}(1) & p_{v_2}(1) \\ 0 & 0 \end{bmatrix}, \quad P_{\bar{u}} = \begin{bmatrix} 1-p_u(1) & 0 \\ 0 & p_u(1) \end{bmatrix},   (45)

and therefore their characteristic matrices are given by

\tilde{P}_{\bar{v}_1} = \begin{bmatrix} 1 & 1 \\ \theta_{v_1} & \theta_{v_1} \end{bmatrix}, \quad \tilde{P}_{\bar{v}_2} = \begin{bmatrix} 1 & \theta_{v_2} \\ 1 & \theta_{v_2} \end{bmatrix}, \quad \tilde{P}_{\bar{u}} = \begin{bmatrix} 1 & \theta_u \\ \theta_u & 1 \end{bmatrix},   (46)

where (see Section II) θ_ν = E[W^ν] = E[(−1)^ν] = 1 − 2p_ν(1), for ν = v_1, v_2, u. Since v̄_1, v̄_2 and ū are statistically independent, the characteristic matrix of w is given by

\tilde{P}_w = \tilde{P}_{\bar{v}_1} \odot \tilde{P}_{\bar{u}} \odot \tilde{P}_{\bar{v}_2} = \begin{bmatrix} 1 & \theta_u\theta_{v_2} \\ \theta_u\theta_{v_1} & \theta_{v_1}\theta_{v_2} \end{bmatrix}.   (47)

On the other hand, if y_k and y_ℓ are statistically independent, the characteristic matrix of w is also given by

\tilde{P}_w = \tilde{p}_{y_k}\tilde{p}_{y_\ell}^T = \begin{bmatrix} 1 & \theta_u\theta_{v_2} \\ \theta_u\theta_{v_1} & \theta_u^2\,\theta_{v_1}\theta_{v_2} \end{bmatrix}.   (48)

Equating the expressions in (47) and (48), we get (only the (2,2) element can differ)

\theta_u^2\,\theta_{v_1}\theta_{v_2} = \theta_{v_1}\theta_{v_2}.   (49)

Since, due to Lemma 1, if none of the sources is uniform then neither v_1 nor v_2 is uniform, we have θ_{v_1}, θ_{v_2} ≠ 0, and therefore θ_u must be either 1 or −1. Since none of the sources is degenerate, this can only happen if u = 0 (deterministically), which can only happen if sub-group 1 is empty, namely, if the two rows D_{k,:} and D_{ℓ,:} do not share common sources, or, in other words, if there is no column m in D such that both D_{k,m} and D_{ℓ,m} are 1.

Applying this to all possible pairs k ≠ ℓ (for which y_k and y_ℓ are independent), and recalling that D cannot have any all-zeros row (no trivial combinations in y), we immediately arrive at the conclusion that each row and each column of D must contain exactly one 1, meaning that D is a permutation matrix.
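The step leading from (47)-(48) to (49) can be verified numerically: the characteristic vector of a binary variable is simply the 2-point DFT of its pmf, and the (2,2) elements of the two expressions for the characteristic matrix coincide only when θ_u² = 1. A small Python check (the probability values are arbitrary examples of ours):

```python
import numpy as np


def theta(p1):
    """theta = 1 - 2*p1: the non-trivial entry of the characteristic vector [1, theta]
    of a binary random variable with Pr{1} = p1 (the 2-point DFT of [1-p1, p1])."""
    return 1.0 - 2.0 * p1

th_u, th_v1, th_v2 = theta(0.3), theta(0.2), theta(0.4)   # example probabilities

# characteristic matrix of w = [u + v1, u + v2] (mod 2), built per (47)
P_w = np.array([[1.0,          th_u * th_v2],
                [th_u * th_v1, th_v1 * th_v2]])

# outer product of the two marginal characteristic vectors, per (48)
P_outer = np.outer([1.0, th_u * th_v1], [1.0, th_u * th_v2])

# difference of the (2,2) elements: th_v1*th_v2*(1 - th_u**2),
# non-zero unless u is degenerate (i.e., unless sub-group 1 is empty)
print(P_w[1, 1] - P_outer[1, 1])
```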
We now turn to the case of GF(3).

B. The GF(3) case

For simplicity of the exposition, we shall now assume that the values taken in GF(3) are {0, 1, −1} (rather than {0, 1, 2}). Just like in the GF(2) case, we partition the two groups composing y_k and y_ℓ into sub-groups, but now the first ("common") sub-group is further partitioned into four sub-subgroups:
1) Sub-group 1: Sources common to D_{k,:} and D_{ℓ,:}. We partition this sub-group into four sub-subgroups according to the coefficients in the respective rows of D as follows:
  • Sub-subgroup 1a: sources for which the respective coefficients in D_{k,:} and D_{ℓ,:} are both 1; denote the sum of these sources as u_{++};
  • Sub-subgroup 1b: sources for which the respective coefficients in D_{k,:} and D_{ℓ,:} are both −1; denote the sum of these sources as u_{−−};
  • Sub-subgroup 1c: sources for which the respective coefficients in D_{k,:} and D_{ℓ,:} are 1 and −1, resp.; denote the sum of these sources as u_{+−};
  • Sub-subgroup 1d: sources for which the respective coefficients in D_{k,:} and D_{ℓ,:} are −1 and 1, resp.; denote the sum of these sources as u_{−+};
2) Sub-group 2: Sources included in D_{k,:} but excluded from D_{ℓ,:}. Denote the respective linear combination of these sources as v_1;
3) Sub-group 3: Sources included in D_{ℓ,:} but excluded from D_{k,:}. Denote the respective linear combination of these sources as v_2.

For example, if (for K = 6) D_{k,:} = [1 1 1 −1 1 0] and D_{ℓ,:} = [1 1 −1 0 0 −1], then u_{++} = s_1 ⊕ s_2, u_{−−} = u_{−+} = 0, u_{+−} = s_3, v_1 = −s_4 ⊕ s_5 and v_2 = −s_6.

The random variables u_{++}, u_{−−}, u_{+−}, u_{−+}, v_1 and v_2 are statistically independent. Their respective probabilities vectors and characteristic vectors are denoted

p_\nu = \begin{bmatrix} p_\nu(0) \\ p_\nu(1) \\ p_\nu(2) \end{bmatrix}, \quad \tilde{p}_\nu = \begin{bmatrix} 1 \\ \xi_\nu \\ \xi_\nu^* \end{bmatrix}, \text{ for } \nu = u_{++}, u_{--}, u_{+-}, u_{-+}, v_1, v_2.   (50)

An expression for ξ_ν = E[W^ν] in terms of p_ν(0), p_ν(1) and p_ν(2) can be found in (5) above. Note further that ξ_{−ν} = ξ_ν^*, so that p̃_{−ν} = p̃_ν^*.

Evidently,

y_k = v_1 \oplus u_{++} \ominus u_{--} \oplus u_{+-} \ominus u_{-+}, \qquad y_\ell = v_2 \oplus u_{++} \ominus u_{--} \ominus u_{+-} \oplus u_{-+},   (51)

so their characteristic vectors are given by

\tilde{p}_{y_k} = \tilde{p}_{v_1} \odot \tilde{p}_{u_{++}} \odot \tilde{p}^*_{u_{--}} \odot \tilde{p}_{u_{+-}} \odot \tilde{p}^*_{u_{-+}}, \qquad \tilde{p}_{y_\ell} = \tilde{p}_{v_2} \odot \tilde{p}_{u_{++}} \odot \tilde{p}^*_{u_{--}} \odot \tilde{p}^*_{u_{+-}} \odot \tilde{p}_{u_{-+}}.   (52)

The random vector w ≜ [y_k  y_ℓ]^T can now be expressed as the sum of six independent random vectors:

w = \begin{bmatrix} y_k \\ y_\ell \end{bmatrix} = \underbrace{\begin{bmatrix} v_1 \\ 0 \end{bmatrix}}_{\triangleq \bar{v}_1} \oplus \underbrace{\begin{bmatrix} u_{++} \\ u_{++} \end{bmatrix}}_{\triangleq \bar{u}_{++}} \oplus \underbrace{\begin{bmatrix} -u_{--} \\ -u_{--} \end{bmatrix}}_{\triangleq \bar{u}_{--}} \oplus \underbrace{\begin{bmatrix} u_{+-} \\ -u_{+-} \end{bmatrix}}_{\triangleq \bar{u}_{+-}} \oplus \underbrace{\begin{bmatrix} -u_{-+} \\ u_{-+} \end{bmatrix}}_{\triangleq \bar{u}_{-+}} \oplus \underbrace{\begin{bmatrix} 0 \\ v_2 \end{bmatrix}}_{\triangleq \bar{v}_2}.   (53)

The probabilities matrices of the vectors v̄_1, v̄_2, ū_{++}, ū_{−−}, ū_{+−} and ū_{−+}, and their respective characteristic matrices, are given by

P_{\bar{v}_1} = \begin{bmatrix} p_{v_1}(0) & 0 & 0 \\ p_{v_1}(1) & 0 & 0 \\ p_{v_1}(2) & 0 & 0 \end{bmatrix} \;\Rightarrow\; \tilde{P}_{\bar{v}_1} = \begin{bmatrix} 1 & 1 & 1 \\ \xi_{v_1} & \xi_{v_1} & \xi_{v_1} \\ \xi^*_{v_1} & \xi^*_{v_1} & \xi^*_{v_1} \end{bmatrix};   (54a)

P_{\bar{v}_2} = \begin{bmatrix} p_{v_2}(0) & p_{v_2}(1) & p_{v_2}(2) \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \;\Rightarrow\; \tilde{P}_{\bar{v}_2} = \begin{bmatrix} 1 & \xi_{v_2} & \xi^*_{v_2} \\ 1 & \xi_{v_2} & \xi^*_{v_2} \\ 1 & \xi_{v_2} & \xi^*_{v_2} \end{bmatrix};   (54b)

P_{\bar{u}_{++}} = \begin{bmatrix} p_{u_{++}}(0) & 0 & 0 \\ 0 & p_{u_{++}}(1) & 0 \\ 0 & 0 & p_{u_{++}}(2) \end{bmatrix} \;\Rightarrow\; \tilde{P}_{\bar{u}_{++}} = \begin{bmatrix} 1 & \xi_{u_{++}} & \xi^*_{u_{++}} \\ \xi_{u_{++}} & \xi^*_{u_{++}} & 1 \\ \xi^*_{u_{++}} & 1 & \xi_{u_{++}} \end{bmatrix};   (55a)

P_{\bar{u}_{--}} = \begin{bmatrix} p_{u_{--}}(0) & 0 & 0 \\ 0 & p_{u_{--}}(2) & 0 \\ 0 & 0 & p_{u_{--}}(1) \end{bmatrix} \;\Rightarrow\; \tilde{P}_{\bar{u}_{--}} = \begin{bmatrix} 1 & \xi^*_{u_{--}} & \xi_{u_{--}} \\ \xi^*_{u_{--}} & \xi_{u_{--}} & 1 \\ \xi_{u_{--}} & 1 & \xi^*_{u_{--}} \end{bmatrix};   (55b)

P_{\bar{u}_{+-}} = \begin{bmatrix} p_{u_{+-}}(0) & 0 & 0 \\ 0 & 0 & p_{u_{+-}}(1) \\ 0 & p_{u_{+-}}(2) & 0 \end{bmatrix} \;\Rightarrow\; \tilde{P}_{\bar{u}_{+-}} = \begin{bmatrix} 1 & \xi^*_{u_{+-}} & \xi_{u_{+-}} \\ \xi_{u_{+-}} & 1 & \xi^*_{u_{+-}} \\ \xi^*_{u_{+-}} & \xi_{u_{+-}} & 1 \end{bmatrix};   (55c)

P_{\bar{u}_{-+}} = \begin{bmatrix} p_{u_{-+}}(0) & 0 & 0 \\ 0 & 0 & p_{u_{-+}}(2) \\ 0 & p_{u_{-+}}(1) & 0 \end{bmatrix} \;\Rightarrow\; \tilde{P}_{\bar{u}_{-+}} = \begin{bmatrix} 1 & \xi_{u_{-+}} & \xi^*_{u_{-+}} \\ \xi^*_{u_{-+}} & 1 & \xi_{u_{-+}} \\ \xi_{u_{-+}} & \xi^*_{u_{-+}} & 1 \end{bmatrix}.   (55d)

Thus, the characteristic matrix of w is given by the Hadamard product of these matrices,

\tilde{P}_w = \tilde{P}_{\bar{v}_1} \odot \tilde{P}_{\bar{v}_2} \odot \tilde{P}_{\bar{u}_{++}} \odot \tilde{P}_{\bar{u}_{--}} \odot \tilde{P}_{\bar{u}_{+-}} \odot \tilde{P}_{\bar{u}_{-+}}.   (56)
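The characteristic matrices in (54)-(55) are two-dimensional DFTs of the respective probability matrices, so each of them can be verified mechanically; the following Python fragment (ours, with arbitrary example probabilities) checks (55a):

```python
import numpy as np

W = np.exp(-2j * np.pi / 3)                      # W = e^{-j*2*pi/3}
F = W ** np.outer(np.arange(3), np.arange(3))    # (unnormalized) 3-point DFT matrix

p = np.array([0.5, 0.3, 0.2])                    # example pmf of u_{++} over {0, 1, 2}
xi = p @ (W ** np.arange(3))                     # xi = E[W^{u_{++}}]

P_upp = np.diag(p)                               # probability matrix of [u_{++}, u_{++}]^T, per (55a)
char_matrix = F @ P_upp @ F.T                    # its characteristic matrix (2-D DFT)

expected = np.array([[1,         xi,        xi.conj()],
                     [xi,        xi.conj(), 1],
                     [xi.conj(), 1,         xi]])
print(np.allclose(char_matrix, expected))        # True
```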
Now, if y_k and y_ℓ are statistically independent, then P̃_w is also given by the outer product of their characteristic vectors, which, using (52), is given by

\tilde{P}_w = \tilde{p}_{y_k}\tilde{p}_{y_\ell}^T = (\tilde{p}_{v_1}\tilde{p}_{v_2}^T) \odot (\tilde{p}_{u_{++}}\tilde{p}_{u_{++}}^T) \odot (\tilde{p}^*_{u_{--}}\tilde{p}^H_{u_{--}}) \odot (\tilde{p}_{u_{+-}}\tilde{p}^H_{u_{+-}}) \odot (\tilde{p}^*_{u_{-+}}\tilde{p}^T_{u_{-+}}),   (57)

where (·)^H denotes the conjugate transpose. Noting that p̃_{v_1} p̃_{v_2}^T = P̃_{v̄_1} ⊙ P̃_{v̄_2}, and recalling that, since v_1 and v_2 cannot be uniform, ξ_{v_1} and ξ_{v_2} must be non-zero, we conclude that the independence of y_k and y_ℓ implies that

(\tilde{p}_{u_{++}}\tilde{p}_{u_{++}}^T) \odot (\tilde{p}^*_{u_{--}}\tilde{p}^H_{u_{--}}) \odot (\tilde{p}_{u_{+-}}\tilde{p}^H_{u_{+-}}) \odot (\tilde{p}^*_{u_{-+}}\tilde{p}^T_{u_{-+}}) = \tilde{P}_{\bar{u}_{++}} \odot \tilde{P}_{\bar{u}_{--}} \odot \tilde{P}_{\bar{u}_{+-}} \odot \tilde{P}_{\bar{u}_{-+}}.   (58)

It is easy to observe that the first row and first column of each of the matrices on the left-hand side (LHS) are always identical to those of the respective matrices on the right-hand side (RHS), regardless of the values of the ξ parameters. In addition, in each of the matrices the (2,2) element is the conjugate of the (3,3) element, and the (2,3) element is the conjugate of the (3,2) element (in this context only, we enumerate these matrices' rows and columns as 1, 2, 3 rather than 0, 1, 2). Therefore, the independence of y_k and y_ℓ merely implies the equality of the products of the (2,2) elements on the LHS and on the RHS, and of the products of the (2,3) elements on the LHS and on the RHS.

The equality of the product of the (2,2) elements implies

\xi^*_{u_{++}} \cdot \xi_{u_{--}} \cdot 1 \cdot 1 = (\xi_{u_{++}})^2 \cdot (\xi^*_{u_{--}})^2 \cdot |\xi_{u_{+-}}|^2 \cdot |\xi_{u_{-+}}|^2,   (59a)

and the equality of the product of the (2,3) elements implies

1 \cdot 1 \cdot \xi^*_{u_{+-}} \cdot \xi_{u_{-+}} = |\xi_{u_{++}}|^2 \cdot |\xi_{u_{--}}|^2 \cdot (\xi_{u_{+-}})^2 \cdot (\xi^*_{u_{-+}})^2.   (59b)

Taking the absolute values of both, and recalling that since none of the random variables u_{++}, u_{−−}, u_{+−} and u_{−+} can be uniform, none of the ξ parameters can be zero, we have

|\xi_{u_{++}}| \cdot |\xi_{u_{--}}| = |\xi_{u_{++}}|^2 \cdot |\xi_{u_{--}}|^2 \cdot |\xi_{u_{+-}}|^2 \cdot |\xi_{u_{-+}}|^2 \;\Rightarrow\; |\xi_{u_{++}}| \cdot |\xi_{u_{--}}| \cdot |\xi_{u_{+-}}|^2 \cdot |\xi_{u_{-+}}|^2 = 1,
|\xi_{u_{+-}}| \cdot |\xi_{u_{-+}}| = |\xi_{u_{++}}|^2 \cdot |\xi_{u_{--}}|^2 \cdot |\xi_{u_{+-}}|^2 \cdot |\xi_{u_{-+}}|^2 \;\Rightarrow\; |\xi_{u_{++}}|^2 \cdot |\xi_{u_{--}}|^2 \cdot |\xi_{u_{+-}}| \cdot |\xi_{u_{-+}}| = 1.   (60)

Since for any random variable ν in GF(3), |ξ_ν| ≤ 1 with equality iff ν is degenerate, we conclude from (60) that if y_k and y_ℓ are independent, then u_{++}, u_{−−}, u_{+−} and u_{−+} must all be degenerate. Since none of the independent sources is degenerate, this implies, in turn, that all four are identically zero, and that there are no non-zero elements common to D_{k,:} and D_{ℓ,:}.

Like in the GF(2) case, by repeated application of this result to all row-couples in D, we conclude that pairwise independence of the elements of y implies that D is (up to signs) a permutation matrix, namely that the elements of y are fully mutually independent.

REFERENCES

[1] A. Yeredor, "ICA in Boolean XOR mixtures," Proc. ICA 2007, Lecture Notes in Computer Science (LNCS 4666), vol. 45, no. 1, pp. 17-24, 2007.
[2] P. Comon, "Independent component analysis - a new concept?," Signal Processing, vol. 36, pp. 287-314, 1994.
[3] J.-F. Cardoso, "Blind signal separation: statistical principles," Proceedings of the IEEE, vol. 86, no. 10, pp. 2009-2025, 1998.
[4] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, Inc., 2001.
[5] J. Eriksson and V. Koivunen, "Complex random vectors and ICA models: identifiability, uniqueness, and separability," IEEE Trans. Information Theory, vol. 52, no. 3, pp. 1017-1029, 2006.
[6] C. R. Rao, Linear Statistical Inference and its Applications, Wiley, New York, 1973.
[7] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Trans. Signal Processing, vol. 45, no. 2, pp. 434-444, 1997.
[8] D.-T. Pham and J.-F. Cardoso, "Blind separation of instantaneous mixtures of nonstationary sources," IEEE Trans. Signal Processing, vol. 49, no. 9, pp. 1837-1848, 2001.
[9] N. Delfosse and P. Loubaton, "Adaptive blind separation of independent sources: a deflation approach," Signal Processing, vol. 45, no. 1, pp. 59-83, 1995.
[10] J.-F. Cardoso and B. Laheld, "Equivariant adaptive source separation," IEEE Trans. Signal Processing, vol. 44, no. 12, pp. 3017-3030, 1996.
[11] A. Yeredor, Matlab® code for ICA over GF(P) - AMERICA and MEXICO, online: ~arie/ICA4GFP.rar.