Effect of indirect dependencies on "A mutual information minimization approach for a class of nonlinear recurrent separating systems"
arXiv [stat.ME]
Yannick Deville (1), Alain Deville (2), and Shahram Hosseini (1)

(1) Laboratoire d'Astrophysique de Toulouse-Tarbes, Université de Toulouse, CNRS, 14 Av. Edouard Belin, 31400 Toulouse, France. Email: [email protected], [email protected]

(2) IM2NP, Université de Provence, Centre de Saint-Jérôme, 13397 Marseille Cedex 20, France. Email: [email protected]

Abstract.
In a recent paper [4], Duarte and Jutten investigated the Blind Source Separation (BSS) problem, for the nonlinear mixing model that they introduced in that paper. They proposed to solve this problem by using information-theoretic tools, more precisely by minimizing the mutual information (MI) of the outputs of the separating structure. When applying the MI approach to BSS problems, one usually determines the analytical expressions of the derivatives of the MI with respect to the parameters of the considered separating model. In the literature, these calculations were mainly reported for linear mixtures up to now. They are more complex for nonlinear mixtures, due to dependencies between the considered quantities. Moreover, the notations commonly employed by the BSS community in such calculations may become misleading when using them for nonlinear mixtures, due to the above-mentioned dependencies. We claim that the calculations reported in [4] contain an error, because they did not take into account all these dependencies. In this document, we therefore explain this phenomenon, by showing the effect of indirect dependencies on the application of the MI approach to the mixing and separating models considered in [4]. We thus introduce a corrected expression of the gradient of the considered BSS criterion based on MI. This correct gradient may then e.g. be used to optimize the adaptive coefficients of the considered separating system by means of the well-known gradient descent algorithm. As explained hereafter, this investigation has some similarities with an analysis that we previously reported in another arXiv document [3]. However, these two investigations concern different problems, not only in terms of the considered type of mixture and separating structure, but also of the mathematical tools used to develop BSS methods for these configurations (information theory vs maximum likelihood approach).
Keywords.
Information theory, mutual information, blind signal separation, independent component analysis, nonlinear mixture, additive-target mixture (ATM), recurrent separating structure, indirect dependency, total derivative, partial derivative, gradient.

1 Data model
Blind source separation (BSS) consists in restoring a vector s(t) of N unknown source signals from a vector x(t) of P observed signals (most often with P = N), where x(t) is derived from s(t) through an unknown mixing function g, i.e.

x(t) = g(s(t)).    (1)

Recently, Duarte and Jutten investigated a specific version of this problem [4], which involves P = 2 observed signals x_1(t) and x_2(t), which are derived from N = 2 source signals s_1(t) and s_2(t), through the nonlinear function defined as

x_1(t) = s_1(t) + a_{12} (s_2(t))^{k_1}    (2)
x_2(t) = s_2(t) + a_{21} (s_1(t))^{k_2}.    (3)

This data model is derived from the Nikolsky-Eisenman empirical model for potentiometric-based ion concentration sensors [4]. As in [4], we omit the time index t in signal notations hereafter, for readability. The mixing model (2)-(3) may then also be expressed in compact form as

x = g(s).    (4)

In this equation, s = [s_1, s_2]^T and x = [x_1, x_2]^T, where T stands for transpose, and the nonlinear mixing function g has two components g_1 and g_2, with x_i = g_i(s), ∀ i ∈ {1, 2}. These components g_i are respectively defined by (2) and (3). Eq. (4) focuses on the signals (i.e. sources and observations). It hides the fact that the observations also depend on the parameters of the mixing model, i.e. on a_{12} and a_{21} in the model considered here. This additional dependency can be made explicit, by rewriting (4) as

x = g(s, a_{12}, a_{21}).    (5)

As suggested above, the BSS problem associated with the mixing model (2)-(3) consists in retrieving a sequence of unknown source vectors s from the corresponding sequence of measured observation vectors x and from the mixing parameters a_{12} and a_{21}, which are also initially unknown. These mixing parameters should therefore be estimated before proceeding to the source restoration step. Creating an overall BSS method thus consists in defining two items, i.e.
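As a concrete illustration, the mixing stage can be simulated as follows. This is a minimal sketch: the coefficient names (a12, a21) reconstruct the lost subscripts of the extraction, and all numerical values and source distributions are assumptions chosen for illustration, not values taken from [4].

```python
import numpy as np

# Sketch of the mixing model (2)-(3) with assumed illustrative parameters.
rng = np.random.default_rng(0)
a12, a21 = 0.5, 0.4        # hypothetical mixing coefficients
k1, k2 = 2.0, 3.0          # hypothetical Nikolsky-Eisenman-type exponents

# Positive sources, so that non-integer exponents are also well defined.
s1 = rng.uniform(0.1, 1.0, size=1000)
s2 = rng.uniform(0.1, 1.0, size=1000)

# Observations: each sensor sees its "target" source plus a nonlinear
# interference term built from the other source, as in (2)-(3).
x1 = s1 + a12 * s2 ** k1
x2 = s2 + a21 * s1 ** k2
```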
i) a "separating structure", which performs the inversion of the mixing equations (2)-(3) for known mixing parameter values, and ii) a procedure for estimating these mixing parameters.

The separating structure used in [4] was derived by Duarte and Jutten from the structure for linear-quadratic mixtures proposed by Hosseini and Deville in [5],[6],[1],[2]. The structure in [4] belongs to the general class of structures proposed by Deville and Hosseini in [2] for the ATM class of mixing models, which includes the specific model (2)-(3).

As for the estimation of the mixing parameters, Duarte and Jutten developed a procedure based on information-theoretic tools, more precisely on the minimization of the mutual information (MI) of the outputs of the separating structure. However, we here claim that this procedure contains an error, which is due to a difficulty encountered with nonlinear mixing models in general, for different classes of BSS methods. This difficulty is somewhat similar to the one that we highlighted in another arXiv document [3]: unlike the method considered hereafter, the BSS approach described in [3] is not based on information-theoretic tools, but on the maximum likelihood framework. Moreover, it concerns a different class of nonlinear mixtures. However, similar quantities appear in the calculations performed for both methods, and they deserve special care in both of them. The current document therefore aims at explaining and correcting the error which was made in [4]. We thus show how the BSS method of [4] should be modified so as to actually achieve mutual information minimization. Before focusing on the issue faced in [4], we now summarize the features of that approach which are of importance hereafter.

The considered separating structure has internal adaptive coefficients w_{12} and w_{21}. For each time t, this structure determines an output vector y = [y_1, y_2]^T from its current internal coefficients and from the current observation vector x.
To this end, it iteratively updates its output according to

y_1(n+1) = x_1 - w_{12} (y_2(n))^{k_1}    (6)
y_2(n+1) = x_2 - w_{21} (y_1(n))^{k_2}.    (7)

The convergence of this recurrence therefore corresponds to a state such that

y_1 = x_1 - w_{12} y_2^{k_1}    (8)
y_2 = x_2 - w_{21} y_1^{k_2}.    (9)

For a given time t, we denote as Y_1 and Y_2 the random variables respectively associated with the output signal samples y_1 and y_2 obtained after the above recurrence has converged. We also define the corresponding output random vector as Y = [Y_1, Y_2]^T. The optimum values of w_{12} and w_{21} are defined as those which minimize the mutual information of Y_1 and Y_2, which is denoted I(Y). Equivalently, they are those which minimize a quantity C(Y). This quantity is equal to I(Y), up to an additive term which only depends on the observations and which therefore does not depend on w_{12} and w_{21}. That quantity reads

C(Y) = ( \sum_{i=1}^{2} H(Y_i) ) - E{ ln |J_h| }    (10)

where H(Y_i) is the differential entropy of Y_i, while E{.} stands for expectation and J_h is the Jacobian of the separating function h = g^{-1} achieved by the considered separating structure, i.e. J_h is the determinant of the Jacobian matrix of h.

(Footnote: The quantities to be respectively considered in these two methods depend on different signals (source signals vs outputs of the separating system) and functions (mixing function vs separating function). However, these signals and functions yield similar phenomena concerning the topic addressed in this document.)

(Footnote: For the sake of readability, we use the same notation, i.e. J_h, for (i) the sample value of this Jacobian associated to sample values y_1 and y_2 (see e.g. (11)) and (ii) the random variable defined by this quantity when considered as a function of the random variables Y_1 and Y_2 (see e.g. (12)). To know whether we are considering the sample value of J_h or the associated random variable in an equation, one just has to check whether that equation involves the sample values y_1 and y_2 or the associated random variables Y_1 and Y_2: see e.g. (11) and (12).)
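The recurrence (6)-(7) and its fixed point can be sketched as follows. All parameter values and source distributions are assumptions, chosen small enough that the recurrence is contracting; with the separating coefficients set equal to the mixing coefficients, the fixed point recovers the sources.

```python
import numpy as np

def separate(x1, x2, w12, w21, k1, k2, n_iter=200):
    """Iterate y1(n+1) = x1 - w12*y2(n)**k1, y2(n+1) = x2 - w21*y1(n)**k2."""
    y1, y2 = x1, x2                       # initialize outputs at observations
    for _ in range(n_iter):
        y1, y2 = x1 - w12 * y2 ** k1, x2 - w21 * y1 ** k2
    return y1, y2

rng = np.random.default_rng(1)
a12, a21, k1, k2 = 0.1, 0.1, 2, 2         # hypothetical mixing parameters
s1 = rng.uniform(0.1, 0.5, 1000)
s2 = rng.uniform(0.1, 0.5, 1000)
x1 = s1 + a12 * s2 ** k1                  # mixing model (2)-(3)
x2 = s2 + a21 * s1 ** k2

# With w12 = a12 and w21 = a21, the fixed point (8)-(9) restores the sources.
y1, y2 = separate(x1, x2, a12, a21, k1, k2)
```

The simultaneous (Jacobi-style) update above converges here because the product of the cross-coupling derivatives stays well below one for these assumed parameter values.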
For the function h considered in this investigation, the authors show that

J_h = 1 / [1 - k_1 k_2 w_{12} w_{21} y_1^{k_2-1} y_2^{k_1-1}].    (11)

To determine the values of w_{12} and w_{21} which minimize C(Y), the authors then consider the gradient of C(Y) with respect to the vector composed of w_{12} and w_{21}. Each component of this gradient is equal to the derivative of C(Y) with respect to one of the parameters w_{kℓ}. In [4], the authors denoted this gradient by using the notation most often employed in the BSS community (see e.g. [7]), i.e. each of its components reads ∂C(Y)/∂w_{kℓ}. We keep this notation in this section, in order to clearly refer to the equations available in [4], but in Section 3 we will show that it may be misleading and we will therefore introduce another notation. So, in [4], it was shown that these derivatives read

∂C(Y)/∂w_{kℓ} = ( \sum_{i=1}^{2} E{ ψ_i(Y_i) ∂Y_i/∂w_{kℓ} } ) - E{ (1/J_h) ∂J_h/∂w_{kℓ} }    (12)

where

ψ_i(u) = - d ln f_{Y_i}(u) / du,  ∀ i ∈ {1, 2}    (13)

are the score functions of the output signals, denoting f_{Y_i}(.) the probability density functions of these signals.

The last stage of this investigation consists in deriving the expressions of all the terms of the right-hand side of (12). In Equation (26) of [4], an explicit expression is provided and it is stated that it is equal to (the vector form of) the term E{ (1/J_h) ∂J_h/∂w_{kℓ} } which appears in (12). We claim that this is not true, because the expression whose expectation is provided in the right-hand side of Equation (26) of [4] is only one of the terms which compose the complete expression to be then used in (12) as the term misleadingly denoted (1/J_h) ∂J_h/∂w_{kℓ} in (12). In the following section of the current document, we clarify this point and we determine the complete expression of the term denoted (1/J_h) ∂J_h/∂w_{kℓ} in (12). We also comment about the other terms of (12).

When determining the values of w_{12} and w_{21} which minimize C(Y), that function C(Y) is considered for the fixed set of observed vectors.
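As a sanity check, the closed form (11) can be compared with a numerically estimated Jacobian determinant of the separating map h : x -> y. This is a sketch under assumed illustrative parameter values (none of the numbers below come from [4]); the coefficient names w12, w21 reconstruct the lost subscripts.

```python
def fixed_point(x1, x2, w12, w21, k1, k2, n_iter=500):
    """Iterate the separating recurrence (6)-(7) to its fixed point (8)-(9)."""
    y1, y2 = x1, x2
    for _ in range(n_iter):
        y1, y2 = x1 - w12 * y2 ** k1, x2 - w21 * y1 ** k2
    return y1, y2

# Hypothetical operating point; all numerical values are illustrative.
w12, w21, k1, k2 = 0.1, 0.1, 2, 2
x1, x2 = 0.8, 0.6
y1, y2 = fixed_point(x1, x2, w12, w21, k1, k2)

# Closed form (11) for the Jacobian of the separating function h.
J_h = 1.0 / (1.0 - k1 * k2 * w12 * w21 * y1 ** (k2 - 1) * y2 ** (k1 - 1))

# Numerical cross-check: finite-difference Jacobian matrix of h, then its
# determinant (2x2 determinant computed directly).
eps = 1e-6
y1a, y2a = fixed_point(x1 + eps, x2, w12, w21, k1, k2)   # column d/dx1
y1b, y2b = fixed_point(x1, x2 + eps, w12, w21, k1, k2)   # column d/dx2
J_num = ((y1a - y1) * (y2b - y2) - (y1b - y1) * (y2a - y2)) / eps ** 2
```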
The only independent variable in this approach is the set of parameters to be estimated, i.e. w_{12} and w_{21}. The outputs y_1 and y_2 of the separating system are dependent variables, here linked to the observations and to w_{12} and w_{21} by (8)-(9). The overall variations of C(Y) with respect to w_{12} and w_{21} result from two types of terms contained in the expression of C(Y), i.e. (i) the terms involving w_{12} and w_{21} themselves and (ii) the terms involving the output random variables Y_1 and Y_2, which are here considered as functions of w_{12} and w_{21} and which may therefore be denoted as Y_1(w_{12}, w_{21}) and Y_2(w_{12}, w_{21}) for the sake of clarity.

This approach should be kept in mind when interpreting all equations in [4], which were partly gathered in Section 2 of the current document. Especially, the function C(Y) itself, which appears in the left-hand side of (10), may be denoted as C(w_{12}, w_{21}, Y_1(w_{12}, w_{21}), Y_2(w_{12}, w_{21})) for the sake of clarity. In order to determine the location of the minimum of this function, one should then consider the total derivatives of C(w_{12}, w_{21}, Y_1(w_{12}, w_{21}), Y_2(w_{12}, w_{21})) with respect to w_{12} and w_{21}. The notations with partial derivatives in (12) may therefore be misleading, as confirmed below. Therefore, (12) should preferably be rewritten as

dC(Y)/dw_{kℓ} = ( \sum_{i=1}^{2} E{ ψ_i(Y_i) dY_i/dw_{kℓ} } ) - E{ (1/J_h) dJ_h/dw_{kℓ} }    (14)

still with (13). The term dJ_h/dw_{kℓ} in (14) then deserves some care because, as shown by (11), the Jacobian J_h contains the above-defined two types of dependencies with respect to w_{12} and w_{21}, i.e. (i) direct dependencies due to the factors in (11) which explicitly contain w_{12} and w_{21} and (ii) indirect dependencies due to the factors in (11) which depend on y_1 and y_2, which themselves depend on w_{12} and w_{21} in this approach. We here have to consider the total derivative dJ_h/dw_{kℓ}, which takes into account both types of dependencies, and which therefore reads

dJ_h/dw_{kℓ} = ∂J_h/∂w_{kℓ} + \sum_{i=1}^{2} (∂J_h/∂y_i)(dy_i/dw_{kℓ}).    (15)
In this expression, ∂J_h/∂w_{kℓ} is the partial derivative of J_h with respect to w_{kℓ}, calculated by considering that the signals y_1 and y_2 are constant (in addition to the fact that the other internal coefficient w_{ℓk} of the separating system is also constant). This partial derivative is the quantity that is taken into account in the right-hand side of (26) of [4]. However, let us insist again that this partial derivative is first to be added with the other terms in the right-hand side of (15), in order to obtain the overall total derivative dJ_h/dw_{kℓ} defined by (15). What should eventually be used in the last term of (12) or (14) is this total derivative.

So, starting from the expression of J_h provided in (11), one easily derives all its partial derivatives involved in (15). They read as follows:

∂J_h/∂w_{12} = k_1 k_2 w_{21} y_1^{k_2-1} y_2^{k_1-1} / [1 - k_1 k_2 w_{12} w_{21} y_1^{k_2-1} y_2^{k_1-1}]^2    (16)

∂J_h/∂w_{21} = k_1 k_2 w_{12} y_1^{k_2-1} y_2^{k_1-1} / [1 - k_1 k_2 w_{12} w_{21} y_1^{k_2-1} y_2^{k_1-1}]^2    (17)

(Footnote: Each derivative dC(Y)/dw_{kℓ} is "total" only with respect to the considered coefficient w_{kℓ} (which is one of the two coefficients w_{12} and w_{21}), i.e. it takes into account all variations of C(Y) with respect to that coefficient w_{kℓ} while the other coefficient, i.e. w_{ℓk}, is kept constant. For the sake of clarity, we could therefore denote that derivative (dC(Y)/dw_{kℓ})_{w_{ℓk}}, to show that w_{ℓk} is constant. However, this would decrease readability. Therefore, in all this paper we omit the notation (.)_{w_{ℓk}}, but it should be kept in mind that each considered derivative with respect to w_{kℓ} is calculated with w_{ℓk} constant. Then, in this framework, what we have to distinguish are: (i) the total derivative due to the variations of w_{kℓ}, Y_1 and Y_2 and (ii) the partial derivative only due to w_{kℓ}. We then have to use two different notations for these two types of derivatives, such as dJ_h/dw_{kℓ} and ∂J_h/∂w_{kℓ} in (15). This type of notation is commonly used in the literature for functions which depend (i) on a single independent variable, i.e. time, and (ii) on other variables which themselves depend on time, such as coordinate variables: see e.g. http://en.wikipedia.org/wiki/Total_derivative . We here extend this concept to a configuration which involves several independent variables, i.e. w_{12} and w_{21} (and, again, other variables which themselves depend on the independent variables, i.e. Y_1 and Y_2). We keep the same type of notations as in the standard case involving a single independent variable.)

∂J_h/∂y_1 = k_1 k_2 (k_2 - 1) w_{12} w_{21} y_1^{k_2-2} y_2^{k_1-1} / [1 - k_1 k_2 w_{12} w_{21} y_1^{k_2-1} y_2^{k_1-1}]^2    (18)

∂J_h/∂y_2 = k_1 k_2 (k_1 - 1) w_{12} w_{21} y_1^{k_2-1} y_2^{k_1-2} / [1 - k_1 k_2 w_{12} w_{21} y_1^{k_2-1} y_2^{k_1-1}]^2.    (19)

The case when k_1 = k_2 = 1 deserves a comment. As shown by (2)-(3), the mixing model then becomes linear. Besides, as shown by (18)-(19), we then have

∂J_h/∂y_1 = 0    (20)
∂J_h/∂y_2 = 0,    (21)

so that the total derivative dJ_h/dw_{kℓ} in (15) becomes equal to the partial derivative ∂J_h/∂w_{kℓ}. This clearly shows that the problems due to the distinction between these two derivatives, that we address in this paper, concern nonlinear mixtures.

The last terms which are required to obtain the complete expressions in (14) and (15) are all four derivatives dy_i/dw_{kℓ}. For the sake of clarity, we now show how they may be considered, when taking into account the above comments about total and partial derivatives. Here again, w_{12} and w_{21} should be considered as the independent variables, while y_1 and y_2 are functions of them and the observations are constant. All these parameters are linked by (8)-(9). By first computing the total derivatives of the latter equations with respect to w_{12}, one gets

dy_1/dw_{12} = - ( y_2^{k_1} + w_{12} k_1 y_2^{k_1-1} dy_2/dw_{12} )    (22)

dy_2/dw_{12} = - w_{21} k_2 y_1^{k_2-1} dy_1/dw_{12}.    (23)

Inserting (23) in (22), one derives

dy_1/dw_{12} = - y_2^{k_1} / [1 - k_1 k_2 w_{12} w_{21} y_1^{k_2-1} y_2^{k_1-1}].    (24)

Then inserting (24) in (23), one obtains

dy_2/dw_{12} = k_2 w_{21} y_1^{k_2-1} y_2^{k_1} / [1 - k_1 k_2 w_{12} w_{21} y_1^{k_2-1} y_2^{k_1-1}].    (25)
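The role of the indirect terms in (15) can be checked numerically: holding the observations fixed, perturbing w_{12} and letting the outputs follow through the fixed point (8)-(9) reproduces the total derivative of J_h, while the partial derivative alone is biased. This is a sketch with assumed illustrative parameter values; the closed forms coded below transcribe (16), (18)-(19) and (24), with the last output derivative obtained by substituting (24) into (23).

```python
def fixed_point(x1, x2, w12, w21, k1, k2, n_iter=500):
    """Iterate the separating recurrence (6)-(7) to its fixed point (8)-(9)."""
    y1, y2 = x1, x2
    for _ in range(n_iter):
        y1, y2 = x1 - w12 * y2 ** k1, x2 - w21 * y1 ** k2
    return y1, y2

def jacobian(y1, y2, w12, w21, k1, k2):
    """Closed form (11) for J_h."""
    return 1.0 / (1.0 - k1 * k2 * w12 * w21 * y1 ** (k2 - 1) * y2 ** (k1 - 1))

# Hypothetical operating point; all numerical values are illustrative.
w12, w21, k1, k2 = 0.1, 0.1, 2, 2
x1, x2 = 0.8, 0.6
y1, y2 = fixed_point(x1, x2, w12, w21, k1, k2)
m = 1.0 - k1 * k2 * w12 * w21 * y1 ** (k2 - 1) * y2 ** (k1 - 1)  # denominator of (11)

# Partial derivative (16), indirect factors (18)-(19), output derivatives
# (24) and the one deduced from (23).
dJ_dw12 = k1 * k2 * w21 * y1 ** (k2 - 1) * y2 ** (k1 - 1) / m ** 2
dJ_dy1 = k1 * k2 * (k2 - 1) * w12 * w21 * y1 ** (k2 - 2) * y2 ** (k1 - 1) / m ** 2
dJ_dy2 = k1 * k2 * (k1 - 1) * w12 * w21 * y1 ** (k2 - 1) * y2 ** (k1 - 2) / m ** 2
dy1_dw12 = -y2 ** k1 / m
dy2_dw12 = k2 * w21 * y1 ** (k2 - 1) * y2 ** k1 / m

# Total derivative (15): direct term plus indirect (output-mediated) terms.
dJ_total = dJ_dw12 + dJ_dy1 * dy1_dw12 + dJ_dy2 * dy2_dw12

# Finite-difference reference: perturb w12, let the outputs follow through the
# fixed point, then difference J_h.
eps = 1e-6
y1e, y2e = fixed_point(x1, x2, w12 + eps, w21, k1, k2)
dJ_num = (jacobian(y1e, y2e, w12 + eps, w21, k1, k2)
          - jacobian(y1, y2, w12, w21, k1, k2)) / eps
```

At this operating point the indirect terms shift the derivative by a few percent, which is exactly the discrepancy discussed in this document: only the total derivative matches the finite-difference reference.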