On a convergence property of a geometrical algorithm for statistical manifolds
Shotaro Akaho, Hideitsu Hino, Noboru Murata

National Institute of Advanced Industrial Science and Technology, Tsukuba, Ibaraki 305-8568, Japan
The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan
Waseda University, Shinjuku, Tokyo 169-0072, Japan
RIKEN Center for Advanced Intelligence Project, Chuo, Tokyo 103-0027, Japan

September 30, 2019

* This is a full version of the paper presented at ICONIP2019.
Abstract
In this paper, we examine a geometrical projection algorithm for statistical inference. The algorithm is based on the Pythagorean relation, and it is derivative-free as well as representation-free, which is useful in nonparametric cases. We derive a bound on the learning rate that guarantees local convergence. In the special cases of m-mixture and e-mixture estimation problems, we calculate specific forms of the bound that can be used easily in practice.
1 Introduction

Information geometry is a framework to analyze statistical inference and machine learning [2]. Geometrically, statistical inference and many machine learning algorithms can be regarded as procedures to find a projection from a given data point to a model subspace. In this paper, we focus on an algorithm to find the projection.

Since the projection is given by minimizing a divergence, a common approach to finding it is a gradient-based method [6]. However, such an approach is not applicable in some cases. For instance, there have been several attempts to extend the information geometrical framework to nonparametric cases [3, 9, 13, 15], where we need to consider a function space or where each data point is represented as a point process. In such cases, it is difficult to compute the derivative of the divergence that gradient-based methods require, and in some cases it is difficult to deal with the coordinate explicitly.

Takano et al. [15] proposed a geometrical algorithm to find the projection for the nonparametric e-mixture distribution, where the model subspace is spanned by several empirical distributions. The algorithm, which is derived from the generalized Pythagorean theorem, depends only on the values of divergences. It is derivative-free as well as representation-free, and it is applicable to many machine learning algorithms that can be regarded as finding a projection, but its convergence property has not been analyzed yet. The first contribution of this paper is to extend the algorithm to more general cases. The second contribution is to give a condition for the convergence of the algorithm, which takes the form of a bound on the learning rate. In the case of the discrete distribution, we obtain specific forms of the bound that can be used easily in practice.

2 Information geometry

Here we briefly review information geometry in order to explain the proposed geometrical algorithm based on the generalized Pythagorean theorem [12].
2.1 Dually flat manifolds and projections

Let $(S, g, \nabla, \tilde{\nabla})$ be a statistical manifold, where $S$ is a smooth manifold with a Riemannian metric $g$ and dual affine connections $\nabla$ and $\tilde{\nabla}$. We consider the case that $S$ is (dually) flat, where there exist a $\nabla$-affine coordinate $\theta = (\theta_1, \dots, \theta_d)$ and a $\tilde{\nabla}$-affine coordinate $\eta = (\eta_1, \dots, \eta_d)$. For a flat manifold, there exist potential functions $\psi(\theta)$ and $\varphi(\eta)$, and the two coordinates $\theta$ and $\eta$ are transformed to each other by the Legendre transform,

$$\theta_i = \frac{\partial \varphi(\eta)}{\partial \eta_i}, \quad \eta_i = \frac{\partial \psi(\theta)}{\partial \theta_i}, \quad \psi(\theta) + \varphi(\eta) - \sum_{i=1}^d \theta_i \eta_i = 0. \tag{1}$$

A typical example of a flat manifold is an exponential family, where each member of the manifold is a distribution of a random variable $x$ with parameter $\xi = (\xi_1, \dots, \xi_d)$,

$$p(x; \xi) = \exp\left(\sum_{i=1}^d \xi_i F_i(x) - b(\xi)\right), \tag{2}$$

where $F_i(x)$ is a sufficient statistic and $\exp(-b(\xi))$ is a normalization factor. For the exponential family, there are two dual connections, called the e-connection and the m-connection (e: exponential, m: mixture). If we take the e-connection as the $\nabla$-connection, the $\nabla$-affine coordinate $\theta$ is equal to $\xi$ and is called the e-coordinate, and the $\tilde{\nabla}$-affine coordinate, called the m-coordinate, is given by

$$\zeta_i = \frac{\partial b(\xi)}{\partial \xi_i} = E_\xi[F_i(x)] = \int F_i(x)\, p(x; \xi)\, dx, \tag{3}$$

where the function $b(\xi)$ becomes the potential function $\psi(\theta)$. Note that if we take the m-connection as $\nabla$, the relation changes in a dual way, i.e., $\zeta$ becomes $\theta$ and $\xi$ becomes $\eta$.

Here, for $p \in S$, we denote the corresponding $\nabla$- and $\tilde{\nabla}$-coordinates by $\theta(p)$ and $\eta(p)$ respectively. Let us consider a submanifold defined by a linear combination of $K$ points $p_1, \dots, p_K \in S$,

$$M = \Big\{ p \,\Big|\, \theta(p) = \sum_{k=1}^K w_k \theta(p_k), \ \sum_{k=1}^K w_k = 1 \Big\}, \tag{4}$$

where $w = (w_1, \dots, w_K)$ is a weight vector whose components sum to 1. The submanifold $M$ is an affine subspace and hence is called a $\nabla$-autoparallel (or $\nabla$-flat) submanifold. In particular, if $K = 2$, $M$ is a straight line in the $\theta$-coordinate, called a $\nabla$-geodesic.

We can also consider another submanifold in the dual coordinate,

$$\tilde{M} = \Big\{ p \,\Big|\, \eta(p) = \sum_{k=1}^K w_k \eta(p_k), \ \sum_{k=1}^K w_k = 1 \Big\}, \tag{5}$$

which is called a $\tilde{\nabla}$-autoparallel (or $\tilde{\nabla}$-flat) submanifold. The $\tilde{\nabla}$-geodesic is defined as a straight line in the $\eta$-coordinate.

Now let us define a $\nabla$-projection and a $\tilde{\nabla}$-projection from a point $q \in S$ onto a submanifold $M$. The $\nabla$-projection is a point $q^* \in M$ such that the $\nabla$-geodesic between $q$ and $q^*$ is orthogonal to $M$ at $q^*$ with respect to the Riemannian metric $g_{ij}(\theta(q^*))$. In the statistical manifold, $g_{ij}$ is taken as

$$g_{ij}(\theta) = \frac{\partial^2 \psi(\theta)}{\partial \theta_i \partial \theta_j}, \tag{6}$$

which is equal to the Fisher information for the exponential family,

$$g_{ij}(\xi) = E_\xi\left[\frac{\partial \log p(x; \xi)}{\partial \xi_i} \frac{\partial \log p(x; \xi)}{\partial \xi_j}\right]. \tag{7}$$

In a similar way, the $\tilde{\nabla}$-projection onto a submanifold $M$ is defined as a point $q^*$ such that the $\tilde{\nabla}$-geodesic connecting $q$ and $q^*$ is orthogonal to $M$.

Theorem 1 (Generalized Pythagorean theorem [12]). Let $\tilde{M}$ be a $\tilde{\nabla}$-autoparallel submanifold of a statistical manifold $S$, and let $q^* \in \tilde{M}$ be the $\nabla$-projection from a point $q \in S$. Then for any point $p \in \tilde{M}$, the following relation holds:

$$D(p, q) = D(q^*, q) + D(p, q^*), \tag{8}$$

where $D$ is the canonical divergence defined by

$$\begin{aligned} D(p, q) &= \psi(\theta(q)) + \varphi(\eta(p)) - \sum_{i=1}^d \eta_i(p)\, \theta_i(q) \\ &= \psi(\theta(q)) - \psi(\theta(p)) + \sum_{i=1}^d \eta_i(p)\big(\theta_i(p) - \theta_i(q)\big) \\ &= -\varphi(\eta(q)) + \varphi(\eta(p)) + \sum_{i=1}^d \theta_i(q)\big(\eta_i(q) - \eta_i(p)\big). \end{aligned} \tag{9}$$
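As a concrete illustration of these dual structures, the following Python sketch instantiates the coordinates and potentials for the discrete (categorical) family treated in detail in Sec. 4, where $\theta_i = \log(q_i/q_d)$, $\eta_i = q_i$, $\psi(\theta) = -\log q_d$ and $\varphi(\eta)$ is the negative entropy, and numerically verifies the Legendre identity (1) and that the canonical divergence (9) equals the KL divergence (13). The specific distributions are arbitrary test values; this is a minimal sanity check, not part of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def coords(q):
    """e-coordinate theta, m-coordinate eta, and potentials for a categorical q."""
    theta = np.log(q[:-1] / q[-1])   # e-coordinates, cf. Eq. (43) below
    eta = q[:-1]                     # m-coordinates, cf. Eq. (44) below
    psi = -np.log(q[-1])             # potential psi(theta) = b(xi)
    phi = np.sum(q * np.log(q))      # dual potential phi(eta): negative entropy
    return theta, eta, psi, phi

p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))
th_p, eta_p, psi_p, phi_p = coords(p)
th_q, eta_q, psi_q, phi_q = coords(q)

# Legendre identity (1) at a single point
assert abs(psi_p + phi_p - th_p @ eta_p) < 1e-12

# Canonical divergence (9) equals the KL divergence (13)
D = psi_q + phi_p - eta_p @ th_q
assert abs(D - np.sum(p * np.log(p / q))) < 1e-12
print("Legendre identity and D(p, q) = KL(p || q) verified")
```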
By exchanging $\nabla$ and $\tilde{\nabla}$, we have a dual relation, i.e., for a $\nabla$-autoparallel submanifold $M$, the $\tilde{\nabla}$-projection $q^* \in M$ from a point $q \in S$ satisfies

$$\tilde{D}(p, q) = \tilde{D}(q^*, q) + \tilde{D}(p, q^*), \tag{10}$$

where $p \in M$ and $\tilde{D}$ is the dual divergence defined by $\tilde{D}(p, q) = D(q, p)$.

From this theorem, we see that a $\nabla$-projection ($\tilde{\nabla}$-projection) onto a $\tilde{\nabla}$-autoparallel ($\nabla$-autoparallel, respectively) submanifold is unique and can be found by minimizing the corresponding divergence, i.e., the $\nabla$-projection is given by

$$q^* = \arg\min_{p \in \tilde{M}} D(p, q) \tag{11}$$

and the $\tilde{\nabla}$-projection is given by

$$q^* = \arg\min_{p \in M} \tilde{D}(p, q). \tag{12}$$

For the exponential family (2), taking the e-connection as the $\nabla$-connection, the divergence is equal to the Kullback-Leibler divergence,

$$D(p, q) = \int p(x; \xi(p)) \log \frac{p(x; \xi(p))}{p(x; \xi(q))}\, dx. \tag{13}$$

If we take the e-connection as the $\nabla$ or $\tilde{\nabla}$ connection, the corresponding projection and autoparallel submanifold are called an e-projection and an e-autoparallel submanifold, and similarly an m-projection and an m-autoparallel submanifold are defined for the m-connection.

2.2 Geometrical algorithm for projection

Now we propose a geometrical algorithm to find a $\nabla$-projection (or $\tilde{\nabla}$-projection) onto a $\tilde{\nabla}$-autoparallel (or $\nabla$-autoparallel, respectively) submanifold. To avoid redundant description, we only formulate the $\nabla$-projection onto a $\tilde{\nabla}$-autoparallel submanifold, since the dual case is obtained by simply exchanging $\nabla$ and $\tilde{\nabla}$.

In this paper, we impose a restriction on the projection.

Assumption 2.
The projection belongs to the convex hull of $p_1, \dots, p_K$ in (4) and (5), i.e., all $w_k > 0$.

Although $q \in S$ does not necessarily belong to the convex hull of the basis vectors in general, some applications, such as the mixture models explained in Sec. 4.1, require this assumption. We will discuss this restriction in Sec. 6.3.

Figure 1: The $\nabla$-projection $q^*$ from a point $q$ to a $\tilde{\nabla}$-autoparallel manifold $\tilde{M}$ spanned by $\{p_k\}$, where $\hat{q}$ is a current estimate of $q^*$. The value $\gamma_k$ defined in (14) represents the deviation from the Pythagorean relation, i.e., $\gamma_k = 0$ iff $\hat{q} = q^*$, while $\gamma_k < 0$ means $\hat{q}$ is closer to $p_k$ than $q^*$ and $\gamma_k > 0$ means $\hat{q}$ is farther from $p_k$.

Suppose a point $q \in S$ and a $\tilde{\nabla}$-autoparallel submanifold $\tilde{M} \subseteq S$ are given, let $\hat{q} \in \tilde{M}$ be a current estimate of the projection $q^* \in \tilde{M}$ (Fig. 1), and let us define the quantity $\gamma_k$,

$$\gamma_k = D(\hat{q}, q) + D(p_k, \hat{q}) - D(p_k, q). \tag{14}$$

From Eq. (8), $\gamma_k = 0$ if and only if $\hat{q} = q^*$. If $\gamma_k < 0$, $\hat{q}$ is closer to $p_k$ than $q^*$, and $w_k$ should be decreased. On the other hand, if $\gamma_k > 0$, $\hat{q}$ is farther from $p_k$ than $q^*$, and $w_k$ should be increased.

From the consideration above, we construct Algorithm 1, which finds the $\nabla$-projection by optimizing the weights $\{w_k\}_{k=1,\dots,K}$ so that $\hat{q}$ satisfies the Pythagorean relation (8).

Algorithm 1 Geometrical Algorithm A($K$)
  Initialize $\{w^{(0)}_k\}_{k=1,\dots,K}$ s.t. $\sum_{k=1}^K w^{(0)}_k = 1$, $w^{(0)}_k > 0$, $t := 0$
  repeat
    Calculate $\gamma_k$ by (14) for $k = 1, \dots, K$, where $\eta(\hat{q}) = \sum_{k=1}^K w^{(t)}_k \eta(p_k)$
    Update $w_k$, $k = 1, \dots, K$ by
      $$w'_k = w^{(t)}_k f(\gamma_k) \tag{15}$$
    Normalize $w'_k$, $k = 1, \dots, K$ by
      $$w^{(t+1)}_k = \frac{w'_k}{\sum_{k=1}^K w'_k} \tag{16}$$
    $t := t + 1$
  until Stopping criterion is satisfied
  return $w$

In the algorithm, the function $f(\gamma)$ is a positive, strictly monotonically increasing function with $f(0) = 1$, which is introduced in order to stabilize the algorithm; a typical choice of $f$ is a sigmoidal function,

$$f(\gamma) = \frac{2}{1 + \exp(-\beta\gamma)}, \quad \beta > 0. \tag{17}$$

The parameter $\beta$ controls the learning speed and is related to the convergence characteristics of the algorithm. Algorithm A($K$) in the case that the m-connection is taken as the $\nabla$-connection was first introduced by Takano et al. [15] in order to estimate a nonparametric e-mixture distribution. The main contribution of this paper is to clarify the relation between the function $f$ and the convergence property. In later sections, we prove that Algorithm A(2) (and also Algorithm A($K$)) is locally stable if the derivative of $f$ at the origin is smaller than a certain bound.
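Because the algorithm touches the model only through divergence values, it is a few lines of code once $D$ can be evaluated. The following Python sketch runs Algorithm A($K$) for the discrete m-mixture setting of Sec. 4.1, where the e-projection uses the KL divergence (13) and $\eta(\hat{q})$ is the mixture $\sum_k w_k p_k$; the sigmoid (17) is used for $f$. The target $q$, the basis distributions and $\beta$ are arbitrary choices for this demonstration, not prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return np.sum(p * np.log(p / q))

def algorithm_A(q, basis, beta=0.5, n_iter=200):
    """Algorithm A(K): e-projection of q onto the m-mixture spanned by `basis`.

    basis: (K, d) array of component distributions p_k (rows sum to 1).
    Only divergence values are used; no derivatives of D are needed.
    """
    K = basis.shape[0]
    w = np.full(K, 1.0 / K)                         # w^(0): positive, sums to 1
    f = lambda g: 2.0 / (1.0 + np.exp(-beta * g))   # sigmoid (17), f(0) = 1
    for _ in range(n_iter):
        q_hat = w @ basis                           # eta(q_hat) = sum_k w_k eta(p_k)
        gamma = np.array([kl(q_hat, q) + kl(p_k, q_hat) - kl(p_k, q)
                          for p_k in basis])        # deviation from Pythagoras, Eq. (14)
        w = w * f(gamma)                            # update (15)
        w /= w.sum()                                # normalization (16)
    return w

basis = rng.dirichlet(np.ones(10), size=3)
w_true = np.array([0.5, 0.3, 0.2])
q = w_true @ basis              # q lies in the model, so the projection recovers w_true
print(np.round(algorithm_A(q, basis), 4))   # approaches w_true
```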
For the later theoretical analysis, we show the following lemma.

Lemma 3. The value $\gamma_k$ in Algorithm A($K$) is given by

$$\gamma_k = \sum_{i=1}^d \big(\theta_i(\hat{q}) - \theta_i(q^*)\big)\big(\eta_i(\hat{q}) - \eta_i(p_k)\big), \tag{18}$$

which means that $\gamma_k$ depends only on points on $\tilde{M}$, once the true projection $q^*$ is known.

Proof.
For any $p, r \in \tilde{M}$ and $q \in S$, let us define

$$\begin{aligned} \gamma(p, q, r) &= D(r, q) + D(p, r) - D(p, q) \\ &= -\varphi(\eta(q)) + \varphi(\eta(r)) + \sum_{i=1}^d \theta_i(q)\big(\eta_i(q) - \eta_i(r)\big) \\ &\quad - \varphi(\eta(r)) + \varphi(\eta(p)) + \sum_{i=1}^d \theta_i(r)\big(\eta_i(r) - \eta_i(p)\big) \\ &\quad - \left\{ -\varphi(\eta(q)) + \varphi(\eta(p)) + \sum_{i=1}^d \theta_i(q)\big(\eta_i(q) - \eta_i(p)\big) \right\} \\ &= \sum_{i=1}^d \big(\theta_i(r) - \theta_i(q)\big)\big(\eta_i(r) - \eta_i(p)\big). \end{aligned} \tag{19}$$

The value $\gamma_k$ is given by

$$\begin{aligned} \gamma_k = \gamma(p_k, q, \hat{q}) &= \sum_{i=1}^d \big(\theta_i(\hat{q}) - \theta_i(q)\big)\big(\eta_i(\hat{q}) - \eta_i(p_k)\big) \\ &= \sum_{i=1}^d \big(\theta_i(\hat{q}) - \theta_i(q^*) + \theta_i(q^*) - \theta_i(q)\big)\big(\eta_i(\hat{q}) - \eta_i(p_k)\big) \\ &= \sum_{i=1}^d \big(\theta_i(\hat{q}) - \theta_i(q^*)\big)\big(\eta_i(\hat{q}) - \eta_i(p_k)\big) + \sum_{i=1}^d \big(\theta_i(q^*) - \theta_i(q)\big)\big(\eta_i(\hat{q}) - \eta_i(q^*) + \eta_i(q^*) - \eta_i(p_k)\big) \\ &= \sum_{i=1}^d \big(\theta_i(\hat{q}) - \theta_i(q^*)\big)\big(\eta_i(\hat{q}) - \eta_i(p_k)\big) - \gamma(\hat{q}, q, q^*) + \gamma(p_k, q, q^*). \end{aligned} \tag{20}$$

From the Pythagorean theorem,

$$\gamma(\hat{q}, q, q^*) = \gamma(p_k, q, q^*) = 0, \tag{21}$$

and thus $\gamma_k$ reduces to (18). $\square$

3 Convergence analysis for $K = 2$

We start the analysis from the simplest case of $K = 2$; as shown later, the case of general $K$ reduces to this case. From (18), $\gamma_k$ depends only on points on $\tilde{M}$, and if $K = 2$, $\tilde{M}$ is just a one-dimensional straight line in the $\eta$-coordinate.

3.1 Behavior of $\gamma_k$ under a small perturbation

In order to derive the condition for convergence, we examine the behavior of $\gamma_k$ for a small perturbation. The weight value $w$ can be regarded as a $\tilde{\nabla}$-coordinate of $\tilde{M}$; let $u$ be the $\nabla$-coordinate that is dual to $w$. Let $w^*$ be the value of $w$ at the projection point $q^*$, and suppose the current estimate $\hat{w}_k\ (= w^{(t)}_k)$ is perturbed slightly from $w^*_k$,

$$\hat{w}_1 = w^* + \epsilon, \quad \hat{w}_2 = w^*_2 - \epsilon = (1 - w^*) - \epsilon. \tag{22}$$

Then from (18), the value $\gamma_1$ is given by

$$\gamma_1 = \big(w(\hat{q}) - w(p_1)\big)\big(u(\hat{q}) - u(q^*)\big) = (w^* + \epsilon - 1)\, \Delta u, \tag{23}$$

where $w(q^*) = w^*$ and $w(p_1) = 1$ are the values of $w$ at $q^*$ and $p_1$ respectively, and

$$\Delta u = u(\hat{q}) - u(q^*). \tag{24}$$

When $\epsilon$ is small, $\Delta u$ can be expanded up to the first order of $\epsilon$,

$$\Delta u = g(w^*)\, \Delta w + o(\epsilon), \tag{25}$$

where

$$\Delta w = \hat{w}_1 - w^* = \epsilon, \tag{26}$$

and $g(w)$ is the Jacobian, which is equal to the Riemannian metric

$$g(w) = \frac{\partial u}{\partial w}, \tag{27}$$

and is also obtained by

$$g(w) = E_w\left[\left(\frac{\partial \log p(x; w)}{\partial w}\right)^2\right] = -E_w\left[\frac{\partial^2 \log p(x; w)}{\partial w^2}\right]. \tag{28}$$

As a result, we have

$$\gamma_1 = g(w^*)(w^* - 1)\, \epsilon + o(\epsilon). \tag{29}$$

Similarly,

$$\gamma_2 = g(w^*)\, w^* \epsilon + o(\epsilon). \tag{30}$$

3.2 Local stability of Algorithm A(2)

In this section, we show the condition for the local convergence of Algorithm A(2). We say the algorithm is locally stable when a sufficiently small perturbation from the optimal solution is decreased by the algorithm.
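The expansions (29)-(30) are easy to verify numerically in the discrete m-mixture case of Sec. 4.1, where $g(w)$ has the closed form (52) derived below. A minimal sketch (the component distributions and $w^*$ are arbitrary test values, and $q$ is placed in the model so that $q^* = q$):

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p1, p2 = rng.dirichlet(np.ones(8), size=2)
w_star, eps = 0.4, 1e-4
q = w_star * p1 + (1 - w_star) * p2       # q is in the model, so q* = q
g = np.sum((p1 - p2) ** 2 / q)            # Fisher metric at w*, cf. Eq. (52)

q_hat = (w_star + eps) * p1 + (1 - w_star - eps) * p2
gamma1 = kl(q_hat, q) + kl(p1, q_hat) - kl(p1, q)   # Eq. (14), k = 1
gamma2 = kl(q_hat, q) + kl(p2, q_hat) - kl(p2, q)   # Eq. (14), k = 2

print(gamma1, g * (w_star - 1) * eps)     # agree to first order, Eq. (29)
print(gamma2, g * w_star * eps)           # agree to first order, Eq. (30)
```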
Theorem 4.
Algorithm A(2) is locally stable when

$$\frac{df(0)}{d\gamma} < \frac{2}{w^*(1 - w^*)\, g(w^*)}, \tag{31}$$

where $w^*$ is the optimal weight.

Proof.
By Algorithm A(2), the weight $w_1$ is updated as

$$w'_1 = \hat{w}_1 f(\gamma_1), \tag{32}$$

and its first-order expansion is given from Eq. (29) by

$$w'_1 = \hat{w}_1\left(1 + \frac{df(0)}{d\gamma}\gamma_1\right) + o(\epsilon) = w^* + \epsilon + w^* \frac{df(0)}{d\gamma}\, g(w^*)(w^* - 1)\, \epsilon + o(\epsilon), \tag{33}$$

and for $w_2$,

$$w'_2 = \hat{w}_2 f(\gamma_2) = 1 - w^* - \epsilon + (1 - w^*)\, g(w^*)\, w^* \frac{df(0)}{d\gamma}\, \epsilon + o(\epsilon). \tag{34}$$

We see that $w'_1 + w'_2 = 1 + o(\epsilon)$, thus the normalization procedure is negligible up to the first order of $\epsilon$. The condition that $q^*$ is a stable point of the algorithm is

$$|w'_1 - w^*| < |\hat{w}_1 - w^*| = |\epsilon|. \tag{35}$$

From Eq. (33), it is

$$\left|\epsilon + w^* \frac{df(0)}{d\gamma}\, g(w^*)(w^* - 1)\, \epsilon\right| < |\epsilon|, \tag{36}$$

which is equivalent to

$$0 < w^* \frac{df(0)}{d\gamma}\, g(w^*)(1 - w^*) < 2, \tag{37}$$

and then we have

$$\frac{df(0)}{d\gamma} < \frac{2}{w^*(1 - w^*)\, g(w^*)}. \tag{38}$$

$\square$

Since the true value $q^*$ is not known when the algorithm is applied, there are two approaches. One is to approximate $w^*$ by the current estimate $\hat{w}$ and adaptively change the derivative of $f$, which will be examined in Sec. 6.2. The other approach is to use a bound that is independent of $w^*$, which is available in some special cases.
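A sketch of the first approach for the discrete m-mixture case: plug the current estimate $\hat{w}$ into (31) in place of $w^*$, estimate $g$ by the closed form (52) derived in Sec. 4.1, and keep a safety margin. For the sigmoid (17) one has $f'(0) = \beta/2$, so a bound on $f'(0)$ translates into a bound on $\beta$. The safety factor below is an arbitrary choice, not prescribed by the paper.

```python
import numpy as np

def adaptive_beta(w_hat, p1, p2, safety=0.5):
    """Plug-in choice of beta for the sigmoid f of Eq. (17), where f'(0) = beta / 2."""
    q_hat = w_hat * p1 + (1 - w_hat) * p2
    g_hat = np.sum((p1 - p2) ** 2 / q_hat)              # plug-in Fisher metric, Eq. (52)
    fprime_bound = 2.0 / (w_hat * (1 - w_hat) * g_hat)  # Eq. (31) with w* -> w_hat
    return safety * 2.0 * fprime_bound                  # beta = 2 f'(0)
```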
Corollary 5. Algorithm A(2) is locally stable when

$$\frac{df(0)}{d\gamma} < \frac{2}{\sup_w w(1 - w)\, g(w)}, \tag{39}$$

where we denote $w = w_1$ for simplicity.

4 Bounds for the discrete distribution

In the following subsections, we give specific forms of the bound on $df(0)/d\gamma$ in Eq. (39), both for the e-projection and for the m-projection, by considering a discrete distribution as a specific case. The discrete distribution is given by

$$q(x) = \sum_{i=1}^d q_i \delta_i(x), \quad x \in \{1, 2, \dots, d\}, \tag{40}$$

$$\sum_{i=1}^d q_i = 1, \quad q_i \geq 0, \tag{41}$$

where $\delta_i(x) = 1$ when $x = i$ and $\delta_i(x) = 0$ otherwise. We see that the discrete distribution belongs to the exponential family as follows:

$$\begin{aligned} q(x) &= \exp\left(\sum_{i=1}^d (\log q_i)\, \delta_i(x)\right) \\ &= \exp\left(\sum_{i=1}^{d-1} (\log q_i)\, \delta_i(x) + (\log q_d)\left(1 - \sum_{i=1}^{d-1} \delta_i(x)\right)\right) \\ &= \exp\left(\sum_{i=1}^{d-1} \log\frac{q_i}{q_d}\, \delta_i(x) + \log q_d\right), \end{aligned} \tag{42}$$

where we have $d - 1$ free parameters $q_1, \dots, q_{d-1}$, and the one dependent parameter $q_d$ is given by $q_d = 1 - \sum_{i=1}^{d-1} q_i$. By taking the e-connection as the $\nabla$-connection, $q(x)$ takes the same form as Eq. (2) by regarding

$$F_i(x) = \delta_i(x), \quad \xi_i = \log\frac{q_i}{q_d}, \quad b(\xi) = -\log q_d, \quad i = 1, \dots, d - 1. \tag{43}$$

The dual coordinate $\zeta_i$ is given by

$$\zeta_i = E_{q(x)}[F_i(x)] = q_i, \quad i = 1, \dots, d - 1. \tag{44}$$

The basis vectors in $S$ are denoted by

$$p_k(x) = \sum_{i=1}^d p_{ki}\, \delta_i(x), \quad \sum_{i=1}^d p_{ki} = 1, \quad p_{ki} \geq 0. \tag{45}$$

4.1 e-projection onto an m-autoparallel submanifold (m-mixture)

First, we take the e-connection as the $\nabla$-connection; then the $\nabla$-projection onto the $\tilde{\nabla}$-autoparallel submanifold is the e-projection onto the m-autoparallel submanifold. The m-autoparallel submanifold spanned by $p_k(x)$ is the set of points whose m-coordinate (44) is given by

$$\zeta_i = \sum_{k=1}^K w_k p_{ki}, \quad i = 1, \dots, d - 1. \tag{46}$$

Since $\zeta_i$ is the probability value, this is equivalent to the mixture distribution of $\{p_k(x)\}$,

$$p(x; w) = \sum_{k=1}^K w_k p_k(x), \quad \sum_{k=1}^K w_k = 1, \tag{47}$$

where $w_k$ is usually assumed to be positive, which matches Assumption 2.

The mixture distribution has a lot of applications, in which a complicated distribution is decomposed into a sum of simple component distributions. An important application in the discrete distribution case is nonnegative matrix factorization (NMF) [10], where a matrix $X$ with nonnegative components is approximated by

$$X \simeq DC, \tag{48}$$

where $D$ and $C$ are also matrices with nonnegative components. Let $\Pi$ be the normalization operator by which the sum of each column's components becomes 1. It is known [5] that if $X = DC$, there exist $P$ and $W$ such that

$$\Pi(X) = PW, \tag{49}$$

where $P$ and $W$ are matrices with nonnegative components whose columns each sum to 1. This means that a set of probability distributions is approximated by a mixture of factor distributions. In NMF, $D$ and $C$ are optimized alternately, fixing one while updating the other, and each optimization problem can be regarded as an e-projection onto an m-autoparallel submanifold.

Note that we consider the e-projection onto an m-autoparallel submanifold in this paper, since it is natural from the generalized Pythagorean relation. However, many learning algorithms are formulated as maximum likelihood estimation, which is equivalent to the m-projection and differs from the e-projection in that the arguments of the divergence are reversed. For the discrete distribution case, the m-projection onto the m-autoparallel submanifold has a unique solution, but this does not hold in general.

Now we give a sufficient condition for convergence of the e-projection onto the m-autoparallel submanifold.

Proposition 6. Algorithm A(2) for the e-projection onto an m-autoparallel submanifold of discrete distributions is locally stable if

$$\frac{df(0)}{d\gamma} < \frac{2}{\sum_i (\sqrt{p_{1i}} - \sqrt{p_{2i}})^2}, \tag{50}$$

where the right hand side has the constant lower bound 1.
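Before turning to the proof, note how (50) reads in practice for the sigmoid (17), for which $f'(0) = \beta/2$: local stability is guaranteed whenever $\beta < 4/\sum_i(\sqrt{p_{1i}} - \sqrt{p_{2i}})^2$, and since the denominator never exceeds 2, the conservative choice $\beta < 2$ is always safe. A one-function sketch:

```python
import numpy as np

def beta_bound_e_projection(p1, p2):
    """Largest safe beta from Proposition 6, using f'(0) = beta / 2 for Eq. (17)."""
    h2 = np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2)   # denominator of (50), at most 2
    return 4.0 / h2                                  # any beta below this is stable
```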
Proof. The m-autoparallel model spanned by $K = 2$ points can be written as

$$p(x; w) = w\, p_1(x) + (1 - w)\, p_2(x). \tag{51}$$

The Riemannian metric at $p(x; w)$ is given by

$$g(w) = E_w\left[\left(\frac{\partial \log p(x; w)}{\partial w}\right)^2\right] = \sum_{x=1}^d \frac{1}{p(x; w)}\left(\frac{\partial p(x; w)}{\partial w}\right)^2 = \sum_{x=1}^d \frac{(p_1(x) - p_2(x))^2}{p(x; w)} = \sum_{i=1}^d \frac{(p_{1i} - p_{2i})^2}{w p_{1i} + (1 - w) p_{2i}}. \tag{52}$$

The denominator of the right hand side of Eq. (39) is

$$\sup_w\, w(1 - w) \sum_{i=1}^d \frac{(p_{1i} - p_{2i})^2}{w p_{1i} + (1 - w) p_{2i}}. \tag{53}$$

The $i$-th term

$$w(1 - w)\, \frac{(p_{1i} - p_{2i})^2}{w p_{1i} + (1 - w) p_{2i}} \tag{54}$$

attains its maximum value $(\sqrt{p_{1i}} - \sqrt{p_{2i}})^2$ at $w = \sqrt{p_{2i}}/(\sqrt{p_{1i}} + \sqrt{p_{2i}})$, so Eq. (53) is bounded from above by

$$\sum_i (\sqrt{p_{1i}} - \sqrt{p_{2i}})^2, \tag{55}$$

which is the squared Hellinger distance between $p_1(x)$ and $p_2(x)$ (up to a constant factor). We thus obtain the sufficient condition for local stability,

$$\frac{df(0)}{d\gamma} < \frac{2}{\sum_i (\sqrt{p_{1i}} - \sqrt{p_{2i}})^2}, \tag{56}$$

and since the sum in the denominator is at most 2, the right hand side has the constant lower bound 1. $\square$

4.2 m-projection onto an e-autoparallel submanifold (e-mixture)

In this subsection, we take the m-connection as the $\nabla$-connection; then the $\nabla$-projection onto the $\tilde{\nabla}$-autoparallel submanifold is the m-projection onto the e-autoparallel submanifold. The e-autoparallel submanifold spanned by $p_k(x)$ is the set of points whose e-coordinate (43) is given by

$$\xi_i = \sum_{k=1}^K w_k \log\frac{p_{ki}}{p_{kd}}, \quad i = 1, \dots, d - 1. \tag{57}$$

Since $\xi_i = \log(q_i/q_d)$, this is equivalent to the model specified by

$$p(x; w) \propto \exp\left(\sum_{k=1}^K w_k \log p_k(x)\right), \quad \sum_{k=1}^K w_k = 1, \tag{58}$$

which is a different type of mixture, a log-linear mixture. We call this type of mixture an e-mixture, while the mixture specified by Eq. (47) is called an m-mixture. Although the e-mixture has not been studied as intensively as the m-mixture, it has several good properties, such as the maximum entropy principle. Takano et al. [15] proposed a nonparametric extension of the e-mixture and a learning algorithm based on the geometrical algorithm that is generalized in this paper. In nonparametric e-mixture estimation, the basis distributions are expressed by empirical distributions (i.e., sums of delta functions), so the e-mixture of the basis distributions cannot be defined mathematically; instead, it is defined by the geometrical characteristics of the e-mixture [11]. Therefore, it is not possible to obtain the coordinate explicitly. The geometrical algorithm is coordinate-free and only requires the calculation of divergences, which can be estimated by nonparametric entropy estimation [8, 7]. This is a strong motivation for proposing the geometrical algorithm.

Here we give a sufficient condition for convergence of the m-projection onto the e-autoparallel submanifold.

Proposition 7. Algorithm A(2) for the m-projection onto an e-autoparallel submanifold of discrete distributions is locally stable if

$$\frac{df(0)}{d\gamma} \leq \frac{32}{\left(\max_i \log\frac{p_{1i}}{p_{2i}} - \min_i \log\frac{p_{1i}}{p_{2i}}\right)^2}. \tag{59}$$

The right hand side does not have a constant lower bound, unlike the e-projection case, and it is left as an open problem whether any constant bound exists.
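As with Proposition 6, the bound (59) is directly computable; for the sigmoid (17) with $f'(0) = \beta/2$ it becomes $\beta \leq 64/(\max_i \log(p_{1i}/p_{2i}) - \min_i \log(p_{1i}/p_{2i}))^2$. A sketch:

```python
import numpy as np

def beta_bound_m_projection(p1, p2):
    """Safe beta from Proposition 7, using f'(0) = beta / 2 for Eq. (17)."""
    a = np.log(p1 / p2)            # log-ratios a_i of Eq. (64)
    r = a.max() - a.min()          # their range, as in Eq. (59)
    return 64.0 / r ** 2
```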
Proof. The e-autoparallel model for $K = 2$ is written as

$$p(x; w) = \frac{1}{Z(w)} \exp\big(w \log p_1(x) + (1 - w) \log p_2(x)\big), \tag{60}$$

where $w$ is an e-coordinate and $Z(w)$ is the normalization constant

$$Z(w) = \sum_{x=1}^d \exp\big(w \log p_1(x) + (1 - w) \log p_2(x)\big). \tag{61}$$

Since for the discrete distribution

$$\log p_k(x) = \sum_{i=1}^d \log p_{ki}\, \delta_i(x), \tag{62}$$

$p(x; w)$ is written as

$$p(x; w) = \frac{1}{Z(w)} \exp\left(\sum_{i=1}^d \big(w \log p_{1i} + (1 - w) \log p_{2i}\big)\, \delta_i(x)\right) = \frac{1}{Z(w)} \exp\left(\sum_{i=1}^d (a_i w + b_i)\, \delta_i(x)\right), \tag{63}$$

where

$$a_i = \log\frac{p_{1i}}{p_{2i}}, \quad b_i = \log p_{2i}, \tag{64}$$

$$Z(w) = \sum_{i=1}^d c_i(w), \quad c_i(w) = \exp(a_i w + b_i). \tag{65}$$

Note that $p(i; w) = c_i(w)/Z(w)$. The Fisher information for this model can be calculated as

$$g(w) = -E_w\left[\frac{\partial^2 \log p(x; w)}{\partial w^2}\right] = \frac{1}{Z(w)}\frac{\partial^2 Z(w)}{\partial w^2} - \left(\frac{1}{Z(w)}\frac{\partial Z(w)}{\partial w}\right)^2 = \sum_{i=1}^d a_i^2 \frac{c_i(w)}{Z(w)} - \left(\sum_{i=1}^d a_i \frac{c_i(w)}{Z(w)}\right)^2 = \sum_{i=1}^d a_i^2\, p(i; w) - \left(\sum_{i=1}^d a_i\, p(i; w)\right)^2. \tag{66}$$

The last formula is the variance of $a_i$ with respect to the probability weights $p(i; w)$. From Popoviciu's inequality on variances [14], $g(w)$ has an upper bound that is independent of $w$,

$$g(w) \leq \frac{1}{4}\left(\max_i a_i - \min_i a_i\right)^2. \tag{67}$$

Since $w(1 - w) \leq 1/4$, we obtain the inequality (59) of the Proposition from Eq. (39). $\square$

5 Analysis for general $K$

We proceed to the general case, which includes $K \geq 2$. First we present the main theorem.
Theorem 8.
Let $w^*_k$, $k = 1, \dots, K$, be the optimal parameters. If the function $f$ satisfies

$$\frac{df(0)}{d\gamma} < \frac{2}{K \max_k w^*_k (1 - w^*_k)\, g(w^*_k)}, \tag{68}$$

Algorithm A($K$) is locally stable.

The proof is in the appendix. The basic strategy of the proof is to show the equivalence between Algorithm A($K$) and a component-wise update algorithm based on Algorithm A(2). In the process of the proof, a possible refinement of Algorithm A($K$) is also suggested.

6 Discussion

In this section, we discuss several points related to the proposed framework: (1) the relation to the gradient descent method, (2) a possible refinement of the algorithm, and (3) the assumption of positivity.

6.1 Relation to gradient descent method

In general optimization problems, gradient descent is a simple way to solve the problem. Here, we show that the updates of gradient descent and the proposed algorithm are linearly related. The parameters $\{w_k\}$ should satisfy the constraint $\sum_{k=1}^K w_k = 1$. We first replace $w_K$ by $1 - \sum_{k=1}^{K-1} w_k$, then update $w_k$ for $k = 1, \dots, K - 1$ by

$$w'_k = w_k - \lambda \frac{\partial D(\hat{q}, q)}{\partial w_k}, \tag{69}$$

and $w'_K$ is obtained as $1 - \sum_{k=1}^{K-1} w'_k$. The gradient of $D(\hat{q}, q)$ with respect to $w_k$ is given by

$$\begin{aligned} \frac{\partial D(\hat{q}, q)}{\partial w_k} &= \frac{\partial}{\partial w_k}\left[-\varphi(\eta(q)) + \varphi(\eta(\hat{q})) + \sum_{i=1}^d \theta_i(q)\big(\eta_i(q) - \eta_i(\hat{q})\big)\right] \\ &= \frac{\partial}{\partial w_k}\left[\varphi(\eta(\hat{q})) - \sum_{i=1}^d \theta_i(q)\, \eta_i(\hat{q})\right] \\ &= \sum_{i=1}^d \frac{\partial \eta_i(\hat{q})}{\partial w_k} \frac{\partial}{\partial \eta_i(\hat{q})}\left[\varphi(\eta(\hat{q})) - \sum_{i'=1}^d \theta_{i'}(q)\, \eta_{i'}(\hat{q})\right]. \end{aligned} \tag{70}$$

Since

$$\eta(\hat{q}) = \sum_{k=1}^K w_k \eta(p_k) = \sum_{k=1}^{K-1} w_k \big(\eta(p_k) - \eta(p_K)\big) + \eta(p_K) \tag{71}$$

and $\partial \varphi(\eta)/\partial \eta_i = \theta_i$, we have

$$\frac{\partial D(\hat{q}, q)}{\partial w_k} = \sum_{i=1}^d \big(\theta_i(\hat{q}) - \theta_i(q)\big)\big(\eta_i(p_k) - \eta_i(p_K)\big), \tag{72}$$

which can be represented, from Lemma 3, using $\gamma_k$:

$$\frac{\partial D(\hat{q}, q)}{\partial w_k} = \gamma_K - \gamma_k. \tag{73}$$

The amount of update by gradient descent is

$$\Delta w^G_k = -\lambda \frac{\partial D(\hat{q}, q)}{\partial w_k} = \lambda(\gamma_k - \gamma_K), \quad k = 1, \dots, K - 1, \tag{74}$$

and

$$\Delta w^G_K = -\sum_{k=1}^{K-1} \Delta w^G_k = \lambda \sum_{k=1}^{K-1} (\gamma_K - \gamma_k). \tag{75}$$

On the other hand, the amount of update by Algorithm A($K$) is approximated for small $\gamma_k$ by

$$\Delta w^A_k = w'_k - w_k = w_k\big(f(\gamma_k) - 1\big) \simeq w_k \frac{df(0)}{d\gamma}\, \gamma_k. \tag{76}$$

Since $\sum_k w_k \gamma_k = 0$, no further normalization is necessary. Comparing Eq. (74) and Eq. (75) with Eq. (76), we see that $\Delta w^G_k$ and $\Delta w^A_k$ are linearly related. Unlike the gradient descent method, the proposed framework does not need explicit calculation of the coordinate.

6.2 Possible refinement of the algorithm

As explained in Sec. 3.2, the condition for convergence depends on the true parameter, so one approach is to adaptively change the derivative of $f$, replacing the true parameter by its estimate. This approach also requires estimating the Fisher information. Another possibility for refining the algorithm is based on the analysis in the appendix, where it is shown that Algorithm A($K$) is equivalent to a slower version of the component-wise update algorithm; more specifically, the amount of the update $\Delta w_k$ is smaller by the factor $1 - w_k$. Therefore, the update rule $w'_k = w_k f(\gamma_k)$ in Algorithm A($K$) can be replaced by

$$w'_k = w_k f\left(\frac{\gamma_k}{1 - w_k}\right), \tag{77}$$

which does not change the condition of the convergence.
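In code, the refinement is a one-line change to the update step of Algorithm A($K$); a sketch with the same sigmoid $f$ as in (17):

```python
import numpy as np

def update_plain(w, gamma, f):
    w_new = w * f(gamma)                 # update (15)
    return w_new / w_new.sum()           # normalization (16)

def update_refined(w, gamma, f):
    w_new = w * f(gamma / (1.0 - w))     # Eq. (77): undo the 1 - w_k slowdown
    return w_new / w_new.sum()
```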
6.3 The positivity assumption

In Sec. 2.2, we assumed that the projection lies in the convex hull $P \subset \tilde{M}$ spanned by the basis vectors. In general, however, the projection point can be outside $P$. In such a case, we generalize the problem to finding the point of $P$ that minimizes the divergence,

$$q^* = \arg\min_{p \in P} D(p, q). \tag{78}$$

When the projection point is outside $P$, the solution $q^*$ of this problem lies on the boundary of $P$, and the $\nabla$-geodesic connecting $q$ and $q^*$ is no longer orthogonal to $\tilde{M}$. The proposed algorithm itself works even in this case, because the boundary is again a convex hull of a subset of the basis vectors. However, we have to be careful about one thing: once a certain $w_k$ becomes 0, it can never take a positive value again, which means that if the current estimate reaches a part of the boundary that does not include the optimal solution, the estimator cannot escape from that boundary.

Without the assumption $w_i > 0$, the $\nabla$-projection of $q$ onto a dual autoparallel submanifold $M$ always exists uniquely, and such a formulation is studied in the e-PCA and m-PCA framework [1], or as exponential family PCA in a special case [4]. However, the algorithm proposed in this paper cannot be applied as it is, because it is derived under the assumption. One method of updating in this general case is as follows: $w_k$ should be increased for positive $\gamma_k$ and decreased for negative $\gamma_k$, while $\sum_k w_k = 1$ should be preserved. Therefore, let $\mathcal{K}_P$ be the set of indices with positive $\gamma_k$ and $\mathcal{K}_N$ be the set of indices with negative $\gamma_k$. Then

$$w'_k = w_k + \epsilon \frac{\gamma_k}{\sum_{l \in \mathcal{K}_P} \gamma_l}, \quad k \in \mathcal{K}_P, \qquad w'_k = w_k - \epsilon \frac{\gamma_k}{\sum_{l \in \mathcal{K}_N} \gamma_l}, \quad k \in \mathcal{K}_N, \tag{79}$$

for a learning constant $\epsilon$. There are several variations of such an update, and we also have to take care that the update does not take $w_k$ out of its domain. The investigation of the convergence property of the modified algorithm is left as future work.

7 Conclusion

We proposed a geometrical projection algorithm that only requires the calculation of divergences, and we showed a condition for the local stability of the algorithm. There are various applications in machine learning and related areas in which the projection onto an autoparallel submanifold is needed; these are left as future works.

Acknowledgement

This work was supported by JSPS KAKENHI Grant Numbers 17H01793, 19K12111.
References

[1] Shotaro Akaho. The e-PCA and m-PCA: Dimension reduction of parameters by information geometry. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, volume 1, pages 129-134. IEEE, 2004.
[2] Shunichi Amari. Information Geometry and Its Applications, volume 194. Springer, 2016.
[3] Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. Information Geometry, volume 64. Springer, 2017.
[4] Michael Collins, Sanjoy Dasgupta, and Robert E. Schapire. A generalization of principal component analysis to the exponential family. In NIPS, volume 13, page 23, 2001.
[5] Bo Dong, Matthew M. Lin, and Moody T. Chu. Nonnegative rank factorization: a heuristic approach via rank reduction. Numerical Algorithms, 65(2):251-274, 2014.
[6] Akio Fujiwara and Shunichi Amari. Gradient systems in view of information geometry. Physica D: Nonlinear Phenomena, 80(3):317-327, 1995.
[7] Hideitsu Hino, Shotaro Akaho, and Noboru Murata. An entropy estimator based on polynomial regression with Poisson error structure. In Neural Information Processing, 23rd International Conference, ICONIP 2016, Kyoto, Japan, October 16-21, 2016, Proceedings, Part II, pages 11-19, 2016.
[8] Hideitsu Hino, Kensuke Koshijima, and Noboru Murata. Non-parametric entropy estimators based on simple linear regression. Computational Statistics & Data Analysis, 89:72-84, 2015.
[9] Guy Lebanon. Riemannian Geometry and Statistical Machine Learning. LAP LAMBERT Academic Publishing, 2015.
[10] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999.
[11] Noboru Murata and Yu Fujimoto. Bregman divergence and density integration. Journal of Math-for-Industry (JMI), 1(B):97-104, 2009.
[12] Hiroshi Nagaoka and Shunichi Amari. Differential geometry of smooth families of probability distributions. Technical Report METR 82-7, University of Tokyo, 1982.
[13] Giovanni Pistone. Nonparametric information geometry. In Geometric Science of Information, pages 5-36. Springer, 2013.
[14] Tiberiu Popoviciu. Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica (Cluj), 9:129-145, 1935.
[15] Ken Takano, Hideitsu Hino, Shotaro Akaho, and Noboru Murata. Nonparametric e-mixture estimation. Neural Computation, 28(12):2687-2725, 2016.

Figure 2: Relation of the algorithms.

Figure 3: Component-wise algorithm.

A Proof of Theorem 8
A.1 Component-wise algorithm
Let $p^\dagger_k$ be the point where the extension of the line from $p_k$ through $\hat{q}$ intersects the boundary of $\tilde{M}$ (Fig. 3). Let us introduce the notation $w_k(p)$ to specify a point $p$ of $\tilde{M}$,

$$\eta(p) = \sum_{k=1}^K w_k(p)\, \eta(p_k); \tag{80}$$

for instance, $w_l(p_k) = \delta_k(l)$ and $w_k(\hat{q}) = \hat{w}_k$. Since $\eta(p_k)$, $\eta(\hat{q})$ and $\eta(p^\dagger_k)$ are on the same line, by taking an appropriate $\rho$,

$$\eta(p^\dagger_k) = (1 - \rho)\, \eta(p_k) + \rho\, \eta(\hat{q}), \tag{81}$$

or equivalently,

$$w_l(p^\dagger_k) = (1 - \rho)\, w_l(p_k) + \rho\, w_l(\hat{q}). \tag{82}$$

For $p^\dagger_k$, the weight $w_k(p^\dagger_k)$ should be zero,

$$(1 - \rho)\, w_k(p_k) + \rho\, \hat{w}_k = 0, \tag{83}$$

that is,

$$\rho = \frac{1}{1 - \hat{w}_k}, \tag{84}$$

and the point $p^\dagger_k$ is given by

$$\eta(p^\dagger_k) = \frac{1}{1 - \hat{w}_k} \sum_{l \neq k} \hat{w}_l\, \eta(p_l). \tag{85}$$

We can consider the component-wise update for $K = 2$ by using $p_k$ and $p^\dagger_k$.

Algorithm 2 Geometrical Algorithm B($K$, $L$): component-wise
  Initialize $\{w_k\}_{k=1,\dots,K}$ s.t. $\sum_{k=1}^K w_k = 1$, $w_k > 0$
  repeat
    for $k = 1, \dots, K$ do
      (a) Find $p^\dagger_k$
      for count $= 1$ to $L$ do
        (b) Update $w_k$ by Algorithm A(2) with the two bases $p_k$ and $p^\dagger_k$
      end for
    end for
  until Stopping criterion is satisfied
  return $w$
More detailed procedures for steps (a) and (b) are described later. In the algorithm, the number $L$ controls how far each component-wise update converges, which plays an important role for fast convergence, as will be demonstrated in Sec. A.3.

Proposition 9.
If the condition of local stability for $K = 2$ is satisfied for all $k$, Algorithm B($K$, $L$) is locally stable for any $L \geq 1$.

Now let us give the detailed procedures in Algorithm B($K$, $L$). For the $K = 2$ algorithm between $p_k$ and $p^\dagger_k$, the current solution $\hat{q}$ should be represented in the form

$$\eta(\hat{q}) = \omega_k\, \eta(p_k) + \omega^\dagger_k\, \eta(p^\dagger_k), \tag{86}$$

where $\omega^\dagger_k = 1 - \omega_k$; we then calculate $\gamma_k$ and $\gamma^\dagger_k$ based on the Pythagorean relation and apply the updates (15) and (16). From Eq. (81),

$$\eta(\hat{q}) = -\frac{1 - \rho}{\rho}\, \eta(p_k) + \frac{1}{\rho}\, \eta(p^\dagger_k) = \hat{w}_k\, \eta(p_k) + (1 - \hat{w}_k)\, \eta(p^\dagger_k), \tag{87}$$

so the weights for $p_k$ and $p^\dagger_k$ are obtained as $\omega_k = \hat{w}_k$ and $\omega^\dagger_k = 1 - \hat{w}_k$ respectively. The update of $\omega_k$ is written as

$$\omega'_k = \omega_k f(\gamma_k), \quad \omega^{\dagger\prime}_k = \omega^\dagger_k f(\gamma^\dagger_k), \tag{88}$$

and then normalization is performed as

$$\omega''_k = \frac{\omega'_k}{\omega'_k + \omega^{\dagger\prime}_k}, \quad \omega^{\dagger\prime\prime}_k = \frac{\omega^{\dagger\prime}_k}{\omega'_k + \omega^{\dagger\prime}_k} = 1 - \omega''_k. \tag{89}$$

From Eqs. (85) and (86), we have

$$\eta(\hat{q}) = \omega_k\, \eta(p_k) + \frac{\omega^\dagger_k}{1 - \hat{w}_k} \sum_{l \neq k} \hat{w}_l\, \eta(p_l). \tag{90}$$

Updating $\omega_k$ and $\omega^\dagger_k$ to $\omega''_k$ and $\omega^{\dagger\prime\prime}_k$ respectively, the corresponding update of $\{w_l\}_{l=1,\dots,K}$ is given by

$$w''_k = \omega''_k, \quad w''_l = \frac{\omega^{\dagger\prime\prime}_k}{1 - w_k}\, w_l = \frac{1 - w''_k}{1 - w_k}\, w_l, \quad l \neq k. \tag{91}$$

This update requires calculating $p^\dagger_k$ (and related values), which increases the computational complexity. For later discussions, let us rewrite the algorithm for the case where the amount of update is sufficiently small. From the analysis for $K = 2$ (Eq. (34)), if the update $\Delta w_k = \Delta \omega_k = \omega_k f(\gamma_k) - \omega_k = w_k f(\gamma_k) - w_k$ is sufficiently small,

$$\Delta \omega^\dagger_k = \omega^\dagger_k f(\gamma^\dagger_k) - \omega^\dagger_k \simeq -\Delta w_k \tag{92}$$

holds, where $\simeq$ denotes neglecting higher order terms of $\Delta w_k$. By this approximation, the update is simplified as follows:

Update rule 1:

$$\hat{w}''_k = \hat{w}_k + \Delta w_k, \quad \hat{w}''_l = \frac{1 - \hat{w}''_k}{1 - \hat{w}_k}\, \hat{w}_l = \hat{w}_l - \frac{\Delta w_k}{1 - \hat{w}_k}\, \hat{w}_l, \quad l \neq k. \tag{93}$$

Note that calculating $p^\dagger_k$ is no longer necessary. Based on Update rule 1, the algorithm is simplified: Algorithm Ba($K$, $L$) below behaves similarly to Algorithm B($K$, $L$) locally and requires a smaller computation cost.
Proposition 10.
If the condition of local stability for $K = 2$ is satisfied for all $k$, Algorithm Ba($K$, $L$) is locally stable for any $L \geq 1$.

Algorithm 3 Geometrical Algorithm Ba($K$, $L$): component-wise, approximated
  Initialize $\{w_k\}_{k=1,\dots,K}$ s.t. $\sum_{k=1}^K w_k = 1$, $w_k > 0$
  repeat
    for $k = 1, \dots, K$ do
      for count $= 1$ to $L$ do
        Update $w_k$ by Update rule 1
      end for
    end for
  until Stopping criterion is satisfied
  return $w$

A.2 One-side component-wise update
The component-wise update without any approximation requires finding $p^\dagger_k$, which may cause additional complexity compared to Algorithm A($K$). Here we consider a simpler algorithm: only the $k$-th weight is updated while the other weights are fixed, and then all weights are normalized. That is,

Update rule 2:

$$\hat{w}'_k = \hat{w}_k + \Delta w_k, \quad \hat{w}''_k = \frac{\hat{w}'_k}{\hat{w}'_k + \sum_{l \neq k} \hat{w}_l} = \frac{\hat{w}'_k}{1 + \Delta w_k}, \quad \hat{w}''_l = \frac{\hat{w}_l}{1 + \Delta w_k}, \quad l \neq k. \tag{94}$$

This update does not require the computation of $p^\dagger_k$. We examine the relation between Update rules 1 and 2. Assuming $\Delta w_k$ is sufficiently small, Update rule 2 is approximated by

$$\hat{w}''_k \simeq (\hat{w}_k + \Delta w_k)(1 - \Delta w_k) \simeq \hat{w}_k + (1 - \hat{w}_k)\, \Delta w_k, \tag{95}$$

$$\hat{w}''_l \simeq (1 - \Delta w_k)\, \hat{w}_l = \hat{w}_l - \Delta w_k\, \hat{w}_l, \quad l \neq k, \tag{96}$$

which means that Update rule 2 is equivalent to Update rule 1 with the learning constant shortened by the factor $1 - \hat{w}_k$. Therefore, if Update rule 1 is locally stable, Update rule 2 is also locally stable. In a similar way to the component-wise algorithm, we obtain the one-side component-wise algorithm C($K$, $L$) for general $K$ based on Update rule 2. Note that updating $w_k$ affects the values of the other $w_l$ ($l \neq k$) because of the normalization.

Proposition 11.
If the condition of local stability for $K = 2$ is satisfied for all $k$, Algorithm C($K$, $L$) is locally stable for any $L \geq 1$.

A.3 Local stability of the Algorithm A($K$)

Now we are ready to prove the local stability of Algorithm A($K$). Algorithm C($K$, 1) is a sequential algorithm, and we can construct a corresponding "batch" version, Algorithm Cb($K$). Since Algorithm Cb($K$) updates the weights by $K$ perturbations, the condition for local stability changes; the following lemma gives a sufficient condition.
K, L ) Component-wise one-side Initialize { w k } k =1 ,...,K s.t. (cid:80) Kk =1 w k = 1 , w k > repeat for k = 1 , . . . , K do for count = 1 to L do Update w k by Update rule 2 end for end for until Stopping criterion is satisfied return wAlgorithm 5 Geometrical Algorithm Cb( K ) Initialize { w (0) k } k =1 ,...,K s.t. (cid:80) Kk =1 w (0) k = 1 , w (0) k > , t := 0 repeat for k = 1 to K do Calculate the amount of changing values ∆ w l ( k ) of weight w l in theupdate of w k (Eq.(94)). end for Calculate γ k by (14), where θ (ˆ q ) = (cid:80) Ki =1 w ( t ) k θ ( p k ) Update the weights by w (cid:48) k = w ( t ) k + K (cid:88) l =1 ∆ w k ( l ) , w (cid:48) l = w ( t ) l , l (cid:54) = k (97) Normalize w (cid:48) k w ( t +1) k = w (cid:48) k (cid:80) Kk =1 w (cid:48) k (98) t := t + 1 until Stopping criterion is satisfied return w emma 12. Algorithm Cb( K ) is locally stable if f satisfies df (0) dγ < K max k w ∗ k (1 − w ∗ k ) g ( w ∗ k ) , (99)where w ∗ k denotes the optimal value.This bound is given by multiplying 1 /K to (39), but it might be very strict, because(39) for K = 2 is a better bound. Further, as wee see, Algorithm Cb( K ) is very similar toAlgorithm C( K, K > Proof.
In the update of $w_k$, suppose the weight vector changes from $\hat{w}$ to $\hat{w} + \Delta w_k$, where $\sum_{l=1}^K (\Delta w_k)_l = 0$ because of the weight constraint. If $f$ satisfies the condition (39), it holds that

$$\|\hat{w} - w^*\| > \|\hat{w} + \Delta w_k - w^*\| \tag{100}$$

for all $k$. By multiplying the value of $df(0)/d\gamma$ by $1/K$, the change of weights becomes $\hat{w} + \Delta w_k / K$ in terms of the first-order approximation, and with the simultaneous update of all the weights it becomes $\hat{w} + \sum_{k=1}^K \Delta w_k / K$. Therefore, the new weight satisfies

$$\left\|\hat{w} + \sum_{k=1}^K \frac{1}{K}\, \Delta w_k - w^*\right\| = \frac{1}{K}\left\|\sum_{k=1}^K \big(\hat{w} + \Delta w_k - w^*\big)\right\| \leq \max_k \|\hat{w} + \Delta w_k - w^*\| < \|\hat{w} - w^*\|, \tag{101}$$

which shows the local stability of Algorithm Cb($K$). $\square$

The main theorem (Theorem 8) is proved by showing the equivalence between Algorithm A($K$) and Algorithm Cb($K$) as follows.

Lemma 13.
When the update amounts of $\{w_k\}$ are sufficiently small, Algorithm A($K$) is equivalent to Algorithm Cb($K$).

Proof.
By Algorithm A($K$), the weights are updated by

$$w'_k = w_k + \Delta w_k, \tag{102}$$

$$w''_k = \frac{w'_k}{\sum_{l=1}^K w'_l} = \frac{w_k + \Delta w_k}{\sum_{l=1}^K (w_l + \Delta w_l)} = \frac{w_k + \Delta w_k}{1 + \sum_{l=1}^K \Delta w_l} \simeq (w_k + \Delta w_k)\left(1 - \sum_{l=1}^K \Delta w_l\right) \simeq w_k + \Delta w_k - \sum_{l=1}^K \Delta w_l\, w_k. \tag{103}$$

On the other hand, the update of $w_k$ in Update rule 2 for small $\Delta w_k$ is given by (95) and (96). Therefore, the amount of change of $w_l$ in the update of $w_k$ is given by

$$\Delta w_l(k) = -\Delta w_k\, w_l, \quad l \neq k, \tag{104}$$

while $\Delta w_k(k) = (1 - w_k)\, \Delta w_k$. Summing these contributions up, the update of Algorithm Cb($K$) is

$$w'''_k \simeq w_k + (1 - w_k)\, \Delta w_k - \sum_{l \neq k} \Delta w_l\, w_k = w_k + \Delta w_k - \sum_{l=1}^K \Delta w_l\, w_k, \tag{105}$$

which coincides with the update (103). $\square$