Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application
Weiran Wang and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
{wwang5,mcarreira-perpinan}@ucmerced.edu
September 3, 2013
Abstract
We provide an elementary proof of a simple, efficient algorithm for computing the Euclidean projection of a point onto the probability simplex. We also show an application in Laplacian $K$-modes clustering.

Consider the problem of computing the Euclidean projection of a point $\mathbf{y} = [y_1,\dots,y_D]^\top \in \mathbb{R}^D$ onto the probability simplex, which is defined by the following optimization problem:
\begin{align}
\min_{\mathbf{x}\in\mathbb{R}^D}\quad & \tfrac{1}{2}\,\|\mathbf{x}-\mathbf{y}\|^2 \tag{1a}\\
\text{s.t.}\quad & \mathbf{x}^\top\mathbf{1} = 1 \tag{1b}\\
& \mathbf{x} \ge \mathbf{0}. \tag{1c}
\end{align}
This is a quadratic program and the objective function is strictly convex, so there is a unique solution, which we denote by $\mathbf{x} = [x_1,\dots,x_D]^\top$ with a slight abuse of notation. The following $O(D\log D)$ algorithm finds the solution $\mathbf{x}$ to the problem.
Algorithm 1: Euclidean projection of a vector onto the probability simplex.

Input: $\mathbf{y} \in \mathbb{R}^D$.
Sort $\mathbf{y}$ into $\mathbf{u}$: $u_1 \ge u_2 \ge \dots \ge u_D$.
Find $\rho = \max\bigl\{1 \le j \le D : u_j + \tfrac{1}{j}\bigl(1 - \sum_{i=1}^{j} u_i\bigr) > 0\bigr\}$.
Define $\lambda = \tfrac{1}{\rho}\bigl(1 - \sum_{i=1}^{\rho} u_i\bigr)$.
Output: $\mathbf{x}$ such that $x_i = \max\{y_i + \lambda,\, 0\}$, $i = 1,\dots,D$.
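For concreteness, here is a minimal, non-vectorized Matlab sketch of Algorithm 1 for a single row vector y; the function name SimplexProjVec is ours, and a vectorized version that handles many vectors at once is given later.

function x = SimplexProjVec(y)
% Project a single row vector y (1 x D) onto the probability simplex (Algorithm 1).
D = numel(y);
u = sort(y,'descend');                             % sorted components u_1 >= ... >= u_D
cssu = cumsum(u);                                  % partial sums sum_{i<=j} u_i
rho = find(u + (1-cssu)./(1:D) > 0, 1, 'last');    % largest j passing the test
lambda = (1 - cssu(rho))/rho;                      % shift so the positive part sums to 1
x = max(y + lambda, 0);                            % shift and threshold

For example, y = [0.1 0.9 0.7] sorts to u = (0.9, 0.7, 0.1); the tests give 1 > 0, 0.4 > 0 and −0.133 ≤ 0, so ρ = 2, λ = −0.3 and x = (0, 0.6, 0.4), which is nonnegative and sums to 1.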
The complexity of the algorithm is dominated by the cost of sorting the components of $\mathbf{y}$. The algorithm is not iterative and identifies the active set exactly after at most $D$ steps. It can be easily implemented (see the vectorized Matlab code below).

The algorithm has the following geometric interpretation. The solution can be written as $x_i = \max\{y_i + \lambda, 0\}$, $i = 1,\dots,D$, where $\lambda$ is chosen such that $\sum_{i=1}^D x_i = 1$. Place the values $y_1,\dots,y_D$ as points on the X axis. Then the solution is given by a rigid shift of the points such that the points to the right of the Y axis sum to 1.

The pseudocode above appears in Duchi et al. (2008), although earlier papers (Brucker, 1984; Pardalos and Kovoor, 1990) solved the problem in greater generality.

Other algorithms

Problem (1) can be solved in many other ways, for example by particularizing QP algorithms such as active-set methods, gradient-projection methods or interior-point methods. It can also be solved by alternating projection onto the two constraints in a finite number of steps (Michelot, 1986). Another way (Boyd and Vandenberghe, 2004, Exercise 4.1, solution available at http://see.stanford.edu/materials/lsocoee364b/hw4sol.pdf) is to construct a Lagrangian formulation by dualizing the equality constraint and then solve a 1D nonsmooth optimization over the Lagrange multiplier. Algorithm 1 has the advantage of being very simple, not iterative, and identifying exactly the active set at the solution after at most $D$ steps (each of cost $O(1)$) after sorting.

A simple proof

A proof of correctness of Algorithm 1 can be found in Shalev-Shwartz and Singer (2006) and Chen and Ye (2011), but we offer a simpler proof which involves only the KKT theorem. We apply the standard KKT conditions for the problem (Nocedal and Wright, 2006). The Lagrangian of the problem is
$$ L(\mathbf{x},\lambda,\boldsymbol{\beta}) = \tfrac{1}{2}\,\|\mathbf{x}-\mathbf{y}\|^2 - \lambda\,(\mathbf{x}^\top\mathbf{1} - 1) - \boldsymbol{\beta}^\top\mathbf{x}, $$
where $\lambda$ and $\boldsymbol{\beta} = [\beta_1,\dots,\beta_D]^\top$ are the Lagrange multipliers for the equality and inequality constraints, respectively. At the optimal solution $\mathbf{x}$ the following KKT conditions hold:
\begin{align}
x_i - y_i - \lambda - \beta_i &= 0, & i &= 1,\dots,D \tag{2a}\\
x_i &\ge 0, & i &= 1,\dots,D \tag{2b}\\
\beta_i &\ge 0, & i &= 1,\dots,D \tag{2c}\\
x_i\,\beta_i &= 0, & i &= 1,\dots,D \tag{2d}\\
\sum_{i=1}^D x_i &= 1. \tag{2e}
\end{align}
From the complementarity condition (2d), it is clear that if $x_i > 0$ then $\beta_i = 0$ and $x_i = y_i + \lambda > 0$; if $x_i = 0$ then $\beta_i \ge 0$ and $x_i = y_i + \lambda + \beta_i = 0$, whence $y_i + \lambda = -\beta_i \le 0$.
Obviously, the components of the optimal solution $\mathbf{x}$ that are zero correspond to the smaller components of $\mathbf{y}$. Without loss of generality, we assume the components of $\mathbf{y}$ are sorted and $\mathbf{x}$ uses the same ordering, i.e.,
$$ y_1 \ge \dots \ge y_\rho \ge y_{\rho+1} \ge \dots \ge y_D, $$
and that $x_1 \ge \dots \ge x_\rho > x_{\rho+1} = \dots = x_D = 0$. In other words, $\rho$ is the number of positive components in the solution $\mathbf{x}$. Now we apply the last condition (2e) and have
$$ 1 = \sum_{i=1}^D x_i = \sum_{i=1}^\rho x_i = \sum_{i=1}^\rho (y_i + \lambda), $$
which gives $\lambda = \frac{1}{\rho}\bigl(1 - \sum_{i=1}^\rho y_i\bigr)$. Hence $\rho$ is the key to the solution. Once we know $\rho$ (there are only $D$ possible values of it), we can compute $\lambda$, and the optimal solution is obtained by just adding $\lambda$ to each component of $\mathbf{y}$ and thresholding, as at the end of Algorithm 1. (It is easy to check that this solution indeed satisfies all the KKT conditions.) In the algorithm, we carry out the tests $t_j = y_j + \frac{1}{j}\bigl(1 - \sum_{i=1}^j y_i\bigr) > 0$ for $j = 1,\dots,D$, and the largest $j$ for which the test is positive is exactly $\rho$. The following theorem is essentially Lemma 3 of Shalev-Shwartz and Singer (2006).

Theorem 1. Let $\rho$ be the number of positive components in the solution $\mathbf{x}$. Then
$$ \rho = \max\bigl\{1 \le j \le D : y_j + \tfrac{1}{j}\bigl(1 - \textstyle\sum_{i=1}^j y_i\bigr) > 0 \bigr\}. $$

Proof. Recall from the KKT conditions (2) that $\lambda\rho = 1 - \sum_{i=1}^\rho y_i$, that $y_i + \lambda > 0$ for $i = 1,\dots,\rho$, and that $y_i + \lambda \le 0$ for $i = \rho+1,\dots,D$. We show that the test is positive for $j \le \rho$ and non-positive for $j > \rho$, i.e., $y_j + \frac{1}{j}\bigl(1 - \sum_{i=1}^j y_i\bigr) > 0$ for $j \le \rho$ and $y_j + \frac{1}{j}\bigl(1 - \sum_{i=1}^j y_i\bigr) \le 0$ for $j > \rho$.

(i) For $j = \rho$, we have
$$ y_\rho + \frac{1}{\rho}\Bigl(1 - \sum_{i=1}^\rho y_i\Bigr) = y_\rho + \lambda = x_\rho > 0. $$

(ii) For $j < \rho$, we have
$$ y_j + \frac{1}{j}\Bigl(1 - \sum_{i=1}^j y_i\Bigr) = \frac{1}{j}\Bigl(j y_j + 1 - \sum_{i=1}^j y_i\Bigr) = \frac{1}{j}\Bigl(j y_j + \sum_{i=j+1}^\rho y_i + 1 - \sum_{i=1}^\rho y_i\Bigr) = \frac{1}{j}\Bigl(j y_j + \sum_{i=j+1}^\rho y_i + \rho\lambda\Bigr) = \frac{1}{j}\Bigl(j\,(y_j + \lambda) + \sum_{i=j+1}^\rho (y_i + \lambda)\Bigr). $$
Since $y_i + \lambda > 0$ for $i = j,\dots,\rho$, we have $y_j + \frac{1}{j}\bigl(1 - \sum_{i=1}^j y_i\bigr) > 0$.

(iii) For $j > \rho$, we have
$$ y_j + \frac{1}{j}\Bigl(1 - \sum_{i=1}^j y_i\Bigr) = \frac{1}{j}\Bigl(j y_j + 1 - \sum_{i=1}^j y_i\Bigr) = \frac{1}{j}\Bigl(j y_j + 1 - \sum_{i=1}^\rho y_i - \sum_{i=\rho+1}^j y_i\Bigr) = \frac{1}{j}\Bigl(j y_j + \rho\lambda - \sum_{i=\rho+1}^j y_i\Bigr) = \frac{1}{j}\Bigl(\rho\,(y_j + \lambda) + \sum_{i=\rho+1}^j (y_j - y_i)\Bigr). $$
Notice that $y_j + \lambda \le 0$ for $j > \rho$, and that $y_j \le y_i$ for $j \ge i$ since $\mathbf{y}$ is sorted; therefore $y_j + \frac{1}{j}\bigl(1 - \sum_{i=1}^j y_i\bigr) \le 0$. ∎

Remarks
1. We denote $\lambda_j = \frac{1}{j}\bigl(1 - \sum_{i=1}^j y_i\bigr)$. At the $j$-th test, $\lambda_j$ can be considered as a guess of the true $\lambda$ (indeed, $\lambda_\rho = \lambda$). If we use this guess to compute a tentative solution $\bar{\mathbf{x}}$ with $\bar{x}_i = \max\{y_i + \lambda_j, 0\}$, then it is easy to see that $\bar{x}_i > 0$ for $i = 1,\dots,j$ and $\sum_{i=1}^j \bar{x}_i = 1$. In other words, the first $j$ components of $\bar{\mathbf{x}}$ are positive and sum to 1. If we find $\bar{x}_{j+1} = 0$ (or $y_{j+1} + \lambda_j \le 0$), then we know that $j = \rho$, because $\bar{\mathbf{x}}$ satisfies all the KKT conditions.

2. To extend the algorithm to a simplex with a different scale, i.e., $\mathbf{x}^\top\mathbf{1} = a$ for $a > 0$, replace the $1 - \sum u_i$ terms with $a - \sum u_i$ in Algorithm 1.
Matlab code

The following vectorized Matlab code implements Algorithm 1. It projects each row vector of the $N \times D$ matrix Y onto the probability simplex in $D$ dimensions.

function X = SimplexProj(Y)
[N,D] = size(Y);
X = sort(Y,2,'descend');
Xtmp = (cumsum(X,2)-1)*diag(sparse(1./(1:D)));
X = max(bsxfun(@minus,Y,Xtmp(sub2ind([N,D],(1:N)',sum(X>Xtmp,2)))),0);
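A quick sanity check (a usage sketch; the variable names are illustrative): projecting arbitrary vectors should produce nonnegative rows that sum to 1.

Y = randn(1000,5);         % 1000 arbitrary points in R^5
X = SimplexProj(Y);        % project each row onto the probability simplex
max(abs(sum(X,2)-1))       % should be zero up to round-off
min(X(:))                  % should be >= 0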
An application: Laplacian $K$-modes clustering

Consider the Laplacian $K$-modes clustering algorithm of Wang and Carreira-Perpiñán (2013), which is an extension of the $K$-modes algorithm of Carreira-Perpiñán and Wang (2013a). Given a dataset $\mathbf{x}_1,\dots,\mathbf{x}_N \in \mathbb{R}^D$ and suitably defined affinity values $w_{nm} \ge 0$ between each pair of points $\mathbf{x}_n$ and $\mathbf{x}_m$ (for example, Gaussian affinities), we define the objective function
\begin{align}
\min_{\mathbf{Z},\mathbf{C}}\quad & \frac{\lambda}{2} \sum_{m=1}^N \sum_{n=1}^N w_{mn}\,\|\mathbf{z}_m - \mathbf{z}_n\|^2 - \sum_{n=1}^N \sum_{k=1}^K z_{nk}\, G\Bigl(\Bigl\|\frac{\mathbf{x}_n - \mathbf{c}_k}{\sigma}\Bigr\|^2\Bigr) \tag{3a}\\
\text{s.t.}\quad & \sum_{k=1}^K z_{nk} = 1, \quad n = 1,\dots,N \tag{3b}\\
& z_{nk} \ge 0, \quad n = 1,\dots,N,\ k = 1,\dots,K. \tag{3c}
\end{align}
$\mathbf{Z}$ is an $N \times K$ matrix with $\mathbf{Z}^\top = (\mathbf{z}_1,\dots,\mathbf{z}_N)$, where $\mathbf{z}_n = (z_{n1},\dots,z_{nK})^\top$ are the soft assignments of point $\mathbf{x}_n$ to clusters $1,\dots,K$, and $\mathbf{c}_1,\dots,\mathbf{c}_K \in \mathbb{R}^D$ are modes of the kernel density estimates defined for each cluster (where $G(\cdot)$ gives a Gaussian). The problem of projection onto the simplex appears in the training problem, i.e., in finding a (local) minimum $(\mathbf{Z},\mathbf{C})$ of (3), and in the out-of-sample problem, i.e., in assigning a new point to the clusters.

Training
The optimization of (3) is done by alternating minimization over $\mathbf{Z}$ and $\mathbf{C}$. For fixed $\mathbf{C}$, the problem over $\mathbf{Z}$ is a quadratic program in $NK$ variables:
\begin{align}
\min_{\mathbf{Z}}\quad & \lambda \operatorname{tr}\bigl(\mathbf{Z}^\top\mathbf{L}\mathbf{Z}\bigr) - \operatorname{tr}\bigl(\mathbf{B}^\top\mathbf{Z}\bigr) \tag{4a}\\
\text{s.t.}\quad & \mathbf{Z}\,\mathbf{1}_K = \mathbf{1}_N \tag{4b}\\
& \mathbf{Z} \ge \mathbf{0}, \tag{4c}
\end{align}
where $b_{nk} = G(\|(\mathbf{x}_n - \mathbf{c}_k)/\sigma\|^2)$, $n = 1,\dots,N$, $k = 1,\dots,K$, and $\mathbf{L}$ is the graph Laplacian computed from the pairwise affinities $w_{mn}$. The problem is convex since $\mathbf{L}$ is positive semidefinite. One simple and quite efficient way to solve it is to use an (accelerated) gradient projection algorithm. The basic gradient projection algorithm iteratively takes a step along the negative gradient from the current point and projects the new point onto the constraint set. In our case, this projection is simple: it separates over each row of $\mathbf{Z}$ (i.e., the soft assignment of each data point, $n = 1,\dots,N$) and corresponds to the projection of a $K$-dimensional row vector onto the probability simplex, i.e., problem (1); a minimal sketch of such an iteration is given below.
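To illustrate, here is a minimal, unaccelerated gradient projection loop for problem (4) that reuses SimplexProj; the function name TrainZ, the fixed step size eta and the iteration count T are illustrative assumptions, not part of the original method (in practice the step size should be chosen suitably, e.g., by a line search, and acceleration can be added).

function Z = TrainZ(L, B, lambda, Z, eta, T)
% Gradient projection sketch for problem (4).
% L: N x N graph Laplacian, B: N x K similarities b_nk, Z: initial N x K assignments.
for t = 1:T
  G = 2*lambda*(L*Z) - B;        % gradient of lambda*tr(Z'*L*Z) - tr(B'*Z)
  Z = SimplexProj(Z - eta*G);    % project each row onto the probability simplex
end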
Out-of-sample mapping

Given a new, test point $\mathbf{x} \in \mathbb{R}^D$, we wish to assign it to the clusters found during training. We are given the affinity vectors $\mathbf{w} = (w_n)$ and $\mathbf{g} = (g_k)$, where $w_n$ is the affinity between $\mathbf{x}$ and $\mathbf{x}_n$, $n = 1,\dots,N$, and $g_k = G(\|(\mathbf{x} - \mathbf{c}_k)/\sigma\|^2)$, $k = 1,\dots,K$, respectively. A natural and efficient way to define an out-of-sample mapping $\mathbf{z}(\mathbf{x})$ is to solve a problem of the form (3) with a dataset consisting of the original training set augmented with $\mathbf{x}$, but keeping $\mathbf{Z}$ and $\mathbf{C}$ fixed to the values obtained during training (this avoids having to solve for all points again). Hence, the only free parameter is the assignment vector $\mathbf{z}$ for the new point $\mathbf{x}$. After dropping constant terms, the optimization problem (3) reduces to the following quadratic program over $K$ variables:
\begin{align}
\min_{\mathbf{z}}\quad & \|\mathbf{z} - (\bar{\mathbf{z}} + \gamma\,\mathbf{g})\|^2 \tag{5a}\\
\text{s.t.}\quad & \mathbf{z}^\top\mathbf{1}_K = 1 \tag{5b}\\
& \mathbf{z} \ge \mathbf{0}, \tag{5c}
\end{align}
where $\gamma = 1/\bigl(2\lambda \sum_{n=1}^N w_n\bigr)$ and
$$ \bar{\mathbf{z}} = \frac{\mathbf{Z}^\top\mathbf{w}}{\mathbf{w}^\top\mathbf{1}} = \sum_{n=1}^N \frac{w_n}{\sum_{n'=1}^N w_{n'}}\,\mathbf{z}_n $$
is a weighted average of the training points' assignments, so $\bar{\mathbf{z}} + \gamma\,\mathbf{g}$ is itself an average between this and the point-to-centroid affinities. Thus, the solution is the projection of the $K$-dimensional vector $\bar{\mathbf{z}} + \gamma\,\mathbf{g}$ onto the probability simplex, i.e., problem (1) again.
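For concreteness, a minimal sketch of this out-of-sample assignment (the function name and argument layout are illustrative; w, g and lambda are as defined above):

function z = OutOfSampleZ(Z, w, g, lambda)
% Z: N x K training assignments, w: N x 1 affinities to the training points,
% g: 1 x K affinities to the modes.
zbar = (Z'*w)'/sum(w);              % weighted average of the training assignments (1 x K)
gamma = 1/(2*lambda*sum(w));
z = SimplexProj(zbar + gamma*g);    % problem (5) = projection onto the simplex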
LASS

In general, the optimization problem of eq. (4) is called the Laplacian assignment model (LASS) by Carreira-Perpiñán and Wang (2013b). Here, we consider $N$ items and $K$ categories and want to learn a soft assignment $z_{nk}$ of each item to each category, given an $N \times N$ item-item graph Laplacian matrix $\mathbf{L}$ and an $N \times K$ item-category similarity matrix $\mathbf{B}$. The training is as in eq. (4) and the out-of-sample mapping as in eq. (5), and the problem of projection onto the simplex appears in both cases.

References
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, U.K., 2004.
P. Brucker. An $O(n)$ algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, Aug. 1984.
M. Á. Carreira-Perpiñán and W. Wang. The K-modes algorithm for clustering. Unpublished manuscript, arXiv:1304.6478, Apr. 23 2013a.
M. Á. Carreira-Perpiñán and W. Wang. A simple assignment model with Laplacian smoothing. Unpublished manuscript, 2013b.
Y. Chen and X. Ye. Projection onto a simplex. Unpublished manuscript, arXiv:1101.6081, Feb. 10 2011.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In A. McCallum and S. Roweis, editors, Proc. of the 25th Int. Conf. Machine Learning (ICML'08), pages 272–279, Helsinki, Finland, July 5–9 2008.
C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of $\mathbb{R}^n$. J. Optimization Theory and Applications, 50(1):195–200, July 1986.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer-Verlag, New York, second edition, 2006.
P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Math. Prog., 46(1–3):321–328, Jan. 1990.
S. Shalev-Shwartz and Y. Singer. Efficient learning of label ranking by soft projections onto polyhedra. Journal of Machine Learning Research, 7:1567–1599, 2006.