Robust Discriminative Clustering with Sparse Regularizers
Nicolas Flammarion [email protected]
Balamurugan Palaniappan [email protected]
Francis Bach [email protected]
INRIA - Sierra Project-team, Département d'Informatique de l'École Normale Supérieure (CNRS - ENS - INRIA), Paris, France
Abstract
Clustering high-dimensional data often requires some form of dimensionality reduction, where clustered variables are separated from "noise-looking" variables. We cast this problem as finding a low-dimensional projection of the data which is well-clustered. This yields a one-dimensional projection in the simplest situation with two clusters, and extends naturally to a multi-label scenario for more than two clusters. In this paper, (a) we first show that this joint clustering and dimension reduction formulation is equivalent to previously proposed discriminative clustering frameworks, thus leading to convex relaxations of the problem; (b) we propose a novel sparse extension, which is still cast as a convex relaxation and allows estimation in higher dimensions; (c) we propose a natural extension for the multi-label scenario; (d) we provide a new theoretical analysis of the performance of these formulations with a simple probabilistic model, leading to scalings of the form d = O(√n) for the affine invariant case and d = O(n) for the sparse case, where n is the number of examples and d the ambient dimension; and finally, (e) we propose an efficient iterative algorithm with running-time complexity proportional to O(nd), improving on earlier algorithms which had quadratic complexity in the number of examples.
1. Introduction
Clustering is an important and commonly used pre-processing tool in many machine learning applications, with classical algorithms such as K-means (MacQueen, 1967), linkage algorithms (Gower and Ross, 1969) or spectral clustering (Ng et al., 2002). In high dimensions, these unsupervised learning algorithms typically have problems identifying the underlying optimal discrete nature of the data; for example, they are quickly perturbed by adding a few noisy dimensions. Clustering high-dimensional data thus requires some form of dimensionality reduction, where clustered variables are separated from "noise-looking" (e.g., Gaussian) variables.

Several frameworks aim at linearly separating noise from signal, that is, finding projections of the data that extract the signal and remove the noise. They differ in the way signal and noise are defined. A line of work that dates back to projection pursuit (Friedman and Stuetzle, 1981) and independent component analysis (Hyvärinen et al., 2004) defines the noise as Gaussian while the signal is non-Gaussian (Blanchard et al., 2006; Le Roux and Bach, 2013; Diederichs et al., 2013). In this paper, we follow the work of De la Torre and Kanade (2006); Ding and Li (2007), along the alternative route where one defines the signal as being clustered while the noise is any non-clustered variable. In the simplest situation with two clusters, we may project the data onto a one-dimensional subspace. Given a data matrix X ∈ R^{n×d} composed of n d-dimensional points, the goal is to find a direction w ∈ R^d such that Xw ∈ R^n is well-clustered, e.g., by K-means.
This is equivalent to identifying both a direction to project, represented as w ∈ R^d, and the labeling y ∈ {−1, 1}^n that represents the partition into two clusters.

Most existing formulations are non-convex and typically perform a form of alternating optimization (De la Torre and Kanade, 2006; Ding and Li, 2007), where given y ∈ {−1, 1}^n, the projection w is found by linear discriminant analysis (or any binary classification method), and given the projection w, the clustering is obtained by thresholding Xw or running K-means on Xw. As shown in Section 2, this alternating minimization procedure happens to be equivalent to maximizing the (centered) correlation between y ∈ {−1, 1}^n and the projection Xw ∈ R^n, that is,

max_{w ∈ R^d, y ∈ {−1,1}^n} (y⊤Π_n Xw)² / (‖Π_n y‖² ‖Π_n Xw‖²),

where Π_n = I_n − (1/n) 1_n 1_n⊤ is the usual centering projection matrix (with 1_n ∈ R^n the vector of all ones, and I_n the n × n identity matrix). This correlation is equal to one when the projection is perfectly clustered (independently of the number of elements per cluster). Existing methods are alternating minimization algorithms with no theoretical guarantees.

In this paper, we relate this formulation to discriminative clustering formulations (Xu et al., 2004; Bach and Harchaoui, 2007), which consider the problem

min_{v ∈ R^d, b ∈ R, y ∈ {−1,1}^n} (1/n) ‖y − Xv − b 1_n‖², (1)

with the intuition of finding labels y which are easy to predict by an affine function of the data.
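To make the alternating scheme concrete, here is a minimal numpy sketch on synthetic data (all names and sizes are illustrative, not from the paper). The least-squares step plays the role of linear discriminant analysis, which for two classes yields the same direction up to scaling, and the one-dimensional K-means step is solved exactly by scanning contiguous splits of the sorted projections:

```python
import numpy as np

def exact_2means_1d(p):
    """Exact K-means with 2 clusters on a 1-D projection in O(n log n):
    sort, then scan all contiguous splits for the smallest distortion."""
    idx = np.argsort(p)
    s = p[idx]
    n = len(s)
    c1, c2 = np.cumsum(s), np.cumsum(s ** 2)
    best_sse, best_k = np.inf, 1
    for k in range(1, n):  # left cluster = first k sorted points
        left = c2[k - 1] - c1[k - 1] ** 2 / k
        right = (c2[-1] - c2[k - 1]) - (c1[-1] - c1[k - 1]) ** 2 / (n - k)
        if left + right < best_sse:
            best_sse, best_k = left + right, k
    y = np.ones(n)
    y[idx[:best_k]] = -1.0
    return y

def correlation_objective(w, y, X):
    """Squared centered correlation between y and Xw; equals 1 when the
    projection is perfectly clustered."""
    yc, pc = y - y.mean(), X @ w - (X @ w).mean()
    return float((yc @ pc) ** 2 / ((yc @ yc) * (pc @ pc)))

rng = np.random.default_rng(0)
n, d = 200, 5
y_true = np.repeat([-1.0, 1.0], n // 2)          # hidden balanced labels
X = np.column_stack([y_true, rng.standard_normal((n, d - 1))])
X -= X.mean(axis=0)                              # center, so Pi_n X = X

y = exact_2means_1d(X @ rng.standard_normal(d))  # random initialization
for _ in range(30):
    w = np.linalg.solve(X.T @ X, X.T @ y)        # LDA-like step: (X^T X)^{-1} X^T y
    y = exact_2means_1d(X @ w)                   # exact 1-D K-means step
print(correlation_objective(w, y, X))
```

Each step is in closed form or exact, so the correlation objective stays in [0, 1] and increases along the iterations; the final value depends on the initialization, in line with the lack of guarantees for such alternating schemes.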
In particular, we show that given the relationship between the number of positive labels and negative labels (i.e., the squared difference between the respective numbers of elements), these two problems are equivalent, and hence discriminative clustering explicitly performs joint dimension reduction and clustering.

While the discriminative framework is based on convex relaxations and has led to interesting developments and applications (Zhang et al., 2009; Joulin et al., 2010a,b; Wang et al., 2010), it has several shortcomings: (a) the running-time complexity of the semi-definite formulations is at least quadratic in n, and typically much more; (b) no theoretical analysis has ever been performed; (c) no convex sparse extension has been proposed to handle data with many irrelevant dimensions; (d) balancing of the clusters remains an issue, as it typically adds an extra hyperparameter which may be hard to set. In this paper, we focus on addressing these concerns.

When there are more than two clusters, one considers either the multi-label or the multi-class setting. The multi-class problem assumes that the data are clustered into distinct classes, i.e., a single class per observation, whereas the multi-label problem assumes the data share different labels, i.e., multiple labels per observation. We show in this work that the discriminative clustering framework extends more naturally to multi-label scenarios and that this extension has the same convex relaxation.

A summary of the contributions of this paper follows:

− In Section 2, we relate discriminative clustering with the square loss to a joint clustering and dimension reduction formulation. The proposed formulation takes care of the balancing hyperparameter implicitly.

− We propose in Section 3 a novel sparse extension to discriminative clustering and show that it can still be cast through a convex relaxation.

− When there are more than two clusters, we extend naturally the sparse formulation to a multi-label scenario in Section 4.
− We then proceed to provide a theoretical analysis of the proposed formulations with a simple probabilistic model in Section 5, which effectively leads to scalings of the form d = O(√n) for the affine invariant case and d = O(n) for the 1-sparse case.

− Finally, we propose in Section 6 efficient iterative algorithms with running-time complexity for each step equal to O(nd), the first to be linear in the number of observations n.

Throughout this paper we assume that X ∈ R^{n×d} is centered, a common pre-processing step in unsupervised (and supervised) learning. This implies that X⊤1_n = 0 and Π_n X = X.
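As a quick numerical check of this pre-processing step (toy data, illustrative names), subtracting the column means realizes both X⊤1_n = 0 and Π_n X = X:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                 # centered design matrix

ones = np.ones((n, 1))
Pi = np.eye(n) - ones @ ones.T / n  # Pi_n = I_n - (1/n) 1_n 1_n^T
assert np.allclose(X.T @ ones, 0)   # X^T 1_n = 0
assert np.allclose(Pi @ X, X)       # Pi_n X = X
```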
2. Joint Dimension Reduction and Clustering
In this section, we focus on the single binary label case, where we first study the usual non-convex formulation, before deriving convex relaxations based on semi-definite programming.
Following De la Torre and Kanade (2006); Ding and Li (2007); Ye et al. (2008), we consider a cost function which depends on y ∈ {−1, 1}^n and w ∈ R^d, and which is such that alternating optimization is exactly (a) running K-means with two clusters on Xw to obtain y given w (when we say "running K-means", we mean solving the vector quantization problem exactly), and (b) performing linear discriminant analysis to obtain w given y. Proposition 1 (Joint clustering and dimension reduction)
Given X ∈ R^{n×d} such that X⊤1_n = 0 and X has rank d, consider the optimization problem

max_{w ∈ R^d, y ∈ {−1,1}^n} (y⊤Xw)² / (‖Π_n y‖² ‖Xw‖²). (2)

Given y, the optimal w is obtained as w = (X⊤X)⁻¹X⊤y, while given w, the optimal y is obtained by running K-means on Xw. Proof
Given y, we need to optimize the Rayleigh quotient (w⊤X⊤y y⊤Xw)/(w⊤X⊤Xw) with a rank-one matrix in the numerator, which leads to w = (X⊤X)⁻¹X⊤y. Given w, we show in Appendix A that the averaged distortion measure of K-means, once the means have been optimized, is exactly equal to (y⊤Xw)²/(‖Π_n y‖² ‖Xw‖²). Algorithm.
The proposition above leads to an alternating optimization algorithm. Note that K-means in one dimension may be run exactly in O(n log n) (Bellman, 1973). Moreover, after having optimized with respect to w in Eq. (2), we then need to maximize with respect to y the function

y⊤X(X⊤X)⁻¹X⊤y / ‖Π_n y‖²,

which happens to be exactly performing K-means on the whitened data (which is now in high dimension and not in one dimension). At first, it seems that dimension reduction is simply equivalent to whitening the data and performing K-means; while this is a formally correct statement, the resulting K-means problem is not easy to solve as the clustered dimension is hidden in noise; for example, algorithms such as K-means++ (Arthur and Vassilvitskii, 2007), which have a multiplicative theoretical guarantee on the final distortion measure, are not provably effective here because the minimal final distortion is then not small, and the multiplicative guarantee is meaningless.

The discriminative clustering formulation in Eq. (1) may be optimized for any y ∈ {−1, 1}^n in closed form with respect to b as b = 1_n⊤(y − Xv)/n = 1_n⊤y/n since X is centered. Substituting b in Eq. (1) leads us to

min_{v ∈ R^d} (1/n) ‖Π_n y − Xv‖² = (1/n) ‖Π_n y‖² − (1/n) max_{w ∈ R^d} (y⊤Xw)²/‖Xw‖², (3)

where v is obtained from any solution w as v = w (y⊤Xw)/‖Xw‖². Thus, given

(y⊤1_n)²/n² = (1/n²) (#{i, y_i = 1} − #{i, y_i = −1})² = α ∈ [0, 1), (4)

which characterizes the asymmetry between clusters, and with ‖Π_n y‖² = n(1 − α), we obtain from Eq. (3) an equivalent formulation to Eq. (2) (with the added constraint) as

min_{y ∈ {−1,1}^n, v ∈ R^d} (1/n) ‖Π_n y − Xv‖² such that (y⊤1_n)²/n² = α. (5)

This is exactly equivalent to a discriminative clustering formulation with the square loss. Following Bach and Harchaoui (2007), we may optimize Eq. (5) in closed form with respect to v as v = (X⊤X)⁻¹X⊤y. Substituting v in Eq. (5) leads us to

min_{y ∈ {−1,1}^n} (1/n) y⊤(Π_n − X(X⊤X)⁻¹X⊤)y such that (y⊤1_n)²/n² = α. (6)

This combinatorial optimization problem is NP-hard in general (Karp, 1972; Garey et al., 1976). Hence in practice, it is classical to consider the following convex relaxation of Eq. (6) (Luo et al., 2010). For any admissible y ∈ {−1, +1}^n, the matrix Y = yy⊤ ∈ R^{n×n} is a rank-one symmetric positive semi-definite matrix with unit diagonal entries, and conversely any such Y may be written in the form Y = yy⊤ such that y is admissible for Eq. (6). Moreover, by rewriting Eq. (6) as

min_{y ∈ {−1,1}^n} (1/n) tr yy⊤(Π_n − X(X⊤X)⁻¹X⊤) such that 1_n⊤(yy⊤)1_n/n² = α,

we see that the objective and constraints are linear in the matrix Y = yy⊤, and Eq. (6) is equivalent to

min_{Y ⪰ 0, rank(Y) = 1, diag(Y) = 1_n} (1/n) tr Y(Π_n − X(X⊤X)⁻¹X⊤) such that 1_n⊤Y1_n/n² = α.

Then dropping the non-convex rank constraint leads us to the following classical convex relaxation:

min_{Y ⪰ 0, diag(Y) = 1_n} (1/n) tr Y(Π_n − X(X⊤X)⁻¹X⊤) such that 1_n⊤Y1_n/n² = α. (7)

This is the standard (unregularized) formulation, which is cast as a semi-definite program. The complexity of interior-point methods for such programs is prohibitive, but efficient algorithms in O(n²) for such problems have been developed due to the relationship with the max-cut problem (Journée et al., 2010; Wen et al., 2012).

Given the solution Y, one may traditionally obtain a candidate y ∈ {−1, 1}^n by running K-means on the largest eigenvector of Y or by sampling (Goemans and Williamson, 1995). In this paper, we show in Section 5 that it may be advantageous to consider the first two eigenvectors.

The formulation in Eq. (7) imposes an extra parameter α that characterizes the cluster imbalance. It is tempting to find a direct relaxation of Eq. (2). It turns out to lead to a trivial relaxation, which we outline below. When optimizing Eq. (2) with respect to w, we obtain the following optimization problem:

max_{y ∈ {−1,1}^n} y⊤X(X⊤X)⁻¹X⊤y / (y⊤Π_n y),

leading to a quasi-convex relaxation

max_{Y ⪰ 0, diag(Y) = 1_n} tr Y X(X⊤X)⁻¹X⊤ / tr Π_n Y,

whose solution is found by solving a sequence of convex problems (Boyd and Vandenberghe, 2004, Section 4.2.5). As shown in Appendix B, this may be exactly reformulated as a single convex problem:

max_{M ⪰ 0, diag(M) = (1 + 1_n⊤M1_n/n²) 1_n} (1/n) tr M X(X⊤X)⁻¹X⊤.

Unfortunately, this relaxation always leads to trivial solutions, and we thus need to consider the relaxation in Eq. (7) for several values of α = 1_n⊤Y1_n/n² (the non-convex algorithm can then be run from the rounded solution of the convex problem, using Eq. (2) as a final objective). Alternatively, we may solve the following penalized problem for several values of ν > 0:

min_{Y ⪰ 0, diag(Y) = 1_n} (1/n) tr Y(Π_n − X(X⊤X)⁻¹X⊤) + (ν/n²) 1_n⊤Y1_n. (8)

For ν = 0, Y = 1_n1_n⊤ is always a trivial solution. As outlined in our theoretical section and as observed in our experiments, it is sufficient to consider ν ∈ [0, 1].

Optimizing Eq. (5) with respect to v in closed form as in Section 2.2 is feasible with no regularizer or with a quadratic regularizer. However, if one needs to add more complex regularizers, we need a different relaxation. We start from the penalized version of Eq. (5),

min_{y ∈ {−1,1}^n, v ∈ R^d} (1/n) ‖Π_n y − Xv‖² + ν (y⊤1_n)²/n², (9)

which we expand as

min_{y ∈ {−1,1}^n, v ∈ R^d} (1/n) tr Π_n yy⊤ − (2/n) tr Xvy⊤ + (1/n) tr X⊤Xvv⊤ + ν (y⊤1_n)²/n², (10)

and relax, using Y = yy⊤, P = yv⊤ and V = vv⊤, as

min_{V, P, Y} (1/n) tr Π_n Y − (2/n) tr P⊤X + (1/n) tr X⊤XV + ν 1_n⊤Y1_n/n² s.t. (Y P; P⊤ V) ⪰ 0, diag(Y) = 1_n. (11)

When optimizing Eq. (11) with respect to V and P, we get back exactly Eq. (8). Indeed, the optimum is attained for V = (X⊤X)⁻¹X⊤YX(X⊤X)⁻¹ and P = YX(X⊤X)⁻¹, as shown in Appendix C.1. Therefore, the convex relaxation in Eq. (11) is equivalent to Eq. (8). However, we get an interesting behavior when optimizing Eq. (11) with respect to P and Y in closed form as well. For ν = 1, we obtain, as shown in Appendix C.2, the following closed-form expressions:

Y = Diag(diag(XVX⊤))^{−1/2} XVX⊤ Diag(diag(XVX⊤))^{−1/2},
P = Diag(diag(XVX⊤))^{−1/2} XV,

leading to the problem

min_{V ⪰ 0} −(2/n) Σ_{i=1}^n √((XVX⊤)_{ii}) + (1/n) tr(VX⊤X). (12)

The formulation above in Eq. (12) is interesting for several reasons: (a) it is formulated as an optimization problem in V ∈ R^{d×d}, which will lead to algorithms whose running time depends linearly on n (see Section 6); (b) it allows for easy addition of regularizers (see Section 3), which may be formulated as convex functions of V = vv⊤. However, note that this is valid only for ν = 1. We now show how to reformulate any problem with ν ∈ [0, 1).

Reformulation for any ν. When ν ∈ [0, 1),

(1/n) ‖Π_n y − Xv‖² + ν (y⊤1_n/n)²
= (1/n) ‖y − Xv − (1 − ν)(y⊤1_n/n) 1_n‖² + [ν/(1 − ν)] ((1 − ν)(y⊤1_n/n))²
= min_{b ∈ R} (1/n) ‖y − Xv − b 1_n‖² + [ν/(1 − ν)] b², (13)

since (1/n)‖y − Xv − b1_n‖² + [ν/(1 − ν)] b² can be optimized in closed form with respect to b as b = (1 − ν) y⊤1_n/n. Note that the weighted imbalance ratio (1 − ν) y⊤1_n/n is thus made an optimization variable in Eq. (13). We therefore have the reformulation

min_{v ∈ R^d, y ∈ {−1,1}^n} (1/n) ‖Π_n y − Xv‖² + ν (y⊤1_n)²/n² = min_{v ∈ R^d, b ∈ R, y ∈ {−1,1}^n} (1/n) ‖y − Xv − b1_n‖² + [ν/(1 − ν)] b², (14)

which is a non-centered penalized formulation on a higher-dimensional problem in the variable (v; b) ∈ R^{d+1}. In the rest of the paper, we will focus on the case ν = 1, as it is simpler to present, noticing that by adding a constant term and a quadratic regularizer we may treat the problem with equal ease when ν ∈ [0, 1).
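To make Eq. (12) concrete, the following numpy sketch (synthetic, perfectly clustered balanced data; names are illustrative) evaluates the objective f(V) = −(2/n) Σ_i √((XVX⊤)_ii) + (1/n) tr(VX⊤X) at the rank-one candidate V = vv⊤ with v = (X⊤X)⁻¹X⊤y; since Xv = y exactly in this construction, each diagonal entry equals 1 and the value is −2 + 1 = −1:

```python
import numpy as np

def eq12_objective(V, X):
    """f(V) = -(2/n) sum_i sqrt((X V X^T)_ii) + (1/n) tr(V X^T X)."""
    n = X.shape[0]
    # (X V X^T)_ii computed without forming the n x n matrix
    diag = np.einsum('ij,jk,ik->i', X, V, X)
    return float(-2.0 / n * np.sum(np.sqrt(np.maximum(diag, 0.0)))
                 + np.trace(V @ (X.T @ X)) / n)

rng = np.random.default_rng(0)
n, d = 200, 6
y = np.repeat([-1.0, 1.0], n // 2)               # balanced labels
X = np.column_stack([y, rng.standard_normal((n, d - 1))])
X -= X.mean(axis=0)

v = np.linalg.solve(X.T @ X, X.T @ y)            # v = (X^T X)^{-1} X^T y
V = np.outer(v, v)
print(eq12_objective(V, X))                      # approx. -1.0 here
```

Note that the cost of evaluating the objective is O(nd²), linear in n, which is the computational appeal of working with V ∈ R^{d×d} rather than Y ∈ R^{n×n}.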
3. Regularization
There are several natural possibilities. We consider norms Ω such that Ω(w)² = Γ(ww⊤) for a certain convex function Γ; all norms have that form (Bach et al., 2011, Proposition 5.1). When ν = 1, Eq. (12) then becomes

max_{V ⪰ 0} (2/n) Σ_{i=1}^n √((XVX⊤)_{ii}) − (1/n) tr(VX⊤X) − Γ(V). (15)

Quadratic regularizers Γ(V) = tr ΛV have already been tackled by Bach and Harchaoui (2007). They consider the regularized version of the problem in Eq. (3),

min_{v ∈ R^d} (1/n) ‖Π_n y − Xv‖² + v⊤Λv, (16)

and optimize in closed form with respect to v as v = (X⊤X + nΛ)⁻¹X⊤y. Substituting v in Eq. (16) leads them to

min_{Y ⪰ 0, diag(Y) = 1_n} (1/n) tr Y(Π_n − X(X⊤X + nΛ)⁻¹X⊤).

In this paper, we formulate a novel sparse regularizer, which combines a weighted squared ℓ₂-norm and an ℓ₁-norm. It leads to

Γ(V) = tr[Diag(a) V Diag(a)] + ‖Diag(c) V Diag(c)‖₁,

such that Γ(vv⊤) = Σ_{i=1}^d a_i² v_i² + (Σ_{i=1}^d c_i |v_i|)². This allows us to treat all situations simultaneously, with ν = 1 or with ν ∈ [0, 1): following Eq. (14), the case ν ∈ [0, 1) corresponds to a problem in dimension d + 1 with a design matrix [X, 1_n] ∈ R^{n×(d+1)}, a direction of projection (v; b) ∈ R^{d+1}, and different weights for the last variable, with a²_{d+1} = ν/(1 − ν) and c_{d+1} = 0.

Note that the sparse regularizers on V introduced in this paper are significantly different from sparse regularizers on the variable v in Eq. (3), considered for example by Wang et al. (2013). A straightforward sparse regularizer on v in Eq. (3), despite leading to a sparse projection, does not yield natural generalizations of the discriminative clustering framework in terms of theory or algorithms. The sparse regularizers considered in this paper, in addition to their algorithmic appeal for certain applications, also lead to robust cluster recovery under minor assumptions, as will be illustrated on a simple example in Section 5.
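A small numerical check of the identity behind this regularizer, with arbitrary illustrative weights a and c:

```python
import numpy as np

def gamma_reg(V, a, c):
    """Gamma(V) = tr[Diag(a) V Diag(a)] + || Diag(c) V Diag(c) ||_1,
    where ||.||_1 is the entrywise l1-norm of the matrix."""
    Da, Dc = np.diag(a), np.diag(c)
    return float(np.trace(Da @ V @ Da) + np.abs(Dc @ V @ Dc).sum())

rng = np.random.default_rng(0)
d = 6
v = rng.standard_normal(d)
a = rng.uniform(0.5, 2.0, d)
c = rng.uniform(0.5, 2.0, d)

# on rank-one matrices V = v v^T, Gamma reduces to the claimed expression
lhs = gamma_reg(np.outer(v, v), a, c)
rhs = np.sum(a ** 2 * v ** 2) + np.sum(c * np.abs(v)) ** 2
assert np.isclose(lhs, rhs)
```

Both terms are convex in V (a linear trace term and a weighted entrywise ℓ₁-norm), so Γ can be added to the relaxation in Eq. (15) while preserving convexity.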
4. Extension to Multiple Labels
The discussion so far has focused on two clusters. Yet it is key in practice to tackle more clusters. It is worth noting that the discrete formulations in Eq. (2) and Eq. (5) extend directly to more than two clusters. However, two different extensions of the initial problems in Eq. (2) or Eq. (5) are conceivable. They lead to problems with different constraints on different optimization domains and, consequently, to different relaxations. We discuss these possibilities next.

One extension is the multi-class case. The multi-class problem, which is dealt with by Bach and Harchaoui (2007), assumes that the data are clustered into K classes, and the various partitions of the data points into clusters are represented by the K-class indicator matrices y ∈ {0, 1}^{n×K} such that y1_K = 1_n. The constraint y1_K = 1_n ensures that each data point belongs to only one cluster. However, as discussed by Bach and Harchaoui (2007), by letting Y = yy⊤, it is possible to lift these K-class indicator matrices into the outer convex approximations

C_K = { Y ∈ R^{n×n} : Y = Y⊤, diag(Y) = 1_n, Y ⪰ 0, Y ⪰ (1/K) 1_n1_n⊤ }

(Frieze and Jerrum, 1995), which are different for all values of K. Note that letting K = 2 corresponds to the previous sections.

We now discuss the other possible extension, which is the multi-label case. The multi-label problem assumes that the data share k labels and that the data-label membership is represented by matrices y ∈ {−1, +1}^{n×k}. In other words, the multi-class problem embeds the data in the extreme points of a simplex, while the multi-label problem does so in the extreme points of the hypercube. The discriminative clustering formulation of the multi-label problem is

min_{v ∈ R^{d×k}, y ∈ {−1,1}^{n×k}} (1/n) ‖Π_n y − Xv‖²_F, (17)

where the Frobenius norm is defined for any vector or rectangular matrix as ‖A‖²_F = tr AA⊤ = tr A⊤A. Letting k = 1 here corresponds to the previous sections.
The discrete set of matrices y ∈ {−1, +1}^{n×k} can be naturally lifted into

D_k = { Y ∈ R^{n×n} : Y = Y⊤, diag(Y) = k 1_n, Y ⪰ 0 },

since diag(Y)_i = diag(yy⊤)_i = Σ_{j=1}^k y²_{ij} = k. As the optimization problems in Eq. (7) and Eq. (8) have linear objective functions, we can change the variable from Y to Ỹ = Y/k, turning the constraint diag(Y) = k 1_n into diag(Ỹ) = 1_n without changing the optimizer of the problem. Thus the problems can be solved over the relaxed domain

D = { Y ∈ R^{n×n} : Y = Y⊤, diag(Y) = 1_n, Y ⪰ 0 },

which is independent of k. Note that the domain D is the same as the one considered in the problems in Eq. (8) and Eq. (11), and these convex relaxations are the same regardless of the value of k. Hence the multi-label problem is a more natural extension of the discriminative framework, with only a slight change in how the labels y are recovered from the solution Y (we discuss this in Section 5.3).
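A short numpy sketch of this lifting and rescaling (random multi-label matrix, illustrative sizes): the lifted matrix has diagonal k 1_n, and after the change of variable Ỹ = Y/k it satisfies the constraints of D independently of k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 3
y = rng.choice([-1.0, 1.0], size=(n, k))   # multi-label matrix in {-1,+1}^(n x k)
Y = y @ y.T                                # lifted matrix

assert np.allclose(np.diag(Y), k)          # diag(Y) = k 1_n
Yt = Y / k                                 # change of variable Y~ = Y / k
assert np.allclose(np.diag(Yt), 1.0)       # diag = 1_n, independent of k
assert np.linalg.eigvalsh(Yt).min() >= -1e-8   # Y~ remains PSD
```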
5. Theoretical Analysis
In this section, we provide a theoretical analysis of the discriminative clustering framework. We start with the two-cluster situation: the non-sparse case is considered first, and the analysis is provided for both balanced and imbalanced clusters. Our study of the sparse case currently only provides results for the simple 1-sparse solution. However, the analysis also yields valuable insights on the scaling between n and d. We then derive results for the multi-label situation.

For ease of analysis, we consider the constrained problem in Eq. (7), the penalized problem in Eq. (8) or their equivalent relaxations in Eq. (12) or Eq. (15) under various scenarios, for which we use the same proof technique. We first characterize the low-rank solutions of these relaxations and then show, in certain simple situations, the uniqueness of such solutions, which are then non-ambiguously found by convex optimization. Perturbation arguments could extend these results by weakening our assumptions but are not within the scope of this paper, and hence we do not investigate them further in this section.

5.1 Two clusters: non-sparse problems

In this section, we consider several noise models for the problem, either adding irrelevant dimensions or perturbing the label vector with noise. We consider these separately for simplicity, but they could also be combined (with little extra insight).
We consider an "ideal" design matrix X ∈ R^{n×d} such that there exists a direction v₀ along which the projection Xv₀ is perfectly clustered into two distinct real values c₁ and c₂. Since Eq. (2) is invariant by affine transformation, we can rotate the design matrix X so that X = [y, Z] with y ∈ {−1, 1}^n, which is clustered into +1 or −1, with v = (1, 0, …, 0)⊤ ∈ R^d. Then, after being centered, the design matrix is written as X = [Π_n y, Z] with Z = [z₁, …, z_{d−1}] ∈ R^{n×(d−1)}. The columns of Z represent the noisy irrelevant dimensions added on top of the signal y.

When the problem is well balanced (y⊤1_n = 0), y is already centered and Π_n y = y. Thus the design matrix is represented as X = [y, Z]. We consider here the penalized formulation in Eq. (8) with ν = 1, which is easier to analyze in this setting. Let us assume that the columns (z_i)_{i=1,…,d−1} of Z are i.i.d. with symmetric distribution z, with E z = E z³ = 0, and such that ‖z‖_∞ is almost surely bounded by R ≥ 0. We denote by E z² = m its second moment and by E z⁴/(E z²)² = β its (unnormalized) kurtosis.

Surprisingly, the clustered vector y happens to generate a solution yy⊤ of the relaxation in Eq. (8) for all possible values of Z (see Lemma 11 in Appendix D.2). However, the problem in Eq. (8) should have a unique solution in order to always recover the correct assignment y. Unfortunately, the semidefinite constraint Y ⪰ 0 makes the second-order information arduous to study. For this reason, we consider the other equivalent relaxation in Eq. (12), for which V∗ = vv⊤ is also a solution, with v ∝ (X⊤X)⁻¹X⊤y (see Lemma 12 in Appendix D.3). Fortunately, with the semidefinite constraint V ⪰ 0, the curvature of the objective function already provides unicity for the unconstrained problem. Hence we are able to ensure the uniqueness of the solution with high probability, and the following result provides the first guarantee for discriminative clustering. Proposition 2
Let us assume d ≥ 2, β > 1 and m ≥ (β − 1)/(d + β − 1):

(a) If n is larger than a constant (depending on R, m and β) times d², then V∗ is the unique solution of the problem in Eq. (12) with high probability.

(b) If n is larger than a constant (depending on R, m and β) times d, then v is the principal eigenvector of any solution of the problem in Eq. (12) with high probability.

Let us make the following observations:

− Proof technique: The proof relies on a computation of the Hessian of f(V) = −(2/n) Σ_{i=1}^n √((XVX⊤)_{ii}) + (1/n) tr X⊤XV, which is the objective function in Eq. (12). We first derive the expectation of ∇²f(V) with respect to the distribution of X; by the law of large numbers, this amounts to letting n go to infinity in ∇²f(V). Then we expand the spectrum of the operator E∇²f(V) to lower-bound its smallest eigenvalue. Finally, we use concentration theory on matrices, following Tropp (2012), to bound the Hessian ∇²f(V) for finite n.

− Effect of kurtosis: We recall that β > 1, with equality if and only if z follows a Rademacher law (P(z = +1) = P(z = −1) = 1/2). Thus β − 1 behaves like a distance of the distribution of z to the Rademacher distribution. Moreover, β = 3 if z follows a standard normal distribution.

− Scaling between d and n: If the noisy variables are not evenly clustered between the same clusters {±1}, the solution is unique as soon as n is of order d²; while, as long as n is of order d, the solution may not be unique, but its principal eigenvector recovers the correct clustering. Moreover, as explained in the proof, its spectrum would be very spiky.

− The assumption m ≥ (β − 1)/(d + β − 1) is generally satisfied for large dimensions. Note that md is the total variance of the irrelevant dimensions, and when it is small, i.e., when m ≤ (β − 1)/(d + β − 1), the problem is particularly simple, and we can also show that V∗ is the unique solution of the problem in Eq. (12) with high probability for n larger than a constant (depending on R and m) times d. Finally, note that a similar discussion applies to sub-Gaussian distributions.

One-dimensional balanced problem. We assume now that the data are one-dimensional and perturbed by some noise ε ∈ R^n, such that X = y + ε with y ∈ {−1, 1}^n. The solution of the relaxation in Eq. (8) recovers the correct y in this setting only when each component of y and y + ε has the same sign (this is shown in Appendix D.5). This result comes out naturally from the information on whether the signs of y and y + ε agree or not. Further, if we assume that y and ε are independent, this condition is equivalent to ‖ε‖_∞ < 1.

When the clusters are imbalanced (y⊤1_n ≠ 0), the natural rank-one candidates Y∗ = yy⊤ and V∗ = vv⊤ are no longer solutions of the relaxations in Eq. (8) (for ν = 1) and Eq. (12), as proved in Appendix D.6. Nevertheless, we are able to characterize some solutions of the penalized relaxation in Eq. (8) for ν = 0. Lemma 3
For ν = 0 and for any non-negative a, b ∈ R such that a + b = 1, Y = a yy⊤ + b 1_n1_n⊤ is a solution of the penalized relaxation in Eq. (8).

Hence any eigenvector of such a solution Y is supported by the directions y and 1_n. Moreover, when the value α∗ = (1_n⊤y/n)² is known, it turns out that we can characterize some solutions of the constrained relaxation in Eq. (7), as stated in the following lemma. Lemma 4
For α ≥ α∗,

Y = [(1 − α)/(1 − α∗)] yy⊤ + (1 − (1 − α)/(1 − α∗)) 1_n1_n⊤

is a rank-2 solution of the constrained relaxation in Eq. (7) with constraint parameter α.

The eigenvectors of Y enable us to recover y for α∗ ≤ α < 1. We conjecture (and checked empirically) that this rank-2 solution is unique under regimes similar to those considered for the balanced case. The proof would be more involved since, when ν ≠ 1, we are not able to derive an equivalent problem in V for the penalized relaxation in Eq. (8) similar to Eq. (12) for the balanced case.

Thus, Y being rank-2, one should really be careful and consider the first two eigenvectors when recovering y from a solution Y. This can be done by rounding the principal eigenvector of Π_n Y Π_n = [(1 − α)/(1 − α∗)] Π_n y (Π_n y)⊤, as discussed in the following lemma. Lemma 5
Let y_ev be the principal eigenvector of Π_n Y Π_n, where Y is defined in Lemma 4. Then sign(y_ev) = y.
By definition of Y , y ev = q − α − α ∗ Π n y thus sign( y ev ) = sign(Π n y ) and since α ≤ n y ) = sign( y − √ α n ) = y .In practice, contrary to the standard procedure, we should, for any ν , solve the penalizedrelaxation in Eq. (8) and then do K -means on the principal eigenvector of the centeredsolution Π n Y Π n instead of the solution Y to recover the correct y . This procedure isfollowed in our experiments on real-world data in Section 7.2. lammarion, Palaniappan and Bach clusters: -sparse problems We assume here that the direction of projection v (such that Xv = y ) is l -sparse (by l -sparsewe mean k v k = l ). The ℓ -norm regularized problem in Eq. (15) is no longer invariantby affine transformation and we cannot consider that X = [ y, Z ] without loss of generality.Yet the relaxation Eq. (15) seems experimentally to only have rank-one solutions for thesimple l = 1 situation. Hence we are able to derive some theoretical analysis only for thiscase. It is worth noting the l = 1 case is simple since it can be solved in O ( d ) by using K -means separately on all dimensions and ranking them. Nonetheless the proposed scalingalso holds in practice for l > X = [ y, Z ] with y ∈ {− , } n and Z ∈ R n × ( d − which areclustered in the direction v = [1 , , . . . , | {z } d − ] ⊤ . When adding a ℓ -penalty, the initial problemin Eq. (5) for ν = 1 is min y ∈{− , } n , v ∈ R d n k y − Xv k + λ k v k . When optimizing in v this problem is close to the Lasso (Tibshirani, 1996) and a solutionis known to be v ∗ i = ( y ⊤ y + nλ ) − y ⊤ y = λ , ∀ i ∈ J and v ∗ i = 0 , ∀ i ∈ { , , . . . , d } \ J , where J is the support of v ∗ . The candidate V ∗ = v ∗ v ∗⊤ is still a solution of the relaxationin Eq. (15) (see Lemma 15 in Appendix E.1) and we will investigate under which conditionson X the solution is unique. Let us assume as before ( z i ) i =1 ,...,d are i.i.d. with distribution z symmetric with E z = E z = 0, and denote by E z = m and E z / ( E z ) = β . 
We alsoassume that k z k ∞ is almost surely bounded by 0 ≤ R ≤
1. We are able to ensure theuniqueness of the solution with high-probability.
Proposition 6
Let us assume d ≥ .(a) If n ≥ dR d + β ) m m ( β − , V ∗ is the unique solution of the problem Eq. (12) with highprobability.(b) If n ≥ dR m ( β − , v ∗ is the principal eigenvector of any solution of the problem Eq. (12)with high probability. The proof technique is very similar to the one of Proposition 2. With the function g ( V ) = n P ni =1 p ( XV X ⊤ ) ii − λ k V k − n tr X ⊤ XV , we can certify that g will decreasearound the solution V ∗ by analyzing the eigenvalues of its Hessian.The rank-one solution V ∗ is recovered by the principal eigenvector of the solution ofthe relaxation Eq. (15) as long as n = O ( d ). Thus we have a much better scaling whencompared to the non-sparse setting where n = O ( d ). We also conjecture a scaling of order n = O ( ld ) for a projection in a l -sparse direction (see Figure 1b for empirical results).The proposition does not state any particular value for the regularizer parameter λ .This makes sense since the proposition only holds for the simple situation when l = 1. Wepropose to use λ = 1 / √ n by analogy with the Lasso. In this section, the signals share k labels which are corrupted by some extra noisy dimen-sions. We assume the centered design matrix to be X = [Π n y, Z ] where y ∈ {− , +1 } n × k nd Z ∈ R n × ( d − k ) . We also assume that y is full-rank . We denote by y = [ y , . . . , y k ] and α i = (cid:16) y ⊤ i n n (cid:17) for i = 1 , · · · , k . We consider the discrete constrained problemmin v ∈ R d × k , y ∈{− , } n × k n k Π n y − Xv k F such that 1 ⊤ n yy ⊤ n n = α , (18)and the discrete penalized problem for ν = 0min v ∈ R d × k , y ∈{− , } n × k n k Π n y − Xv k F . (19)As explained in Section 4, these two discrete problems admit the same relaxations inEq. (7) and Eq. (8) we have studied for one label. We now investigate when the solutionof the problems in Eq. (18) and in Eq. (19) generate solutions of the relaxations in Eq. (7)and Eq. 
(8).

By analogy with Lemma 3, we want to characterize the solutions of these relaxations which are supported by the constant vector 1_n and the labels (y_1, ..., y_k). Their general form is Y = ỹ A ỹ⊤, where A ∈ R^{(k+1)×(k+1)} is symmetric positive semidefinite and ỹ = [1_n, y]. However, the initial y is easily recovered from the solution Y only when A is diagonal. To that end, the following lemma derives a condition under which the only matrices A such that the corresponding Y satisfies the constraints of the relaxations in Eq. (7) and Eq. (8) are diagonal.

Lemma 7
The solutions of the matrix equation diag(ỹ A ỹ⊤) = 1_n, with unknown variable A, are diagonal if and only if the family {1_n, (y_i)_{1≤i≤k}, (y_i ⊙ y_j)_{1≤i<j≤k}} is linearly independent.

Lemma 8 Let us assume that the family {1_n, (y_i)_{1≤i≤k}, (y_i ⊙ y_j)_{1≤i<j≤k}} is linearly independent.¹ Then the solutions of the relaxations in Eq. (7) and Eq. (8) which are supported by (1_n, y_1, ..., y_k) are of the form Y = a_0² 1_n 1_n⊤ + Σ_{i=1}^k a_i² y_i y_i⊤, for coefficients (a_i)_{0≤i≤k} specified in the proof.

1. This assumption is fairly reasonable, since the probability that a matrix whose entries are i.i.d. Rademacher random variables is singular is conjectured to be (1/2 + o(1))^n (Bourgain et al., 2010).

In the multi-label case, some combinations of the constant matrix 1_n 1_n⊤ and of the rank-one matrices y_i y_i⊤ are thus solutions of the constrained or penalized relaxations. Furthermore, under some assumptions on the labels (y_i)_{1≤i≤k}, these combinations are the only solutions which are supported by the vectors (1_n, y_1, ..., y_k). We conjecture (and checked empirically) that, under assumptions similar to those made for the balanced one-label case, all the solutions of the relaxation are supported by the family (1_n, y_1, ..., y_k) and consequently have the same form as in Lemma 8. The eigenvectors of the solution Y then lie in the span of the directions (1_n, y_1, ..., y_k).

Let us consider an eigenvalue decomposition Y = FF⊤ = Σ_{i=0}^k λ_i e_i e_i⊤ and denote M = [a_0 1_n, a_1 y_1, ..., a_k y_k], where the (a_i)_{0≤i≤k} are defined in Lemma 8. Since MM⊤ = FF⊤, there exists an orthogonal transformation R such that FR = M. We write FR = [ξ_0, ..., ξ_k]. We now propose an alternating minimization procedure to recover the labels (y_1, ..., y_k) from F.

Lemma 9 Consider the optimization problem

min_{M ∈ M, R ∈ R^{(k+1)×(k+1)}: R⊤R = I_{k+1}} ‖FR − M‖²_F,

where M = {[a_0 1_n, a_1 y_1, ..., a_k y_k], a ∈ R^{k+1}: ‖a‖_2 = 1, y_i ∈ {±1}^n}. Given M, the problem is equivalent to the orthogonal Procrustes problem (Schönemann, 1966): denoting by U∆V⊤ a singular value decomposition of F⊤M, the optimal R is obtained as R = UV⊤.
Given R, the optimal M is obtained as

M = (1/√(‖ξ_0‖_1² + ‖ξ_1‖_1² + ··· + ‖ξ_k‖_1²)) [‖ξ_0‖_1 sign(ξ_0), ..., ‖ξ_k‖_1 sign(ξ_k)].

Proof We only give the argument for the optimization with respect to M. Given R, the optimization problem in M is equivalent to max_{a ∈ R^{k+1}: ‖a‖_2 = 1, y ∈ {−1,1}^{n×k}} tr (FR)⊤M, and tr (FR)⊤M = a_0 ξ_0⊤ 1_n + Σ_{i=1}^k a_i ξ_i⊤ y_i. Thus, by the properties of dual norms, the solution is given by y_i = sign(ξ_i) and a_i = ‖ξ_i‖_1 / √(‖ξ_0‖_1² + ‖ξ_1‖_1² + ··· + ‖ξ_k‖_1²).

The minimization problem in Lemma 9 is non-convex; however, we observe that performing a few alternating optimizations is sufficient to recover the correct (y_1, ..., y_k) from F.

In this section we studied the tightness of the convex relaxations in simple scenarios where the relaxed problem admits low-rank solutions generated by the solution of the original non-convex problem. Unfortunately, the solutions lose the characterized rank as soon as the initial problem is slightly perturbed, since the rank of a matrix is not a continuous function. Nevertheless, the spectrum of the perturbed solution remains strongly spiked, so these results are quite conservative: we empirically observe that the principal eigenvectors keep recovering the correct information outside these scenarios. However, this simple proof mechanism does not adapt to perturbed problems in a straightforward way, since it is difficult to characterize the properties of the eigenvectors of the solution of a semidefinite program. Hence we are able to derive a proper theoretical study only for these simple models.

6. Algorithms

In this section, we present an optimization algorithm which is adapted to settings with large n and avoids the n-dimensional semidefinite constraint. We aim to solve the general regularized problem corresponding to Eq. (15):

max_{V ≽ 0} (2/n) Σ_{i=1}^n √((XVX⊤)_ii) − (1/n) tr V(X⊤X + n Diag(a)) − ‖Diag(c) V Diag(c)‖_1.
(20)

We consider a slightly different optimization problem:

max_{V ≽ 0} (2/n) Σ_{i=1}^n √((XVX⊤)_ii) − ‖Diag(c) V Diag(c)‖_1   s.t.   tr V((1/n) X⊤X + Diag(a)) = 1.   (21)

When c is equal to zero, Eq. (21) is exactly equivalent to Eq. (20); when c is small (as will typically be the case in our experiments), the solutions are very similar; in fact, one can show by Lagrangian duality that by solving a sequence of problems of the form of Eq. (21), one may obtain the solution to Eq. (20). Letting A = X⊤X/n + Diag(a), we consider a strongly-concave approximation of Eq. (21):

max_{V ≽ 0} (2/n) Σ_{i=1}^n √((XVX⊤)_ii) − ‖Diag(c) V Diag(c)‖_1 − ε tr[(A^{1/2} V A^{1/2}) log(A^{1/2} V A^{1/2})]   s.t.   tr(A^{1/2} V A^{1/2}) = 1,

where −tr M log(M) is a concave spectral function, the von Neumann entropy (von Neumann, 1927). The difference between the values of the two problems is known to be at most ε log(d) (Nesterov, 2007). As shown in Appendix G.1, the dual problem is

min_{u ∈ R^n_+, C ∈ R^{d×d}: |C_ij| ≤ c_i c_j} (1/n) Σ_{i=1}^n 1/u_i + φ_ε(A^{−1/2}((1/n) X⊤ Diag(u) X − C) A^{−1/2}),   (22)

where φ_ε(M) is an ε-smooth approximation of the maximal eigenvalue of the matrix M. In order to solve Eq. (22), we split the objective function into a smooth part F(u, C) = φ_ε(A^{−1/2}((1/n) X⊤ Diag(u) X − C) A^{−1/2}) and a non-smooth part H(u, C) = I_{|C_ij| ≤ c_i c_j} + (1/n) Σ_{i=1}^n 1/u_i. We may then apply FISTA (Beck and Teboulle, 2009) updates to the smooth part F, along with the proximal operators of the non-smooth terms I_{|C_ij| ≤ c_i c_j} and (1/n) Σ_{i=1}^n 1/u_i, which may be computed efficiently. See details in Appendix G.2.

Running-time complexity. Since we need to project onto the SDP cone of size d at each iteration, the running-time complexity per iteration is O(d³ + d²n); given that often n > d, the dominating term is O(d²n). It is still an open problem to make this linear in d.
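To make the pieces of Eq. (22) concrete, here is a sketch of the smooth term and the two proximal operators. We assume φ_ε is the usual log-sum-exp smoothing of the maximal eigenvalue (the entropy smoothing above yields this form up to an additive constant); the cubic-root computation for the prox of 1/(nu), like all function names, is our own illustrative choice:

```python
import numpy as np

def phi_eps(M, eps):
    # Smoothed maximal eigenvalue: eps * log sum_i exp(lambda_i / eps).
    # Satisfies lambda_max(M) <= phi_eps(M) <= lambda_max(M) + eps * log(d).
    lam = np.linalg.eigvalsh(M)
    m = lam.max()
    return m + eps * np.log(np.exp((lam - m) / eps).sum())

def prox_box(C, c):
    # Prox of the indicator of {|C_ij| <= c_i c_j}: entrywise clipping.
    bound = np.outer(c, c)
    return np.clip(C, -bound, bound)

def prox_inv(z, gamma, n):
    # Prox of f(u) = 1/(n u) on u > 0: for each coordinate, minimize
    # 1/(n u) + (u - z)^2 / (2 gamma), i.e. take the unique positive root
    # of the cubic u^3 - z u^2 - gamma / n = 0.
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    for i, zi in enumerate(z):
        roots = np.roots([1.0, -zi, 0.0, -gamma / n])
        real = roots[np.abs(roots.imag) < 1e-9].real
        out[i] = real[real > 0][0]
    return out
```

The cubic always has exactly one positive root (the polynomial is negative at 0 and increasing for large u), so the prox is well defined coordinate-wise.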
Our function being O(1/ε)-smooth, the convergence rate is of the form O(1/(εt²)). Since we stop when the duality gap is ε log(d) (as we use smoothing, it is not useful to go lower), the number of iterations is of order 1/(ε√log(d)).

7. Experiments

We implemented the proposed algorithm in Matlab. The code has been made available at https://drive.google.com/uc?export=download&id=0B5Bx9jrp7celMk5pOFI4UGt0ZEk. Two sets of experiments were performed: one on synthetically generated data sets and the other on real-world data sets. The details of the experiments follow.

In this section, we illustrate our theoretical results and algorithms on synthetic examples. The synthetic data were generated by assuming a fixed clustering with α∗ ∈ [0, 1]. We measure the clustering error of an estimated labeling ȳ as 1 − (ȳ⊤y/n)², with values in [0, 1] and equal to zero if and only if ȳ = ±y.

Phase transition. We first illustrate our theoretical results for the balanced case in Figure 1. We solve the relaxation for a large range of d and n using the cvx solver (Grant and Boyd, 2008, 2014). We show the results averaged over 4 replications and take λ_n = 1/√n for the sparse problems. In Figure 1a we investigate whether cvx finds a rank-one solution for a problem of size (n, d) (the value is 1 if the solution is rank-one and 0 otherwise). We compare the performance of the algorithms without ℓ₁-regularization in the affine invariant case and with ℓ₁-regularization in the 1-sparse case. We observe a phase transition, with a scaling of the form n = O(d²) for the affine invariant case and n = O(d) for the 1-sparse case. This is better than what is expected from the theory and corresponds rather to the performance of the principal eigenvector of the solution. It is worth noting that it may be difficult to reliably distinguish a rank-one solution from a spiked solution.

We also solve the relaxation for 4-sparse problems of different sizes d and n and plot the clustering error.
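The clustering-error measure just defined is a one-liner (the function name is ours):

```python
import numpy as np

def clustering_error(y_hat, y):
    # 1 - (y_hat^T y / n)^2: lies in [0, 1] and is 0 iff y_hat = y or y_hat = -y.
    n = len(y)
    return 1.0 - (float(y_hat @ y) / n) ** 2

y = np.array([1.0, -1.0, 1.0, 1.0])
```

Note that the measure is invariant to a global sign flip of the estimated labels, as a clustering is only defined up to relabeling of the two clusters.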
We compare the performance of the algorithms without ℓ₁-regularization in the affine invariant case and with ℓ₁-regularization in the 4-sparse case in Figure 1b. We notice a phase transition of the clustering error, with a scaling of the form n = O(d²) for the affine invariant case and n = O(d) for the 4-sparse case. This supports our conjecture of a scaling of order n = O(ld) for l-sparse problems. Comparing the left plots of Figure 1a and Figure 1b, we observe that the two phase transitions occur at the same scaling between n and d. Thus there are few values of (n, d) for which the cvx solver finds a solution whose rank is strictly larger than one and whose principal eigenvector has a low clustering error. This illustrates that, in practice, this solver tends to find a rank-one solution under the improved scaling n = O(d²).

Figure 1: Phase transition plots (data size n versus noise dimension d). (a) Phase transition for the rank-one solution; left: affine invariant case, right: 1-sparse case. (b) Phase transition for the clustering error; left: affine invariant case, right: 4-sparse case.

Figure 2: Unbalanced problem for n = 80, d = 10 and α∗ = 0.25. Left: clustering error for the constrained relaxation. Middle: rank of the solution for noise levels σ = 0, 0.1, 0.3. Right: clustering error for the penalized relaxation.

Unbalanced case. We generate an unbalanced problem for d = 10, n = 80 and α∗ = 0.25. We compare the first two eigenvectors of the solution Y with a projection method which uses (Π_n Y^(2) Π_n)^(1), where Y^(k) denotes the best rank-k approximation of Y, to extract the information of y.
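The projection method (Π_n Y^(2) Π_n)^(1) can be sketched as follows: take the best rank-2 approximation of Y, center it, and read the labels off the sign of the leading eigenvector (a minimal version; function names are ours):

```python
import numpy as np

def best_rank_k(Y, k):
    # Best rank-k approximation of a symmetric matrix from its eigendecomposition.
    lam, U = np.linalg.eigh(Y)
    idx = np.argsort(np.abs(lam))[::-1][:k]
    return (U[:, idx] * lam[idx]) @ U[:, idx].T

def projection_method(Y):
    # (Pi_n Y^(2) Pi_n)^(1): center the best rank-2 approximation of Y and
    # take the sign of its leading eigenvector as the estimated labels.
    n = Y.shape[0]
    Pi = np.eye(n) - np.ones((n, n)) / n
    Z = Pi @ best_rank_k(Y, 2) @ Pi
    lam, U = np.linalg.eigh(Z)
    return np.sign(U[:, np.argmax(lam)])
```

On an idealized unbalanced solution of the form a_0² 1_n 1_n⊤ + a_1² yy⊤, the centering removes the constant component and the leading eigenvector of the remainder recovers y up to a global sign.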
We see in Figure 2 that (a) for the constrained case, the range of α over which the sign of y is recovered is cut into two parts, on each of which one of the two eigenvectors is correct, whereas the projection method performs well on the whole range; (b) for the penalized case, the correct sign is recovered for ν close to 0 by the first eigenvector and by the projection method, whereas the second eigenvector always performs badly; (c) when there is no noise, the rank of the solution is one for α ∈ {α∗, 1}, two for α ∈ (α∗, 1) and greater otherwise. These findings confirm our analysis. However, when y is corrupted by some noise, this result no longer holds.

Runtime experiments. We also generated data with a k-sparse direction of projection v by adding d − k noise variables to randomly generated and rotated k-dimensional data. The proposed optimization method, implemented using FISTA (Beck and Teboulle, 2009), was compared against a benchmark cvx solver to assess its scalability. Experiments were performed both without (λ = 0) and with (λ > 0) regularization of the ‖V‖₁ term. For a fixed d, cvx breaks down for large values of n; its runtime is generally high for λ = 0 and comparable to our method for λ > 0. When λ = 0, the problem reduces to the original Diffrac problem (Bach and Harchaoui, 2007) and can hence be compared to an equivalent max-cut SDP (Boumal et al., 2014). We observed that our method is comparable, in terms of runtime and clustering performance, to low-rank methods for max-cut (Figure 3). However, for λ > 0, the equivalence with max-cut disappears.

The plots in these figures show the behavior of FISTA for two different stopping criteria: ε = 0.01/log(d) and ε = 0.001/log(d). The choice 0.001/log(d) gives a more accurate solution at the cost of more iterations (and hence a higher runtime). For the sparse problems in Figure 3b, we see that cvx achieves a better clustering performance (while crashing for large n); the difference would be reduced with a smaller duality gap for FISTA.
Figure 3: Scalability experiments (average clustering error and average CPU time). (a) Comparison with cvx and max-cut for λ = 0; top: n varied with d = 50, k = 6 (cvx crashed for large n); bottom: d varied with n = 100, k = 2. (b) Comparison with cvx for λ > 0; top: n varied with d = 50, k = 6 (cvx crashed for large n); bottom: d varied with n = 100, k = 2.

Clustering performance. Experiments comparing the proposed method with K-means and alternating optimization are given in Figure 4. K-means is run on the whitened variables in R^d. Alternating optimization (Ye et al., 2008) is another popular method for dimensionality reduction with clustering, where an alternating optimization over w and y is performed to solve the non-convex formulation (2). The plots show that both K-means and alternating optimization fail when even a few dimensions of noise variables are present. The plots also show that, with the introduction of a sparse regularizer (corresponding to a non-zero λ), the proposed method becomes more robust to noisy dimensions.
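Running K-means "on the whitened variables" means clustering after a transform that makes the empirical covariance the identity, restoring affine invariance; a minimal sketch (implementation details are ours):

```python
import numpy as np

def whiten(X):
    # Whitened variables: center the data, then rotate/scale so that the
    # empirical covariance X^T X / n becomes the identity matrix.
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    lam, U = np.linalg.eigh(Xc.T @ Xc / n)
    return Xc @ U @ np.diag(lam ** -0.5) @ U.T

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 5)) * np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Xw = whiten(X)
```

This symmetric ("ZCA") form of whitening is one standard choice; any invertible square root of the inverse covariance serves the same purpose.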
As observed earlier, the performance of FISTA is also sensitive to the choice of ε.

Figure 4: Comparison with K-means and alternating optimization, for λ = 0 (a) and for two non-zero values of λ (b, c); each panel reports the average clustering error.

Experiments on real two-class data. Experiments were conducted on real two-class classification datasets² to compare the performance of sparse discriminative clustering against non-sparse discriminative clustering, alternating optimization and K-means. For the two-class datasets, the clustering performance for a labeling ȳ ∈ {+1, −1}^n obtained from an algorithm under comparison was computed as 1 − (ȳ⊤y/n)², where y is the original labeling. Here we explicitly compare the output of clustering with the original labels of the data points.

2. The data sets were obtained from publicly available sources.

The dataset details and clustering performance results are summarized in Table 1. The experiments for discriminative clustering were conducted for several values of the parameters a and c, associated with the ℓ₂-regularizer and the ℓ₁-regularizer respectively, and the cluster imbalance parameter ν was chosen from a small grid of values. The results given in Table 1 pertain to the best choices of these parameters. The results for alternating optimization and K-means show the average cluster error (and standard deviation) over 10 different runs. These results show that the cluster error is quite high for many datasets. This is primarily due to the absence of an ambient low-dimensional clustering of the two-class data which can be identified by the simple linear model presented in this paper.
The results also show that adding sparse regularizers to discriminative clustering helps in better cluster identification when compared to the non-sparse case and to the other algorithms, such as alternating optimization and K-means.

Table 1: Experiments on two-class datasets. For each dataset (with n examples and d features), we report the cluster error of sparse discriminative clustering, non-sparse discriminative clustering, alternating optimization and K-means (mean ± standard deviation where applicable).

Figure 5: Plot of Tr(Φ_{Y_true} Φ_{Y_k}) as a function of k, with alternating minimization and K-means baselines.

Experiments on real multi-label data. Experiments were also conducted on the Microsoft COCO dataset³ to demonstrate the effectiveness of the proposed method in discovering multiple labels. We considered n = 2000 images from the dataset, each of which was labeled with a subset of K = 80 labels. The labels identify the objects present in the images (person, car, chair, table, etc.), and the corresponding features for each image were extracted from the last layer of a conventional convolutional neural network (CNN); the CNN was originally trained on the ImageNet data (Krizhevsky et al., 2012). For each image in the dataset, we obtained d = 1000 features. We then performed discriminative clustering on the 2000 × 1000 matrix X and obtained a label matrix Y, which was then subjected to the alternating optimization procedure (see Section 5.3).

It is clearly unlikely to recover perfect labels; we therefore describe a way of measuring the amount of information which is recovered. In order to extract meaningful cluster information from the result so obtained, we computed the correlation matrix Y_k⊤ Π_n Y_true, where Y_true is the n × K label matrix containing the actual labels and Π_n is the n × n centering matrix I_n − (1/n) 1_n 1_n⊤. The k predicted labels are the columns of the matrix Y_k.
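The span-overlap score Tr(Φ_{Y_true} Φ_{Y_k}) used below compares the column spaces of the two label matrices: it lies between 0 and min(K, k) and counts shared label directions (a small sketch; function names are ours):

```python
import numpy as np

def proj(Y):
    # Orthogonal projector Phi_Y = Y (Y^T Y)^{-1} Y^T onto the column space of Y.
    return Y @ np.linalg.solve(Y.T @ Y, Y.T)

def label_overlap(Y_true, Y_k):
    # Tr(Phi_{Y_true} Phi_{Y_k}): the number of label directions shared by
    # the column spans of Y_true and Y_k.
    return float(np.trace(proj(Y_true) @ proj(Y_k)))
```

The score is invariant to invertible recombinations of the labels, which is the right notion here since the predicted labels are only defined up to such recombinations.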
In order to choose an appropriate value of k, we plotted Tr(Φ_{Y_true} Φ_{Y_k}) (shown in Figure 5, along with a K-means baseline), where Φ_{Y_k} = Y_k (Y_k⊤ Y_k)^{−1} Y_k⊤. From these plots, we chose k = 30 as a suitable value for our interpretation purposes.

Having chosen k = 30, we plotted the correlations between the actual and predicted labels. The heatmap of the normalized absolute correlations is given in Figure 6, where the columns and rows, corresponding to the 80 true labels and the 30 predicted labels respectively, are ordered according to the sum of squared correlations (the top-scoring labels appear at the bottom left). From this plot, we extract the following highly correlated labels: person, dining table, car, chair, cup, tennis racket, bowl, truck, fork and pizza, showing that these labels were partially recovered by our unsupervised technique (note that the CNN features are learned with supervision on the different dataset ImageNet, hence there is still some partial supervision).

3. Dataset obtained from http://mscoco.org/dataset

Figure 6: Heatmap of the correlations Y_k⊤ Π_n Y_true with k = 30, with columns and rows ordered according to the sum of squared correlations.

8. Conclusion

In this paper, we provided a sparse extension of the discriminative clustering framework and gave a first analysis of its theoretical performance in the totally unsupervised situation, highlighting provable scalings between the ambient dimension d, the number of observations n, and the "clusterability" of irrelevant variables. We also proposed an efficient algorithm which is the first of its kind to be linear in the number of observations.
Our work could be extended in a number of ways, e.g., extending the sparse analysis to the l-sparse case with larger l, considering related weakly supervised learning extensions (Joulin and Bach, 2012), going beyond uniqueness of rank-one solutions, and improving the complexity of our algorithm to O(nd), for example using stochastic gradient techniques.

References

D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007.

F. Bach and Z. Harchaoui. DIFFRAC: a discriminative and flexible framework for clustering. In Adv. NIPS, 2007.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2009.

R. Bellman. A note on cluster analysis and dynamic programming. Mathematical Biosciences, 1973.

G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. The Journal of Machine Learning Research, 7:247–282, 2006.

N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 2014.

J. Bourgain, V. H. Vu, and P. M. Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559–603, 2010.

S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

F. De la Torre and T. Kanade. Discriminative cluster analysis. In Proc. ICML, 2006.

E. Diederichs, A. Juditsky, A. Nemirovski, and V. Spokoiny. Sparse non-Gaussian component analysis by semidefinite programming. Machine Learning, 91(2):211–238, 2013.

C. Ding and T. Li.
Adaptive dimension reduction using discriminant analysis and K-means clustering. In Proc. ICML, 2007.

D. Freedman. Statistical Models: Theory and Practice. Cambridge University Press, 2009.

J. H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.

A. Frieze and M. Jerrum. Improved approximation algorithms for MAX k-CUT and MAX BISECTION. In Integer Programming and Combinatorial Optimization. Springer, 1995.

M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoret. Comput. Sci., 1(3):237–267, 1976.

M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145, November 1995.

J. C. Gower and G. J. S. Ross. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 18(1), 1969.

M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag Limited, 2008.

M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1, March 2014.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.

A. Joulin and F. Bach. A convex relaxation for weakly supervised classifiers. In Proc. ICML, 2012.

A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proc. CVPR, 2010a.

A. Joulin, J. Ponce, and F. Bach. Efficient optimization for discriminative latent class models. In Adv. NIPS, 2010b.

M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 2010.

R. M. Karp. Reducibility among combinatorial problems.
In Complexity of Computer Computations, pages 85–103. Plenum, New York, 1972.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. NIPS, 2012.

N. Le Roux and F. Bach. Local component analysis. In Proceedings of the International Conference on Learning Representations, 2013.

Z. Q. Luo, W. K. Ma, A. C. So, Y. Ye, and S. Zhang. Semidefinite relaxation of quadratic optimization problems. Signal Processing Magazine, IEEE, 2010.

J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Math. Program., 2007.

A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Adv. NIPS, 2002.

P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1), 1966.

R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 1996.

J. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 2012.

J. von Neumann. Thermodynamik quantenmechanischer Gesamtheiten. Gött. Nach., (1):273–291, 1927.

F. Wang, B. Zhao, and C. Zhang. Linear time maximum margin clustering. IEEE Transactions on Neural Networks, 2010.

H. Wang, F. Nie, and H. Huang. Multi-view clustering and feature learning via structured sparsity. In Proc. ICML, volume 28, 2013.

Z. Wen, D. Goldfarb, and K. Scheinberg. Block coordinate descent methods for semidefinite programming. In Handbook on Semidefinite, Conic and Polynomial Optimization. Springer, 2012.

L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Adv. NIPS, 2004.

J. Ye, Z. Zhao, and M. Wu.
Discriminative k-means for clustering. In Adv. NIPS, 2008.

K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. IEEE Transactions on Neural Networks, 2009.

Appendix A. Joint clustering and dimension reduction

Given y, we need to optimize the Rayleigh quotient (w⊤X⊤yy⊤Xw)/(w⊤X⊤Xw), with a rank-one matrix in the numerator, which leads to w = (X⊤X)^{−1}X⊤y. Given w, we will show that the averaged distortion measure of K-means, once the means have been optimized, is exactly equal to 1 − (y⊤Π_n Xw)²/(‖Π_n Xw‖² ‖Π_n y‖²). Given the data matrix X ∈ R^{n×d}, K-means with two components approximates the data points in X by the centroids c_+ ∈ R^d and c_- ∈ R^d:

X ≈ ((y + 1_n)/2) c_+⊤ + ((1_n − y)/2) c_-⊤   (since y ∈ {−1, 1}^n)
  = (1/2) y (c_+ − c_-)⊤ + (1/2) 1_n (c_+ + c_-)⊤.

The objective of K-means can now be written as the problem KM:

min_{y, c_+, c_-} ‖X − (1/2) y (c_+ − c_-)⊤ − (1/2) 1_n (c_+ + c_-)⊤‖²_F
= min_{y, c_+, c_-} ‖X − ((y + 1_n)/2) c_+⊤ − ((1_n − y)/2) c_-⊤‖²_F
= min_{y, c_+, c_-} ‖X‖²_F + ‖c_+‖² ‖(y + 1_n)/2‖² + ‖c_-‖² ‖(1_n − y)/2‖² + 2 c_-⊤c_+ ((y + 1_n)/2)⊤((1_n − y)/2) − 2 tr X⊤(((y + 1_n)/2) c_+⊤ + ((1_n − y)/2) c_-⊤)
= min_{y, c_+, c_-} ‖X‖²_F + (1/2) ‖c_+‖² (n + 1_n⊤y) + (1/2) ‖c_-‖² (n − 1_n⊤y) − c_+⊤X⊤(y + 1_n) − c_-⊤X⊤(1_n − y),

where the cross term vanishes because ((y + 1_n)/2)⊤((1_n − y)/2) = 0 for y ∈ {−1, 1}^n. Fixing y and minimizing with respect to c_+ and c_-, we get the closed-form expressions

c_+ = X⊤(y + 1_n)/(n + 1_n⊤y)   and   c_- = X⊤(1_n − y)/(n − 1_n⊤y).
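The closed-form centroids above are exactly the two cluster means, and the trace identity derived next can be checked numerically on an arbitrary random centered design (an illustrative sanity check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 4
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                       # centered design matrix
y = rng.choice([-1.0, 1.0], size=n)

# Closed-form centroids from the derivation above.
c_plus = X.T @ (y + 1) / (n + y.sum())
c_minus = X.T @ (1 - y) / (n - y.sum())

# K-means approximation of X with these centroids, and the two distortions.
approx = np.outer((y + 1) / 2, c_plus) + np.outer((1 - y) / 2, c_minus)
lhs = np.linalg.norm(X - approx) ** 2
rhs = np.trace(X @ X.T @ (np.eye(n) - n * np.outer(y, y) / (n ** 2 - y.sum() ** 2)))
```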
Substituting these expressions in KM, we obtain the following optimization problem in y:

min_y ‖X‖²_F − ‖X⊤(y + 1_n)‖²/(2(n + 1_n⊤y)) − ‖X⊤(1_n − y)‖²/(2(n − 1_n⊤y))
= min_y ‖X‖²_F − (1/(2(n + 1_n⊤y))) tr XX⊤(y + 1_n)(y + 1_n)⊤ − (1/(2(n − 1_n⊤y))) tr XX⊤(1_n − y)(1_n − y)⊤.

By the centering of X, we have 1_n⊤X = 0 and hence tr XX⊤ 1_n 1_n⊤ = tr XX⊤ 1_n y⊤ = tr XX⊤ y 1_n⊤ = 0. Therefore, we obtain

min_y tr XX⊤ (I − yy⊤ (1/(2(n + 1_n⊤y)) + 1/(2(n − 1_n⊤y)))) = min_y tr XX⊤ (I − n yy⊤/(n² − (1_n⊤y)²)).

Applying this to the projected data Xw, we obtain the equivalent K-means problem

min_{y ∈ {−1,1}^n} tr Xww⊤X⊤ (I − n yy⊤/(n² − (1_n⊤y)²)) / ‖Xw‖² = 1 − max_{y ∈ {−1,1}^n} (w⊤X⊤y)² / (‖Xw‖² (n − (y⊤1_n)²/n)).

Thus the averaged distortion measure of K-means with the optimized means is 1 − (y⊤Π_n Xw)²/(‖Π_n Xw‖² ‖Π_n y‖²), since X is centered and n − (y⊤1_n)²/n = ‖Π_n y‖².

Appendix B. Full (unsuccessful) relaxation

It is tempting to find a direct relaxation of Eq. (2). It turns out to lead to a trivial relaxation, which we outline in this section. When optimizing Eq. (2) with respect to w, we obtain

max_{y ∈ {−1,1}^n} (y⊤X(X⊤X)^{−1}X⊤y)/(y⊤Π_n y),

leading to the quasi-convex relaxation

max_{Y ≽ 0, diag(Y) = 1_n} tr Y X(X⊤X)^{−1}X⊤ / tr Π_n Y.

Unfortunately, this relaxation always leads to trivial solutions, as described below.
Consider the quasi-convex relaxation

max_{Y ≽ 0, diag(Y) = 1_n} tr Y X(X⊤X)^{−1}X⊤ / tr Π_n Y.   (23)

By definition of Π_n, this relaxation is equal to

max_{Y ≽ 0, diag(Y) = 1_n} tr Y X(X⊤X)^{−1}X⊤ / (n − 1_n⊤Y 1_n/n).

Let A = {Y ≽ 0, diag(Y) = 1_n} be the feasible set of this problem and define B = {M ≽ 0, diag(M) = (1 + 1_n⊤M 1_n/n²) 1_n}. Let Y ∈ A; then M defined by M = Y/(1 − 1_n⊤Y 1_n/n²) belongs to B, since 1 + 1_n⊤M 1_n/n² = 1/(1 − 1_n⊤Y 1_n/n²), which equals every diagonal entry of M. Reciprocally, for M ∈ B we can define Y = M/(1 + 1_n⊤M 1_n/n²), so that diag(Y) = 1_n and Y ∈ A, and then verify that M = Y/(1 − 1_n⊤Y 1_n/n²). Thus the problem in Eq. (23) is equivalent to the relaxation

max_{M ≽ 0, diag(M) = (1 + 1_n⊤M 1_n/n²) 1_n} (1/n) tr M X(X⊤X)^{−1}X⊤.   (24)

The Lagrangian function of this problem can be written as

L(µ) = tr M X(X⊤X)^{−1}X⊤ − µ⊤[diag(M) − 1_n − (1_n⊤M 1_n/n²) 1_n]
     = tr M [X(X⊤X)^{−1}X⊤ − Diag(µ) + (1_n⊤µ/n²) 1_n 1_n⊤] + µ⊤1_n.

Using L(µ) and the PSD constraint M ≽ 0, the dual problem is given by

min_µ µ⊤1_n   s.t.   Diag(µ) − (1_n⊤µ/n²) 1_n 1_n⊤ ≽ X(X⊤X)^{−1}X⊤.

Since X(X⊤X)^{−1}X⊤ ≽ 0, this implies for the dual variable µ:

Diag(µ) − (1_n⊤µ/n²) 1_n 1_n⊤ ≽ 0   ⇔   (1_n⊤µ/n²) 1_n⊤ Diag(µ)^{−1} 1_n ≤ 1   ⇔   (Σ_{i=1}^n µ_i)(Σ_{i=1}^n 1/µ_i) ≤ n².

However, for ν ∈ R^n_+, the harmonic mean [(1/n) Σ_{i=1}^n 1/ν_i]^{−1} is always smaller than the arithmetic mean (1/n) Σ_{i=1}^n ν_i, with equality if and only if ν = c 1_n for some c ∈ R. Thus the dual variable µ is constant, and the diagonal constraint reduces to a trace constraint. Therefore the problem is equivalent to the trivial relaxation

max_{M ≽ 0, tr(M) = n + 1_n⊤M 1_n/n} tr M X(X⊤X)^{−1}X⊤,

of which every eigenvector of X(X⊤X)^{−1}X⊤ generates a solution.

Appendix C. Equivalent relaxation

C.1 First equivalent relaxation

We start from the penalized version of Eq.
(5),

min_{y ∈ {−1,1}^n, v ∈ R^d} (1/n) ‖Π_n y − Xv‖² + ν (y⊤1_n)²/n²,   (25)

which we expand as

min_{y ∈ {−1,1}^n, v ∈ R^d} (1/n) tr Π_n yy⊤ − (2/n) tr Xvy⊤ + (1/n) tr X⊤X vv⊤ + ν (y⊤1_n)²/n²,   (26)

and relax, using Y = yy⊤, P = yv⊤ and V = vv⊤, as

min_{V, P, Y} (1/n) tr Π_n Y − (2/n) tr P⊤X + (1/n) tr X⊤XV + ν 1_n⊤Y 1_n/n²   s.t.   [Y, P; P⊤, V] ≽ 0, diag(Y) = 1_n.   (27)

When optimizing Eq. (27) with respect to V and P, we get exactly Eq. (8). Indeed, we solve this problem by fixing the matrix Y at a value Y_0 with diag(Y_0) = 1_n. The Lagrangian function of the problem in Eq. (27) can then be written as

L(A) = (1/n) tr Π_n Y − (2/n) tr P⊤X + (1/n) tr X⊤XV + ν 1_n⊤Y 1_n/n² + tr A(Y − Y_0)
     = tr [Y, P; P⊤, V] [(1/n)Π_n + (ν/n²) 1_n 1_n⊤ + A, −(1/n)X; −(1/n)X⊤, (1/n)X⊤X] − tr A Y_0.

Using L(A) and the PSD constraint [Y, P; P⊤, V] ≽ 0, we write the dual problem as

min_A tr A Y_0   s.t.   [(1/n)Π_n + (ν/n²) 1_n 1_n⊤ + A, −(1/n)X; −(1/n)X⊤, (1/n)X⊤X] ≽ 0.

From the Schur complement condition of this constraint, we obtain (1/n)Π_n + (ν/n²) 1_n 1_n⊤ + A ≽ (1/n) X(X⊤X)^{−1}X⊤. Substituting the bound for A, we get the optimal objective value

D∗ = (1/n) tr X(X⊤X)^{−1}X⊤ Y_0 − (1/n) tr Π_n Y_0 − (ν/n²) 1_n⊤Y_0 1_n.

Note that the optimal dual objective value D∗ corresponds to a fixed Y_0. Hence, maximizing with respect to Y_0 yields exactly Eq. (8), and therefore the convex relaxation in Eq. (11) is equivalent to Eq. (8). Moreover, the Karush-Kuhn-Tucker (KKT) conditions give −P⊤X + V X⊤X = 0 and −Y X + P X⊤X = 0. Thus the optimum is attained for P = Y X(X⊤X)^{−1} and V = (X⊤X)^{−1} X⊤ Y X(X⊤X)^{−1}.

C.2 Second equivalent relaxation

For ν = 1, we solve the problem in Eq. (27) by fixing the matrix V at a value V_0.
Then the Lagrangian function of this problem can be written as

L̂(μ, B) = (1/n)tr[Π_nY] − (2/n)tr[P^⊤X] + (1/n)tr[X^⊤XV] + 1_n^⊤Y1_n/n² + μ^⊤(diag(Y) − 1_n) + tr[B(V − V₀)]
        = ⟨ [Y, P; P^⊤, V], [(1/n)I_n + Diag(μ), −(1/n)X; −(1/n)X^⊤, (1/n)X^⊤X + B] ⟩ − μ^⊤1_n − tr[BV₀],

where we have used Π_n/n + 1_n1_n^⊤/n² = I_n/n for ν = 1. Using L̂(μ, B) and the PSD constraint [Y, P; P^⊤, V] ⪰ 0, the dual problem is given by

min_{μ,B} μ^⊤1_n + tr[BV₀]  s.t.  [(1/n)I_n + Diag(μ), −(1/n)X; −(1/n)X^⊤, (1/n)X^⊤X + B] ⪰ 0.

From the Schur complement condition of this block matrix, we obtain

B ⪰ (1/n²)X^⊤Diag(μ + 1_n/n)^{−1}X − (1/n)X^⊤X.

Substituting this bound for B, we get the dual problem

min_μ μ^⊤1_n + (1/n²)tr[V₀X^⊤Diag(μ + 1_n/n)^{−1}X] − (1/n)tr[V₀X^⊤X]
= min_μ Σ_{i=1}^n ( μ_i + (1/n²) x_i^⊤V₀x_i/(μ_i + 1/n) ) − (1/n)tr[V₀X^⊤X].

Solving for μ_i, we get μ_i* = (1/n)√(x_i^⊤V₀x_i) − 1/n. Substituting μ_i* into the dual objective function, we get the optimal objective value

D̂ = (2/n)Σ_{i=1}^n √((XV₀X^⊤)_{ii}) − 1 − (1/n)tr[V₀X^⊤X].

Furthermore, the KKT conditions give

Y Diag(μ + 1_n/n) − (1/n)PX^⊤ = 0  and  P^⊤Diag(μ + 1_n/n) − (1/n)VX^⊤ = 0.

Thus we obtain the following closed-form expressions:

P = Diag(diag(XV₀X^⊤))^{−1/2}XV₀,
Y = Diag(diag(XV₀X^⊤))^{−1/2}XV₀X^⊤Diag(diag(XV₀X^⊤))^{−1/2}.

The optimal dual objective value D̂ corresponds to a fixed V₀. Therefore, maximizing with respect to V₀ leads to the problem:

min_{V ⪰ 0}  −(2/n)Σ_{i=1}^n √((XVX^⊤)_{ii}) + (1/n)tr(VX^⊤X).   (28)

Appendix D. Auxiliary results for Section 5.1

D.1 Auxiliary lemma

The matrix X(X^⊤X)^{−1}X^⊤ has the following properties (see, e.g., Freedman, 2009).

Lemma 10 The matrix H = X(X^⊤X)^{−1}X^⊤ is the orthogonal projection onto the column space of the design matrix X:
− H is symmetric;
− H is idempotent: H² = H;
− X is invariant under H, that is, HX = X.

D.2 Rank-one solution of the relaxation Eq. (8)

We denote by (x_i)_{i=1,...,n} the rows of X.
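The three properties of Lemma 10 are straightforward to check numerically. Below is a minimal pure-Python sketch (the 3-point, one-column design matrix is a hypothetical example, not data from the paper); it also illustrates the trace inequality tr(HM) ≤ tr(M) for M ⪰ 0 used in the next lemma, with M = I₃.

```python
# Sanity check of Lemma 10 for a hypothetical one-column design matrix X
# (d = 1), for which H = X (X^T X)^{-1} X^T = X X^T / ||X||^2.

x = [1.0, 2.0, 2.0]                       # one-column design, n = 3
nrm2 = sum(v * v for v in x)              # X^T X (a scalar here)
H = [[xi * xj / nrm2 for xj in x] for xi in x]

def matmul(A, B):
    """Plain-Python product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# (i) H is symmetric.
sym_err = max(abs(H[i][j] - H[j][i]) for i in range(3) for j in range(3))
# (ii) H is idempotent: H^2 = H.
H2 = matmul(H, H)
idem_err = max(abs(H2[i][j] - H[i][j]) for i in range(3) for j in range(3))
# (iii) X is invariant under H: H X = X.
Hx = [sum(H[i][k] * x[k] for k in range(3)) for i in range(3)]
inv_err = max(abs(Hx[i] - x[i]) for i in range(3))
# tr(H I_3) = rank(H) = 1 <= tr(I_3) = 3, as in the proof below.
trH = sum(H[i][i] for i in range(3))
```

Any orthogonal projection satisfies tr(HM) ≤ tr(M) for M ⪰ 0 because its eigenvalues lie in {0, 1}, which is exactly the mechanism of the next proof.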
Lemma 11 The rank-one matrix Y* = yy^⊤ is always a solution of the relaxation Eq. (8).

Proof We give an elementary proof of this result without using convex-optimization tools. Using Lemma 10 we have Hy = y, thus

tr[HY*] = tr[Hyy^⊤] = tr[yy^⊤] = n.

Moreover, any M ⪰ 0 can be written M = Σ_{i=1}^n λ_i u_iu_i^⊤ with λ_i ≥ 0 and (u_i)_{i=1,...,n} an orthonormal family. Since H is an orthogonal projection, u_i^⊤Hu_i = (Hu_i)^⊤Hu_i = ‖Hu_i‖² ≤ ‖u_i‖² ≤ 1. Thus

tr[HM] = Σ_{i=1}^n λ_i tr[Hu_iu_i^⊤] = Σ_{i=1}^n λ_i u_i^⊤Hu_i ≤ Σ_{i=1}^n λ_i = tr M.

Then for every feasible M we have tr[HM] ≤ n, since diag(M) = 1_n implies tr M = n, while tr[HY*] = n, which concludes the proof of the lemma.

D.3 Rank-one solution of the relaxation Eq. (12)

Lemma 12 The rank-one matrix V* = vv^⊤ is always a solution of the relaxation Eq. (12).

Proof The Karush–Kuhn–Tucker (KKT) optimality conditions for the problem are, for the dual variable A,

(1/n)Σ_{i=1}^n x_ix_i^⊤/√(x_i^⊤Vx_i) − (1/n)X^⊤X = A  and  AV = 0  (complementary slackness).

Since x_i^⊤w = y_i, we have √(x_i^⊤V*x_i) = |y_i| = 1, so V* and the dual variable A = 0 satisfy the KKT conditions, and V* is a solution of this problem.

D.4 Proof of Proposition 2

In the following lemma, we use a Taylor expansion to lower-bound f around its minimum.

Lemma 13 Let d ≥ 3 and δ ∈ [0,1]. If β ≥ 3 and m₂² ≤ (β−3)/(2(d+β−4)), then with probability at least 1 − d² exp(−(1−δ)²nm₂²/(2R⁴d²)), for any symmetric matrix ∆:

f(V*) − f(V*+∆) > (1−δ) m₂² ‖∆‖_F² + o(‖∆‖²) ≥ 0.

Otherwise, with probability at least 1 − d² exp(−(1−δ)²nμ₋/(4R⁴d²)), for any symmetric matrix ∆:

f(V*) − f(V*+∆) > (1−δ)(μ₋/2)‖∆‖_F² + o(‖∆‖²) ≥ 0,  with μ₋ ≥ m₂²(β−1)/(1 + (d+β−2)m₂²).

Moreover, we also have, with probability at least 1 − d² exp(−(1−δ)²nμ₂/(4R⁴d²)), for any symmetric matrix ∆ ∈ ∆min^⊥:

f(V*) − f(V*+∆) > (1−δ)(μ₂/2)‖∆‖_F² + o(‖∆‖²) ≥ 0,

where μ₂ = min{2m₂, m₂²(β−1), 2m₂²} and ∆min = [1, 0; 0, c_min I_{d−1}] is defined in the proof and satisfies |c_min| ≤ m₂/|(d+β−2)m₂² − 1|.
This lemma directly implies Proposition 2.

Proof For ∆ ∈ S(d) and δ ∈ R we compute, for f(V) = (1/n)Σ_{i=1}^n √(x_i^⊤Vx_i),

d²/dδ² f(V + δ∆) = −(1/(4n)) Σ_{i=1}^n (x_i^⊤∆x_i)² / (x_i^⊤(V+δ∆)x_i)^{3/2}.

Thus the second directional derivative at V = V* along ∆ is, using x_i^⊤V*x_i = 1,

∇²f(V*)[∆,∆] = lim_{δ→0} d²/dδ² f(V + δ∆) = −(1/(4n)) Σ_{i=1}^n (x_i^⊤∆x_i)².

Let T_x be the positive-semidefinite quadratic form on S(d) defined, for ∆ ∈ S(d), by

T_x : ∆ ↦ (x^⊤∆x)².   (29)

Then there exists a positive linear operator T̄_x from S(d) to S(d) such that T_x(∆) = ⟨∆, T̄_x∆⟩. Therefore the function f will be strictly concave if, for all directions ∆ ∈ S(d),

(1/n) Σ_{i=1}^n T_{x_i}(∆) > 0.   (30)

We will bound the empirical expectation in Eq. (30) by first showing that its expectation stays away from 0; then we will use a concentration inequality for matrices to control the distance between the sum in Eq. (30) and its expectation.

We first derive conditions so that the result is true in expectation, i.e., for the operator T defined by T = E T_x, with x distributed as (y, z^⊤)^⊤. We denote by m₂ = E z² and by β = E z⁴/m₂² its kurtosis.

We let ∆ = [a, b^⊤; b, C] and then have x^⊤∆x = a + 2yb^⊤z + z^⊤Cz. Thus

T_x(∆) = a² + 4ayb^⊤z + 2az^⊤Cz + 4b^⊤(zz^⊤)b + (z^⊤Cz)² + 4yb^⊤z(z^⊤Cz).

Therefore we can express the value of the operator T only as a function of the elements of ∆:

T(∆) = (a + m₂ tr C)² + 4m₂‖b‖² + 2m₂²‖C − Diag(diag(C))‖_F² + m₂²(β−1)‖diag(C)‖²,

where we have used

E(z^⊤Cz)² = E Σ_{i,j,k,l} z_iz_jz_kz_l c_{i,j}c_{k,l}
= E Σ_i z_i⁴ c_{i,i}² + E Σ_{i≠k} z_i²z_k² c_{i,i}c_{k,k} + 2E Σ_{i≠j} z_i²z_j² c_{i,j}²
= βm₂² Σ_i c_{i,i}² + m₂² Σ_{i≠k} c_{i,i}c_{k,k} + 2m₂² Σ_{i≠j} c_{i,j}²
= m₂²(β−3) Σ_i c_{i,i}² + m₂² Σ_{i,k} c_{i,i}c_{k,k} + 2m₂² Σ_{i,j} c_{i,j}²
= m₂²(β−3)‖diag(C)‖² + m₂²( tr(C)² + 2‖C‖_F² )
= m₂²(β−1)‖diag(C)‖² + m₂²( 2‖C − Diag(diag(C))‖_F² + tr(C)² ).
Since β ≥ 1, we get

T(∆) ≥ (a + m₂ tr C)² + 4m₂‖b‖² + 2m₂²(‖C‖_F² − ‖diag(C)‖²).

Thus T(∆) = 0 if and only if β = 1, with b = 0_{d−1} and C = Diag(c) such that c^⊤1_{d−1} = −a/m₂. The condition β = 1 means that var(z²) = 0, thus z² is constant a.s., i.e., z follows a (scaled) Rademacher law.

However, we would like to bound T(∆) away from zero by some constant, and for that we are looking for the smallest eigenvalue of the operator E T̄_x. Unfortunately we are not able to solve the optimization problem

min_{∆ ∈ S(d), ‖∆‖_F = 1} T(∆)

directly, and we have to compute the whole spectrum of this operator to be able to find the smallest eigenvalue, using E T̄_x∆ = (1/2)∇T(∆). We have

(1/2)∇T(∆) = [ a + m₂ tr(C),  2m₂ b^⊤ ;  2m₂ b,  m₂(a + m₂ tr(C))I_{d−1} + 2m₂²C + m₂²(β−3)Diag(diag(C)) ].

− For all b ∈ R^{d−1}, we have for ∆ = [0, b^⊤; b, 0]: (1/2)∇T(∆) = 2m₂∆. Thus 2m₂ is an eigenvalue of multiplicity d−1.
− For all C ∈ R^{(d−1)×(d−1)} with diag(C) = 0_{d−1}, we have for ∆ = [0, 0; 0, C]: (1/2)∇T(∆) = 2m₂²∆. Thus 2m₂² is an eigenvalue of multiplicity (d−1)(d−2)/2.
− For all c ∈ R^{d−1} with c^⊤1_{d−1} = 0, we have for ∆ = [0, 0; 0, Diag(c)]: (1/2)∇T(∆) = m₂²(β−1)∆. Thus m₂²(β−1) is an eigenvalue of multiplicity d−2.
− For all a, c ∈ R, we have for ∆ = [a, 0; 0, cI_{d−1}]:

(1/2)∇T(∆) = [ a + (d−1)m₂c, 0 ; 0, (m₂a + (d+β−2)m₂²c)I_{d−1} ].

Thus an eigenvalue of the 2×2 matrix

[ 1, (d−1)m₂ ; m₂, (d+β−2)m₂² ]

with eigenvector [a, c]^⊤ is an eigenvalue of the operator E T̄_x with corresponding eigenvector [a, 0; 0, cI_{d−1}]. This matrix has two simple eigenvalues

μ± = ( 1 + (d+β−2)m₂² ± √((1 + (d+β−2)m₂²)² − 4m₂²(β−1)) ) / 2.
(31)

Moreover, when we add up the multiplicities of the eigenvalues found, we get (d−1) + (d−1)(d−2)/2 + (d−2) + 2 = d(d+1)/2, which is the dimension of S(d); therefore we have found all the eigenvalues of the linear operator E T̄_x.

We will now prove that the smallest eigenvalue is μ₋ when the dimension d is large enough with regard to m₂, and 2m₂² otherwise.

Lemma 14 Let μ₁ and μ₂ be the two smallest eigenvalues of the operator E T̄_x, and assume d ≥ 3 (the case d = 2 is also treated in the proof). If β ≥ 3 and m₂² ≤ (β−3)/(2(d+β−4)), then μ₁ = 2m₂². Otherwise μ₁ = μ₋ ≥ m₂²(β−1)/(1 + (d+β−2)m₂²), and μ₂ = min{2m₂, m₂²(β−1), 2m₂²}. Moreover, we denote by ∆min = [1, 0; 0, c_min I_{d−1}] the eigenvector associated with μ₋, for which we have set without loss of generality the first component a = 1; then |c_min| ≤ m₂/|(d+β−2)m₂² − 1|.

Unfortunately, μ₋ can become small when the dimension increases, as shown by the tight bound μ₋ ≥ m₂²(β−1)/(1 + (d+β−2)m₂²). However, the corresponding eigenvector has a particular structure which we will be able to exploit.

Proof First we note that μ₋ ≤ m₂²(β−1) ≤ μ₊, and compute

μ₋ ≥ 2m₂²
⇔ 1 + (d+β−2)m₂² − √((1 + (d+β−2)m₂²)² − 4m₂²(β−1)) ≥ 4m₂²
⇔ 1 + (d+β−6)m₂² ≥ √((1 + (d+β−2)m₂²)² − 4m₂²(β−1))  [with 1 + (d+β−6)m₂² ≥ 0]
⇔ (1 + (d+β−6)m₂²)² ≥ (1 + (d+β−2)m₂²)² − 4m₂²(β−1)
⇔ −8m₂² − 8(d+β−4)m₂⁴ ≥ −4m₂²(β−1)
⇔ 2(d+β−4)m₂² ≤ β − 3
⇔ m₂² ≤ (β−3)/(2(d+β−4)).

− If d = 2, distinguishing the cases β ≤ 3 and β ≥ 3 in the two inequalities above shows that they are incompatible when β ≤ 3; hence μ₁ = 2m₂² only when β ≥ 3 and m₂² ≤ (β−3)/(2(β−2)).
− If d ≥ 3, the first inequality implies β ≥ 3 and m₂² ≤ (β−3)/(2(d+β−4)).

We denote by ∆min = [1, 0; 0, c_min I_{d−1}] the eigenvector for which we have set without loss of generality a = 1, with

c_min = −(1/(2(d−1)m₂)) [ √(((d+β−2)m₂² − 1)² + 4(d−1)m₂²) − ((d+β−2)m₂² − 1) ].

Consequently, using √(u² + v) ≤ |u| + v/(2|u|),

√(((d+β−2)m₂² − 1)² + 4(d−1)m₂²) ≤ |(d+β−2)m₂² − 1| + 2(d−1)m₂²/|(d+β−2)m₂² − 1|.
Therefore |c_min| ≤ m₂/|(d+β−2)m₂² − 1|.

We will now control the behavior of the empirical expectation by its expectation thanks to concentration of measure. By definition, T̄_x is a symmetric positive linear operator, as is its projection T̄_x^⊥ onto the orthogonal space of ∆min. We can thus apply the matrix Chernoff inequality of Tropp (2012, Theorem 5.1.1) to these two operators, using ‖T̄_x‖_op ≤ ‖xx^⊤‖_F² ≤ ‖x‖⁴ ≤ R⁴d². Then:

P( λmin( Σ_{k=1}^n T̄_{x_k} ) ≤ nδμ₁ ) ≤ d² [e^{−(1−δ)}δ^{−δ}]^{nμ₁/(2R⁴d²)} ≤ d² e^{−(1−δ)²nμ₁/(4R⁴d²)},
P( λmin( Σ_{k=1}^n T̄_{x_k}^⊥ ) ≤ nδμ₂ ) ≤ d² [e^{−(1−δ)}δ^{−δ}]^{nμ₂/(2R⁴d²)} ≤ d² e^{−(1−δ)²nμ₂/(4R⁴d²)}.

For m₂ = 1 and d ≥ 3,

μ₁ = μ₋ ≥ (β−1)/(d+β−1) ≥ min{ (β−1)/(2(β−1)), (β−1)/(2d) } = min{ 1/2, (β−1)/(2d) }.

D.5 Noise robustness for the one-dimensional balanced problem

We want a condition on ε such that the solution of the relaxation recovers the right y. We recall the dual problem of the relaxation Eq. (8):

min_μ μ^⊤1_n  s.t.  Diag(μ) ⪰ X(X^⊤X)^{−1}X^⊤.

The KKT conditions are:
− dual feasibility: Diag(μ) ⪰ X(X^⊤X)^{−1}X^⊤;
− primal feasibility: diag(Y) = 1_n and Y ⪰ 0;
− complementary slackness: Y[Diag(μ) − X(X^⊤X)^{−1}X^⊤] = 0.

For Y = yy^⊤ a rank-one matrix, the last condition implies Diag(μ)y = Hy and μ_i = (X(X^⊤X)^{−1}X^⊤y)_i / y_i.

For X = y + ε, we denote ỹ = y + ε; then X(X^⊤X)^{−1}X^⊤ = ỹỹ^⊤/‖ỹ‖² and X(X^⊤X)^{−1}X^⊤y = (ỹ^⊤y/‖ỹ‖²)ỹ. Thus μ_i = (ỹ^⊤y/‖ỹ‖²)(ỹ_i/y_i).

Assume that all the ỹ_iy_i have the same sign; without loss of generality, assume ỹ_iy_i > 0. By definition of μ, μ ≥ 0. To show dual feasibility we have to show that Diag(μ) ⪰ H, which is equivalent to Diag(ỹ_iy_i) ⪰ ỹỹ^⊤/(ỹ^⊤y), hence to

I_n ⪰ Diag(√(y_i/ỹ_i)) (ỹỹ^⊤/(ỹ^⊤y)) Diag(√(y_i/ỹ_i)),

whose right-hand side is rank-one with unique nonzero eigenvalue Σ_i y_iỹ_i / (ỹ^⊤y) = 1, which is obviously true. Conversely, if μ is dual feasible, then Diag(μ) ⪰ 0 and all the ỹ_iy_i have the same sign. Therefore we have shown that y is a solution of the relaxation Eq.
(8) if and only if all the ỹ_iy_i have the same sign. If ε and y are independent, this is equivalent to ‖ε‖_∞ ≤ 1.

D.6 The rank-one candidates are not solutions of the relaxation

We assume now that 1_n^⊤y ≠ 0, thus y ≠ Π_ny: we no longer have the same proportions in the two clusters. Π_ny takes the two values {πy₋, πy₊}, that is, by definition of Π_n, πy₊ = 1 − 1_n^⊤y/n and πy₋ = −1 − 1_n^⊤y/n. For V* defined as before, we get x_i^⊤V*x_i = (πy_i)², and with I± the sets of indices such that (Π_ny)_i = πy±, the KKT conditions for V = V* can be written as

(1/n)[ Σ_{i∈I₊} (1/πy₊ − 1) x_ix_i^⊤ + Σ_{i∈I₋} (−1/πy₋ − 1) x_ix_i^⊤ ] = A_n,  A_nV* = 0.

We check that, with n± = |I±|:

w^⊤A_nw = Σ_{i∈I₊} (1/πy₊ − 1)(πy₊)² + Σ_{i∈I₋} (−1/πy₋ − 1)(πy₋)²
= n₊(1/πy₊ − 1)(πy₊)² + n₋(−1/πy₋ − 1)(πy₋)²
= n₊πy₊ − n₋πy₋ − ( n₊(πy₊)² + n₋(πy₋)² )
= y^⊤Π_ny − (Π_ny)^⊤Π_ny = y^⊤Π_ny − y^⊤Π_ny = 0.

And A_n = (1/n)[ Σ_{i∈I₊} α₊ x_ix_i^⊤ + Σ_{i∈I₋} α₋ x_ix_i^⊤ ] with α₊ = 1/πy₊ − 1 and α₋ = −1/πy₋ − 1. Unfortunately α₊α₋ ≤ 0, and A_n does not necessarily have a sign. Even worse, we will show that E A is not negative semidefinite, which will conclude the proof since, by the law of large numbers, lim_{n→∞} A_n = E A. Assume that the proportions of the two clusters stay constant, with n± = ρ±n; then

E A = ρ₊α₊ [ (πy₊)², 0 ; 0, I_{d−1} ] + ρ₋α₋ [ (πy₋)², 0 ; 0, I_{d−1} ]

(using E zz^⊤ = I_{d−1}), and ρ₊α₊(πy₊)² + ρ₋α₋(πy₋)² = 0 since w^⊤A_nw = 0. Then, with s = 1_n^⊤y/n, we have ρ± = (1 ± s)/2, πy₊ = 1 − s and πy₋ = −1 − s, hence

ρ₊α₊ + ρ₋α₋ = ρ₊/πy₊ − ρ₋/πy₋ − (ρ₊ + ρ₋) = (1+s)/(2(1−s)) + (1−s)/(2(1+s)) − 1 = 2s²/(1 − s²) ≥ 0.
Thus E A = ( 2(1_n^⊤y)²/(n² − (1_n^⊤y)²) ) [0, 0; 0, I_{d−1}] is not negative semidefinite, and V* is not a solution of the relaxation Eq. (12).

Appendix E. Auxiliary results for the sparse extension

E.1 There is a rank-one solution of the relaxation Eq. (15)

Lemma 15 The rank-one matrix V* = v*v*^⊤ is a solution of the relaxation Eq. (15) if the design matrix X is such that (1/n)X^⊤X has all its diagonal entries less than one.

Proof The KKT conditions are

(1/n)Σ_{i=1}^n x_ix_i^⊤/√(x_i^⊤Wx_i) − λU − (1/n)X^⊤X = A  and  AW = 0,

with U such that U_{ij} = sign(W_{ij}) if W_{ij} ≠ 0 and U_{ij} ∈ [−1,1] otherwise. For V* = v*v*^⊤ this gives

A = (1+λ)(1/n)X^⊤X − λU − (1/n)X^⊤X = λ[ (1/n)X^⊤X − U ],  with U_{1,1} = 1 and U_{i,j} ∈ [−1,1] otherwise.

We check that AV* = 0. If the design matrix X satisfies assumption (A1), we can choose a subgradient U such that the dual variable A = 0, and thus V* is a solution. Otherwise, by properties of positive-semidefinite matrices, some diagonal entry of (1/n)X^⊤X is bigger than 1, which prevents A from being negative semidefinite, since the corresponding diagonal entry of (1/n)X^⊤X − U will be positive. This shows that V* does not solve the problem.

E.2 Proof of Proposition 6

Lemma 16 For δ ∈ [0,1], with probability at least 1 − d exp( −δ²n(β−1)/(2dR⁴(1/m₂² + β + d)) ), for any direction ∆ such that V* + ∆ ⪰ 0, we have:

g(V*) − g(V*+∆) > (1−δ)[ λ‖∆ − Diag(diag(∆))‖₁ + ((β−1)/((1/m₂² + β + d)(1+λ)))‖Diag(∆)‖² ] + o(‖∆‖²) ≥ 0.

Moreover, we also have, with probability at least 1 − d exp( −δ²nm₂²(β−1)/(2dR⁴) ), for any symmetric matrix ∆ such that V* + ∆ ⪰ 0 and Diag(∆) ∈ (e_min)^⊥:

g(V*) − g(V*+∆) > (1−δ)[ λ‖∆ − Diag(diag(∆))‖₁ + (m₂²(β−1)/(1+λ))‖Diag(∆)‖² ] + o(‖∆‖²) ≥ 0,

where e_min = [1, c_min1_{d−1}^⊤]^⊤ is defined in the proof and satisfies |c_min| ≤ m₂/|(d+β−2)m₂² − 1|.

E.2.1 Proof outline

We will investigate under which conditions on X the solution is unique, first for a deterministic design matrix.
We make the following deterministic assumptions on X, for δ, ζ ≥ 0 and a subspace S ⊂ R^d:

(A1) ‖(1/n)X^⊤X‖_∞ ≤ 1;
(A2) ‖(1/n)Z^⊤y‖_∞ ≤ δ;
(A3) ‖(1/n)Z^⊤Z − Diag(diag((1/n)Z^⊤Z))‖_∞ ≤ δ;
(A4) λ^S_min( (1/n)Σ_{i=1}^n x_i^⊙(x_i^⊙)^⊤ ) ≥ ζ > 0,

where ⊙ denotes the Hadamard (i.e., pointwise) product, x^⊙ = x ⊙ x, and λ^S_min is the minimum eigenvalue of a linear operator restricted to the subspace S. Then, with

g(V) = (2/n)Σ_{i=1}^n √(x_i^⊤Vx_i) − λ‖V‖₁ − (1/n)tr[X^⊤XV],

we can certify that g decreases around the solution V*.

Lemma 17 Assume that the design matrix satisfies assumptions (A1)–(A4). Then for every direction ∆ such that V* + ∆ ⪰ 0 and diag(∆) ∈ S, we have:

g(V*) − g(V*+∆) ≥ λ(1−δ)‖∆ − Diag(diag(∆))‖₁ + (ζ/(1+λ))‖Diag(∆)‖² + o(‖∆‖²) > 0.

Let us now assume that the coordinates (z_i)_{i=1,...,d} are i.i.d. with common distribution z, symmetric, with E z = E z³ = 0, E z² = m₂ ≤ 1, E z⁴/(E z²)² = β, and such that |z| ≤ R ≤ 1 almost surely. Then the matrix X satisfies assumption (A1) almost surely. Using multiple Hoeffding inequalities, we have:

Lemma 18 If z does not follow a Rademacher law, the design matrix X satisfies assumptions (A1)–(A4) with probability greater than 1 − d exp( −δ²n(β−1)/(2dR⁴(β+d)) ) for S = R^d, and with probability greater than 1 − d exp( −δ²n min{β−1, 1}/(2dR⁴) ) for S = [1, c_min1_{d−1}^⊤]^⊥, where c_min is defined in the proof and satisfies |c_min| ≤ m₂/|(d+β−2)m₂² − 1|.

This lemma concludes the proof of Proposition 6. We will now prove these two lemmas.

E.2.2 Proof of Lemma 17

Proof Since the dual variable A for the PSD constraint is 0 (see the proof of Lemma 15), this constraint is inactive and we may consider W = V* + ∆ with ∆ = [a, b^⊤; b, C] and C ⪰ 0, which is slightly more general than V* + ∆ ⪰ 0. We denote by f(W) = (2/n)Σ_{i=1}^n √(x_i^⊤Wx_i) − (1/n)tr[X^⊤XW] the smooth part of g. By Taylor–Young, we have for all W:

f(W) − f(W+∆) = −⟨f′(W), ∆⟩ − (1/2)⟨∆, f″(W)∆⟩ + o(‖∆‖²).
Thus:

g(W) − g(W+∆) = −⟨f′(W), ∆⟩ − (1/2)⟨∆, f″(W)∆⟩ + λ(‖W+∆‖₁ − ‖W‖₁) + o(‖∆‖²).

At W = V*, this gives, with X^⊤X = [n, y^⊤Z; Z^⊤y, Z^⊤Z]:

g(V*) − g(V*+∆) = −λ⟨(1/n)X^⊤X, ∆⟩ − (1/2)⟨∆, f″(V*)∆⟩ + λ(a + 2‖b‖₁ + ‖C‖₁) + o(‖∆‖²)
= λ[ 2‖b‖₁ − (2/n)b^⊤Z^⊤y + ‖C‖₁ − (1/n)tr(Z^⊤ZC) ] − (1/2)⟨∆, f″(V*)∆⟩ + o(‖∆‖²).

And with Hölder's inequality and assumption (A2),

2‖b‖₁ − (2/n)b^⊤Z^⊤y ≥ 2‖b‖₁(1 − ‖(1/n)Z^⊤y‖_∞) ≥ 2(1−δ)‖b‖₁.

Moreover, we will show in Lemma 19 that ‖C‖₁ − (1/n)tr(Z^⊤ZC) ≥ (1−δ)‖C − Diag(diag(C))‖₁, thus

g(V*) − g(V*+∆) ≥ λ(1−δ)( 2‖b‖₁ + ‖C − Diag(diag(C))‖₁ ) + o(‖∆‖²).   (32)

However, in Eq. (32) the lower bound vanishes for b = 0 and C diagonal; therefore we have to investigate second-order conditions, i.e., to show for ∆ = Diag(e) with e ∈ R^d that −⟨∆, f″(V*)∆⟩ > 0. And with assumption (A4),

−(1/2)(1+λ)⟨Diag(e), f″(V*)Diag(e)⟩ = (1/n)Σ_{i=1}^n (x_i^⊤Diag(e)x_i)² = (1/n)Σ_{i=1}^n ( Σ_{j=1}^d e_j(x_i^j)² )²
= (1/n)Σ_{i=1}^n e^⊤[x_i^⊙(x_i^⊙)^⊤]e ≥ λ^S_min( (1/n)Σ_{i=1}^n x_i^⊙(x_i^⊙)^⊤ )‖e‖² ≥ ζ‖e‖².

Thus we can conclude:

g(V*) − g(V*+∆) ≥ λ(1−δ)( 2‖b‖₁ + ‖C − Diag(diag(C))‖₁ ) + (ζ/(1+λ))‖e‖² + o(‖∆‖²).

E.2.3 Auxiliary lemma

Lemma 19 For every symmetric positive-semidefinite matrix C, we have under assumptions (A1) and (A3):

‖C‖₁ − tr[ (1/n)Z^⊤Z C ] ≥ (1−δ)‖C − Diag(diag(C))‖₁ ≥ 0.

Proof We denote Σ_n = (1/n)Z^⊤Z. We always have ‖C‖₁ − tr(Σ_nC) = tr[(S − Σ_n)C], where S_{i,j} = sign(C_{i,j}). If diag(C) > 0, then diag(S) = 1 and diag(S − Σ_n) ≥ 0 by assumption (A1), so that

tr[(S − Σ_n)C] = Σ_i C_{i,i}(S − Σ_n)_{i,i} + Σ_{i≠j} C_{i,j}(S − Σ_n)_{i,j} ≥ Σ_{i≠j} C_{i,j}(S − Σ_n)_{i,j},

since C_{i,i} ≥ 0. Moreover, |(Σ_n)_{i,j}| ≤ δ for i ≠ j by assumption (A3), and C_{i,j}S_{i,j} = |C_{i,j}|; therefore

tr[(S − Σ_n)C] ≥ Σ_{i≠j} |C_{i,j}|(1 − δ) = (1−δ)‖C − Diag(diag(C))‖₁ ≥ 0.
If there is a diagonal element of C equal to 0, then the corresponding row and column of C are also 0 (C being positive semidefinite), and we can consider the same problem after deleting the corresponding row and column of C and Σ_n.

E.2.4 Proof of Lemma 18

Proof We first show that the noise matrix Z satisfies assumptions (A2) and (A3). By Hoeffding's inequality, we have with probability 1 − 2exp(−δ²n/(2R²)):

(1/n)|Σ_{i=1}^n z_i^j| ≤ δ.

Then, since the law of z is symmetric, y_iz_i has the same law as z_i, and with probability 1 − 2d exp(−δ²n/(2R²)) the design matrix satisfies assumption (A2):

‖(1/n)Z^⊤y‖_∞ ≤ δ.

Likewise, we have with probability 1 − 2exp(−δ²n/(2R⁴)) that, for j ≠ j′,

|(1/n)Σ_{i=1}^n z_i^j z_i^{j′}| ≤ δ.

Thus we also have, with probability 1 − 2d² exp(−δ²n/(2R⁴)), that Z satisfies assumption (A3):

‖(1/n)Z^⊤Z − Diag(diag((1/n)Z^⊤Z))‖_∞ ≤ δ.

Thus, by a union bound, with probability at least 1 − 4d² exp(−δ²n/(2R²)), the matrix X satisfies assumptions (A1), (A2) and (A3).

We proceed as in the proof of Proposition 2 to show that X satisfies assumption (A4): we first derive a condition for the result to hold in expectation, then use a matrix concentration inequality to bound the empirical expectation. The argument is very similar, but we obtain a better scaling since ∆ is diagonal.

Using the same arguments as in the proof of Proposition 2, we have for the diagonal matrix ∆ = Diag(e), with e = (a, c^⊤)^⊤ ∈ R^d:

e^⊤E[x^⊙(x^⊙)^⊤]e = E(x^⊤∆x)² = (a + m₂c^⊤1_{d−1})² + m₂²(β−1)‖c‖² > 0  for β > 1.

We can show that m₂²(β−1) is an eigenvalue of multiplicity d−2 of the operator E x^⊙(x^⊙)^⊤, and that μ± defined in Eq. (31) are simple eigenvalues with eigenvectors e±. Thus

λmin(E x^⊙(x^⊙)^⊤) = ( 1 + (d+β−2)m₂² − √((1 + (d+β−2)m₂²)² − 4m₂²(β−1)) )/2 ≥ m₂²(β−1)/(1 + (d+β−2)m₂²),

and λ^{e₋^⊥}_min(E x^⊙(x^⊙)^⊤) = m₂²(β−1). Moreover,

λmax( x^⊙(x^⊙)^⊤ ) = (x^⊙)^⊤x^⊙ = Σ_{j=1}^d (x^j)⁴ ≤ dR⁴.
Thus we can apply the matrix Chernoff inequality of Tropp (2012), with μ_S = λ^S_min(E x^⊙(x^⊙)^⊤):

P( λ^S_min( (1/n)Σ_{i=1}^n x_i^⊙(x_i^⊙)^⊤ ) ≤ (1−δ)μ_S ) ≤ d e^{−δ²nμ_S/(2dR⁴)}.

Thus, with probability 1 − d exp(−δ²nμ₋/(2dR⁴)), the design matrix X satisfies assumptions (A1)–(A4) with ζ = (1−δ)μ₋ and S = R^d. And with probability 1 − d exp(−δ²n min{β−1, 1}/(2dR⁴)), the design matrix X satisfies assumptions (A1)–(A4) with ζ = (1−δ)min{β−1, 1} and S = e₋^⊥.

Appendix F. Proof of the multi-label results

We first prove Lemma 7.

Proof Let A ∈ R^{k×k} be symmetric positive semidefinite and such that diag(ỹAỹ^⊤) = 1_n; then

diag(ỹAỹ^⊤) = Σ_{i=0}^k a_{i,i}1_n + 2Σ_{i=1}^k a_{0,i}y_i + 2Σ_{1≤i<j≤k} a_{i,j} y_i ⊙ y_j.

Since a_{0,0} + Σ_{i=1}^k a_{i,i}α_i ≥ α_min Σ_{i=0}^k a_{i,i} = α_min, we must have α ≥ α_min. We have already seen that such a Y satisfies the constraint. The KKT conditions are: B = Diag(μ) − H − ν1_n1_n^⊤ ⪰ 0 and BY = 0. Since y_i = Π_ny_i + (y_i^⊤1_n/n)1_n,

Hy_i = HΠ_ny_i + (y_i^⊤1_n/n)H1_n = Π_ny_i = y_i − (1_n^⊤y_i/n)1_n.

Thus

HY = Σ_{i=1}^k a_i Hy_iy_i^⊤ = Σ_{i=1}^k a_i ( y_i − (1_n^⊤y_i/n)1_n ) y_i^⊤ = Σ_{i=1}^k a_i ( y_iy_i^⊤ − (1_n^⊤y_i/n)1_ny_i^⊤ ),

and tr(HY) = Σ_{i=1}^k a_i(n − nα_i) = n( (1 − a₀) − (α − a₀) ) = n(1 − α).

Furthermore, since 1_n^⊤diag(Y) = n and 1_n^⊤Y1_n = n²α, for μ = 1_n and ν = 1/n we get ⟨B, Y⟩ = n − n(1−α) − nα = 0. And since B = I_n − (1/n)1_n1_n^⊤ − H satisfies B² = B and B^⊤ = B, B is a symmetric projection and consequently positive semidefinite. Hence the primal variable Y and the dual variables μ = 1_n and ν = 1/n satisfy the KKT conditions; thus M is a solution of this problem.

Appendix G. Efficient optimization problem

G.1 Dual computation

We consider the following strongly-convex approximation of Eq. (21), augmented with the von Neumann entropy:

max_{V ⪰ 0}  (1/n)Σ_{i=1}^n √((XVX^⊤)_{ii}) − ‖Diag(c)VDiag(c)‖₁ − ε tr[ (A^{1/2}VA^{1/2}) log(A^{1/2}VA^{1/2}) ]
s.t.  tr(A^{1/2}VA^{1/2}) = 1.
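Before dualizing, the effect of the von Neumann entropy term can be illustrated in the diagonal case, where the smoothed maximization reduces to an ordinary softmax/log-sum-exp: for D = Diag(θ), the problem max_{σ ≥ 0, Σσ_i = 1} θ^⊤σ − εΣσ_i log σ_i is attained at σ_i ∝ e^{θ_i/ε}, with value ε log Σ_i e^{θ_i/ε}, which overestimates max_i θ_i by at most ε log d. A small pure-Python sketch, with hypothetical values of θ and ε:

```python
import math

# Diagonal illustration of the entropy-smoothed problem: for D = Diag(theta),
# the maximizer over the simplex is the softmax of theta / eps, and the
# optimal value is phi_eps(theta) = eps * log sum_i exp(theta_i / eps).

theta = [0.3, -1.2, 0.9, 0.0]   # hypothetical eigenvalues of D
eps = 0.1                       # smoothing parameter

t_max = max(theta)              # shift for numerical stability
Z = sum(math.exp((t - t_max) / eps) for t in theta)
sigma = [math.exp((t - t_max) / eps) / Z for t in theta]
phi = t_max + eps * math.log(Z)

# phi approximates the largest eigenvalue up to the eps * log d smoothing error.
gap = phi - t_max
```

The shift by max(theta) before exponentiating is the standard log-sum-exp stabilization; it leaves both sigma and phi unchanged mathematically while avoiding overflow for small ε.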
Introducing dual variables, we have

min_{u ∈ R^n_+, C: |C_{ij}| ≤ c_ic_j}  max_{V ⪰ 0}  (1/(2n))Σ_{i=1}^n ( u_i(XVX^⊤)_{ii} + 1/u_i ) − tr[CV] − ε tr[ (A^{1/2}VA^{1/2}) log(A^{1/2}VA^{1/2}) ]
s.t.  tr(A^{1/2}VA^{1/2}) = 1.

By fixing u and C, and letting Q = A^{1/2}VA^{1/2}, we can write the max problem as

max_{Q ⪰ 0}  tr[ A^{−1/2}( (1/(2n))X^⊤Diag(u)X − C )A^{−1/2} Q ] − ε tr[Q log(Q)]  s.t.  tr Q = 1.

This problem is of the form

max_{Q ⪰ 0}  tr[DQ] − ε Σ_{i=1}^d σ_i(Q) log σ_i(Q)  s.t.  tr Q = 1,

where D = A^{−1/2}( (1/(2n))X^⊤Diag(u)X − C )A^{−1/2} and σ_i(Q) denotes the i-th largest eigenvalue of the matrix Q. If we write D = UDiag(θ)U^⊤, with θ the vector of ordered eigenvalues of D, then it turns out that at optimality Q has the form Q = UDiag(σ)U^⊤, with σ the ordered vector of eigenvalues of Q. Therefore the above optimization problem can be cast in terms of σ as:

max_{σ ∈ R^d_+}  θ^⊤σ − ε Σ_{i=1}^d σ_i log σ_i  s.t.  Σ_{i=1}^d σ_i = 1.

The solution of this problem is σ_i = e^{θ_i/ε} / Σ_{j=1}^d e^{θ_j/ε}, which leads to the optimal value

φ_ε(θ) = ε log Σ_{i=1}^d e^{θ_i/ε}.

In terms of the original matrix variables, we have φ_ε(D) = ε log tr[e^{D/ε}]. Using the appropriate expansion of D, we have the overall optimization problem

min_{u ∈ R^n_+, C: |C_{ij}| ≤ c_ic_j}  (1/(2n))Σ_{i=1}^n 1/u_i + φ_ε( A^{−1/2}( (1/(2n))X^⊤Diag(u)X − C )A^{−1/2} ).   (34)

At optimality, we have

A^{1/2}VA^{1/2} = e^{ A^{−1/2}((1/(2n))X^⊤Diag(u)X − C)A^{−1/2} / ε } / tr[ e^{ A^{−1/2}((1/(2n))X^⊤Diag(u)X − C)A^{−1/2} / ε } ].

The error of approximation is at most ε log d, and the Lipschitz constant associated with the gradient of the function φ_ε(·) is 1/ε.

G.2 Algorithm details

We write the optimization problem Eq.
(34) as:

min_{u ∈ R^n_+, C}  F(u, C) + H(u, C),

where

H(u, C) = φ_ε( A^{−1/2}( (1/(2n))X^⊤Diag(u)X − C )A^{−1/2} )

is the smooth part and

F(u, C) = I_{C: |C_{ij}| ≤ c_ic_j} + (1/(2n))Σ_{i=1}^n 1/u_i

is the non-smooth part. The gradient ∇_u of H(u, C) with respect to u is

∇_u = diag( B^⊤ UDiag(σ)U^⊤ B ),

where B = (1/√(2n)) A^{−1/2}X^⊤ and UDiag(σ)U^⊤ = ∇φ_ε(D), and the gradient of H(u, C) with respect to C is

∇_C = −A^{−1/2} UDiag(σ)U^⊤ A^{−1/2}.

The Lipschitz constant L associated with the gradient ∇H(u, C) is

L = (2/ε) max( λmax(B^⊤B ⊙ B^⊤B), λmax(A^{−1})² ),   (35)

where λmax(M) denotes the maximum eigenvalue of the matrix M. Computing L takes O(max(n, d)³) time, and L needs to be computed only once, at the beginning of the algorithm.

The resulting FISTA procedure is described in Algorithm 1. Note that the FISTA procedure first computes intermediate iterates (ū^{k−1}, C̄^{k−1}) (Step 7, Algorithm 1) by taking descent steps along the respective gradient directions. Then two distinct problems in u and C (Steps 8 and 9 in Algorithm 1, respectively) are solved. The sub-problem in u (Step 8) can be efficiently solved using a Newton procedure followed by a thresholding step, as illustrated in Algorithm 2. The sub-problem in C (Step 9) can also be solved using a simple thresholding step.

Algorithm 1 FISTA algorithm to solve Eq. (34)
1: Input X.
2: Compute the Lipschitz constant L.
3: Let (u⁰, C⁰) be an arbitrary starting point.
4: Let (ū⁰, C̄⁰) = (u⁰, C⁰), t₁ = 1.
5: Set the maximum number of iterations to K.
6: for k = 1, 2, ..., K do  ⊲ The loop can also be terminated based on the duality gap.
7:   (ū^{k−1}, C̄^{k−1}) ← ( ū^{k−1} − (1/L)∇_u H(ū^{k−1}, C̄^{k−1}),  C̄^{k−1} − (1/L)∇_C H(ū^{k−1}, C̄^{k−1}) ).
8:   Obtain u^k = argmin_{u ∈ R^n_+} { (L/2)‖u − ū^{k−1}‖² + (1/(2n))Σ_{i=1}^n 1/u_i } by Algorithm 2.
9:   Obtain C^k = argmin_C { I_{C: |C_{ij}| ≤ c_ic_j} + (L/2)‖C − C̄^{k−1}‖_F² } by thresholding.
10:  t_{k+1} = ( 1 + √(1 + 4t_k²) )/2.
11:  (ū^k, C̄^k) = (u^k, C^k) + ((t_k − 1)/t_{k+1}) ( (u^k, C^k) − (u^{k−1}, C^{k−1}) ).
12: end for
13: Output (u^K, C^K).

Algorithm 2 Newton method to solve the u sub-problem
1: Input ū^{k−1}, n, L.
2: u⁰_i = max( ū^{k−1}_i, (1/(2nL))^{1/3} ), i = 1, 2, ..., n.
3: Set M to be the maximum number of Newton steps.
4: for t = 1, 2, ..., M do
5:   for i = 1, 2, ..., n do
6:     u^t_i = ( nL(u^{t−1}_i)³ ū^{k−1}_i + (3/2)u^{t−1}_i ) / ( nL(u^{t−1}_i)³ + 1 ).
7:   end for
8: end for
9: Output max(u^M, 0).
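For intuition, the scalar problem behind Algorithm 2 can be solved and checked directly: applying Newton's method to the stationarity condition L(u − ū) = 1/(2nu²) of min_{u>0} (L/2)(u − ū)² + 1/(2nu) simplifies to the closed-form update of Step 6. The pure-Python sketch below uses hypothetical values of L, n and ū and verifies the stationarity condition at the computed point.

```python
# One-coordinate sketch of the u sub-problem of Algorithm 2:
#     min_{u > 0}  (L/2) * (u - ubar)^2 + 1 / (2 * n * u).
# Newton's method on L*(u - ubar) - 1/(2*n*u^2) = 0 simplifies to the
# closed-form update used in Step 6 of Algorithm 2.

L_const = 1.0    # hypothetical Lipschitz constant
n = 1            # hypothetical number of examples
ubar = 1.0       # hypothetical gradient-step point

u = max(ubar, (1.0 / (2.0 * n * L_const)) ** (1.0 / 3.0))  # positive start
for _ in range(50):
    u3 = u ** 3
    u = (n * L_const * u3 * ubar + 1.5 * u) / (n * L_const * u3 + 1.0)

# Residual of the stationarity condition at the computed point.
residual = L_const * (u - ubar) - 1.0 / (2.0 * n * u * u)
```

Because the objective is strictly convex on u > 0 with an increasing derivative, the iteration converges to the unique positive minimizer; the final max(u, 0) in Algorithm 2 is then only a safeguard.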