Biconvex Clustering
Saptarshi Chakraborty and Jason Xu∗
Department of Statistics, University of California, Berkeley
Department of Statistical Science, Duke University
∗Correspondence to: [email protected]
Abstract
Convex clustering has recently garnered increasing interest due to its attractive theoretical and computational properties. While it confers many advantages over traditional clustering methods, its merits become limited in the face of high-dimensional data. In such settings, not only do Euclidean measures of fit appearing in the objective provide weaker discriminating power, but pairwise affinity terms that rely on k-nearest neighbors become poorly specified. We find that recent attempts which successfully address the former difficulty still suffer from the latter, in addition to incurring high computational cost and some numerical instability. To surmount these issues, we propose to modify the convex clustering objective so that feature weights are optimized jointly with the centroids. The resulting problem becomes biconvex, and as such remains well-behaved statistically and algorithmically. In particular, we derive a fast algorithm with closed-form updates and convergence guarantees, and establish finite-sample bounds on its prediction error that imply consistency. Our biconvex clustering method performs feature selection throughout the clustering task: as the learned weights change the effective feature representation, pairwise affinities can be updated adaptively across iterations rather than precomputed within a dubious feature space. We validate the contributions on real and simulated data, showing that our method effectively addresses the challenges of dimensionality while reducing dependence on carefully tuned heuristics typical of existing approaches.
Keywords: sparse clustering, variable weighing, convex optimization, feature selection, finite-sample error bounds.
1 Introduction

Clustering is a cornerstone of unsupervised learning that seeks to partition unlabeled data into groups according to some measure of similarity. Classical approaches such as k-means clustering (MacQueen et al., 1967) typically formulate the task as a non-convex optimization problem and seek a solution via a greedy algorithm. Such methods are often effective in practice and have endured due to their simplicity, but also suffer from well-documented shortcomings (Jain, 2010). These include requiring prior knowledge or tuning of the number of clusters k as input (Tibshirani et al., 2001; Hamerly and Elkan, 2004), as well as instability and sensitivity to initial conditions due to non-convexity (Ostrovsky et al., 2013; Xu and Lange, 2019), and deteriorating performance in high dimensions (Witten and Tibshirani, 2010; de Amorim, 2016; Chakraborty and Das, 2019).

While ongoing research continues to combat the shortcomings of algorithms such as k-means through effective seeding schemes and annealing techniques (Bachem et al., 2016; Xu and Lange, 2019), an alternative approach is to study convex relaxations of traditionally non-convex problems (Tropp, 2006). Convex clustering, also called sum-of-norms clustering, has recently garnered growing interest due to its attractive theoretical properties and viable algorithms (Pelckmans et al., 2005; Hocking et al., 2011; Lindsten et al., 2011). Given data X ∈ R^{n×p} and denoting the j-th row of a matrix Z as z_{j·}, the convex clustering objective is given by

$$\min_{\boldsymbol{\mu}} \;\; \sum_{i=1}^{n} \|x_{i\cdot} - \mu_{i\cdot}\|_2^2 \;+\; \gamma \sum_{i<j} \phi_{ij}\, \|\mu_{i\cdot} - \mu_{j\cdot}\|_q, \tag{1}$$
where the pairwise affinities satisfy φ_ij ≥ 0. The approach can be understood as a convex relaxation of hierarchical clustering: pairwise affinities φ_ij > 0 encourage the corresponding rows of µ to fuse. Indeed, the method entails a continuous solution path as a function of the parameter γ: a larger γ gives the fusion penalty more relative influence, resulting in fewer unique centers or clusters (Chi and Lange, 2015). Convexity ensures a unique global minimizer, suggesting stability of any valid iterative optimization algorithm regardless of initial condition. A convex formulation is also attractive from a theory perspective; works by Zhu et al. (2014), Tan and Witten (2015), and Radchenko and Mukherjee (2017) provide recovery guarantees, and Chi and Steinerberger (2019) establish conditions under which the solution path recovers a tree. Computationally, Chi and Lange (2015) provide a framework based on splitting methods that renders the work of Lindsten et al. (2011) practical in a unified framework for various choices of q defining the fusion penalty norm. The idea has been extended to many related tasks including tensors, metric versions, co-clustering, multi-view and histogram-valued data (Wu et al., 2016; Chi et al., 2018; Park et al., 2019; Wang and Allen, 2019).

Despite the recent success of convex clustering, two notable and related shortcomings are its limitations in high dimensions and its strong reliance on careful specification of the affinities φ_ij. Typically, users follow a recommendation that combines k-nearest neighbors (k-NN) and Gaussian kernels (Chi and Lange, 2015): that is,

$$\phi_{ij} \;=\; \mathbb{1}^{k}_{\{i,j\}}\, \exp\!\left\{ -\frac{\|x_{i\cdot} - x_{j\cdot}\|_2^2}{\tau} \right\}, \tag{2}$$

where the indicator 1^k_{\{i,j\}} is equal to 1 if x_{j·} is among the k nearest neighbors of x_{i·} with respect to ‖·‖₂ and 0 otherwise, and the constant τ is a bandwidth parameter. The success of such heuristics for choosing φ_ij varies depending on the application, and improper specification may lead to pathological behavior including splits in the clustering path or abrupt merging to the overall mean (Hocking et al., 2011; Chi and Lange, 2015). Further, this Gaussian "blurring" becomes ineffective in high dimensions, where pairwise Euclidean distances appearing in the kernel as well as k-NN evaluations become less informative (Aggarwal et al., 2001). Though a convex formulation ensures a unique global optimum, the instability it seeks to address is conserved in this sense: the method becomes fragile with respect to a good choice of φ_ij rather than the initial guess.

Toward addressing high dimensionality, a sparse variant of convex clustering has been developed by Wang et al. (2018), which includes an additional group lasso penalty on the columns of the centroid matrix as follows:

$$\min_{\boldsymbol{\mu}} \;\; \sum_{i=1}^{n} \|x_{i\cdot} - \mu_{i\cdot}\|_2^2 \;+\; \gamma_1 \sum_{i<j} \phi_{ij}\, \|\mu_{i\cdot} - \mu_{j\cdot}\|_q \;+\; \gamma_2 \sum_{l=1}^{p} u_l\, \|\mu_{\cdot l}\|_2. \tag{3}$$

Consider a toy dataset simulated from two ground-truth clusters, each with 100 associated points. The data are designed so that they can be discriminated along only the first two features. Next, we add twelve uninformative features generated from a standard normal distribution that serve only to decrease the signal-to-noise ratio. We study the results of various clustering algorithms by visualizing their solutions projected onto the two relevant dimensions. The performances of the different peer algorithms are shown in Figure 1. Our proposed method successfully recovers the ground truth (Figure 1a), while prior work that seeks a sparse solution, solves a convex relaxation, or both, fails due to the large number of noise variables.
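To make the setup concrete, the following sketch generates a toy dataset of the same flavor (two Gaussian clusters separated only along the first two features, plus standard-normal noise features) and computes the default k-NN Gaussian affinities of equation (2). The cluster means, spreads, neighborhood size, and bandwidth below are illustrative choices of ours, not the exact values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two clusters of 100 points each, separated only along the first two features;
# the means (+/-2, +/-2) and unit spread are illustrative choices.
n_per, p_noise = 100, 12
signal = np.vstack([rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(n_per, 2)),
                    rng.normal(loc=[+2.0, +2.0], scale=1.0, size=(n_per, 2))])
noise = rng.normal(size=(2 * n_per, p_noise))   # uninformative features
X = np.hstack([signal, noise])                  # shape (200, 14)

def knn_gaussian_affinities(X, k=5, tau=1.0, weights=None):
    """Affinities in the spirit of equation (2), optionally under feature weights."""
    w = np.ones(X.shape[1]) if weights is None else weights
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2 * w).sum(axis=2)            # squared (weighted) distances
    phi = np.exp(-d2 / tau)
    # keep phi_ij only when j is among the k nearest neighbors of i (symmetrized)
    order = np.argsort(d2, axis=1)
    mask = np.zeros_like(phi, dtype=bool)
    rows = np.arange(X.shape[0])[:, None]
    mask[rows, order[:, 1:k + 1]] = True        # position 0 is the point itself
    mask = mask | mask.T
    np.fill_diagonal(mask, False)
    return phi * mask

phi = knn_gaussian_affinities(X, k=5, tau=X.shape[1])
```

In this regime the squared distances entering the kernel are dominated by the noise coordinates, which is exactly the failure mode visualized in Figures 1 and 2; passing learned feature weights through the weights argument yields the adaptive affinities discussed later.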
We examine the performance of sparse convex clustering (Wang et al., 2018) in particular to further emphasize the effect of updating affinities by way of the learned feature weight vector. To understand why sparse convex clustering fails to recover the true cluster structure here despite attempting to perform feature selection, as evident in Figure 1d, we examine a heatmap of the affinities φ_ij computed under Euclidean distance as defined in (2). Figure 2a shows that because of the large relative influence of noise features, pairwise distances computed in the original Euclidean space provide only limited information, reflected in the noisy pattern of φ_ij in the left panel. In contrast, updating φ_ij according to (2) where distances are defined under the learned feature space induced by w reveals clear structure (Figure 2b). To further highlight this point, we may re-run sparse convex clustering provided with the affinities learned by our algorithm as inputs. The method is now able to recover the ground truth, displayed in Figure 1g.

Figure 1: Performance of different peer algorithms on a toy dataset, showing the efficacy of biconvex clustering in a low signal-to-noise-ratio setting. Panels: (a) Ground Truth, (b) Biconvex, (c) Convex Clustering, (d) Sparse Convex, (e) Sparse k-means, (f) Sparse Hierarchical, (g) Sparse Convex with learned affinities, (h) Solution paths of Biconvex Clustering.
Note that despite achieving a perfect clustering on this relatively simple synthetic data, the estimated centers are noticeably biased toward the origin. This behavior arises from applying shrinkage penalties directly to the centroids, and can negatively affect clustering accuracy in more realistic and challenging data settings.

The rest of the paper is organized as follows. Section 2 introduces the biconvex clustering formulation and establishes some intuition in relation to prior work. We next derive a coordinate descent algorithm with simple, closed-form updates toward solving the resulting optimization problem with respect to the centroids and feature weights. Notably, as the feature representation adapts under the learned feature weights, users have the option to update pairwise affinities φ_ij across iterations rather than rely on their a priori initialization throughout the clustering task. Next, the properties of the proposed method are studied closely in Section 3, where we begin by deriving finite-sample prediction bounds that imply consistency of the estimates. This analysis is followed by establishing convergence of the descent method and assessing the computational complexity and practical considerations of the algorithm. These results are validated empirically in Section 4, where performance is thoroughly assessed by standard clustering metrics as well as in terms of feature selection over several simulation studies. We then apply the method to several case studies in Section 5, including a high-dimensional movement data corpus and a DNA microarray study of human leukemia, followed by discussion.

Figure 2: Heatmaps of the affinities φ_ij show that default affinities (panel (a), as in equation (2)) can be misleading, while those learned by biconvex clustering (panel (b)) reveal the cluster structure of the toy dataset.

2 Biconvex Clustering

In this section, we present biconvex clustering and discuss the intuition behind the proposed formulation. We then derive a block coordinate descent algorithm for efficiently minimizing the resulting objective function.

2.1 The biconvex clustering formulation

Let X ∈ R^{n×p}, where the row x_{i·} contains the feature values of the i-th data point. Consider the objective function

$$f(\boldsymbol{\mu}, \mathbf{w}) \;=\; \sum_{i=1}^{n} \|x_{i\cdot} - \mu_{i\cdot}\|_{\mathbf{w}}^2 \;+\; \gamma \sum_{i \neq j} \phi_{ij}\, \|\mu_{i\cdot} - \mu_{j\cdot}\|_2 \quad \text{subject to } w_l \ge 0,\;\; \sum_{l=1}^{p} w_l = 1, \tag{5}$$

where the optimization variables (w, µ) denote the vector of feature weights and an n × p matrix whose rows contain the centroids, respectively. For a vector y ∈ R^p, we define the norm induced by w as ‖y‖²_w := ∑_{l=1}^{p} (w_l² + λw_l) y_l².
We see that the first component ∑_{i=1}^{n} ‖x_{i·} − µ_{i·}‖²_w of (5) assesses the fit between the centroids µ and the data X, measured in a way that is weighed by w (Huang et al., 2005). Note that if we fix all w_l to be equal, ‖y‖_w becomes a scalar multiple of the Euclidean norm ‖y‖₂, and minimizing (5) is then equivalent to convex clustering as a special case. Like convex clustering, the procedure begins with n centroids, one at each data point, which merge toward each other due to the fusion term ∑_{i≠j} φ_ij ‖µ_{i·} − µ_{j·}‖₂.

This fusion penalty is closely related to the fused lasso of Tibshirani et al. (2005), which penalizes a coefficient vector as well as its successive differences. Tibshirani et al. (2005) mention that they do not consider other values of the exponent α in such penalties because piecewise constant coefficient solutions under α = 1 are desirable and easily interpretable in their context. Other natural choices such as α = 2 have appeared much less frequently in the literature, in contexts such as precision estimation (Hebiri et al., 2011; Price et al., 2015; Lam et al., 2016). When such exact sparsity is not the end goal, however, such alternate choices may produce the desired behavior yet yield more elegant solutions. We will argue that this is the case in our present context.

The following sections show that squaring the norms appearing in the fusion penalty confers substantial computational advantages. This slight "ridge fusion" modification entails the resulting objective

$$f(\boldsymbol{\mu}, \mathbf{w}) \;=\; \sum_{i=1}^{n} \|x_{i\cdot} - \mu_{i\cdot}\|_{\mathbf{w}}^2 \;+\; \gamma \sum_{i \neq j} \phi_{ij}\, \|\mu_{i\cdot} - \mu_{j\cdot}\|_2^2 \quad \text{subject to } w_l \ge 0,\;\; \sum_{l=1}^{p} w_l = 1, \tag{9}$$

which now encourages centroids to merge toward each other by way of penalizing the quadratic variation between row pairs of µ rather than the total variation distance. Doing so no longer results in exact merging: subsets of rows now agglomerate closely rather than coalescing to a single point. Though the resulting cluster assignment under a naïve interpretation would be trivial (each point is assigned to its own centroid), we demonstrate how to obtain a nontrivial clustering from the solution to (9) analogous to that under exact merging in Section 3.2. In particular, assigning clusters in this way shares the same worst-case complexity as simply reading off the unique rows of the solution under (5).

2.2 Block coordinate descent updates

This section derives closed-form coordinate descent updates for the subproblems in µ and w. We begin by rewriting objective (9), expanding and decomposing the first term into two components as written below (still subject to the simplex constraint on w):

$$f(\boldsymbol{\mu}, \mathbf{w}) \;=\; \sum_{i=1}^{n}\sum_{l=1}^{p} w_l^2 (x_{il} - \mu_{il})^2 \;+\; \gamma \sum_{\substack{i,j=1 \\ i \neq j}}^{n} \sum_{l=1}^{p} \phi_{ij} (\mu_{il} - \mu_{jl})^2 \;+\; \lambda \sum_{l=1}^{p} w_l \sum_{i=1}^{n} (x_{il} - \mu_{il})^2. \tag{10}$$

The block coordinate descent updates for minimizing the objective (10) are given by the solutions to the following two subproblems:

• Problem P1: Fix w = w₀ and minimize f(µ, w₀) with respect to µ.
• Problem P2: Fix µ = µ₀ and minimize f(µ₀, w) with respect to w, subject to ∑_{l=1}^{p} w_l = 1 and w_l ≥ 0.

Solution to Problem P1. First, observe that if w_l = 0, one can simply choose µ_il = 0 for all i ∈ {1, …, n} in order to decrease the value of the objective function. Thus take w_l > 0; differentiating the objective function with respect to µ_il and setting the derivative to zero, we obtain

$$\mu_{il} \;=\; \frac{\gamma \sum_{j \neq i} \phi_{ij}\, \mu_{jl} + \gamma \sum_{j \neq i} \phi_{ji}\, \mu_{jl} + w_l^2 x_{il} + \lambda w_l x_{il}}{\gamma \sum_{j \neq i} \phi_{ij} + \gamma \sum_{j \neq i} \phi_{ji} + w_l^2 + \lambda w_l}. \tag{11}$$
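For concreteness, a minimal sketch of the centroid update (11) is given below, applying the closed form simultaneously to all entries of µ. It assumes a dense affinity matrix with zero diagonal and uses our own variable names; it is an illustration of the update rather than a reference implementation.

```python
import numpy as np

def update_centroids(X, mu, w, phi, gamma, lam):
    """One pass of the closed-form update (11) applied to every entry mu_il.

    X, mu : (n, p) arrays; w : (p,) simplex weights; phi : (n, n) affinities
    with zero diagonal. Other centroids are held at their current values.
    (Columns with w_l == 0 are handled separately in the text; not guarded here.)
    """
    # gamma * (sum_j phi_ij mu_jl + sum_j phi_ji mu_jl) for all (i, l) at once
    neighbor_sum = gamma * (phi @ mu + phi.T @ mu)               # (n, p)
    neighbor_tot = gamma * (phi.sum(axis=1) + phi.sum(axis=0))   # (n,)
    data_coef = w ** 2 + lam * w                                 # (p,)
    numer = neighbor_sum + data_coef * X
    denom = neighbor_tot[:, None] + data_coef[None, :]
    return numer / denom
```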
Fixing µ, the minimization over w is summarized below, with more details of its derivation in the Appendix.

Theorem 1. The solution to Problem P2 is given by

$$w_l^{*} \;=\; \frac{1}{2}\, S\!\left( \frac{\alpha^{*}}{\sum_{i=1}^{n} (x_{il} - \mu_{il})^2},\; \lambda \right), \tag{12}$$

where, for any y ≥ 0 and x ∈ R, S(x, y) denotes the soft-thresholding function defined by S(x, y) = x − y if x ≥ y, S(x, y) = x + y if x ≤ −y, and S(x, y) = 0 otherwise, and α* satisfies the equation ∑_{l=1}^{p} (1/2) S(α / ∑_{i=1}^{n} (x_{il} − µ_{il})², λ) = 1.

In particular, solutions to each of the subproblems are parameter-separated; that is, all computations can be carried out component-wise, and the univariate problems can be executed in parallel.

Algorithm 1: Biconvex Clustering algorithm (BCC)
Input: X ∈ R^{n×p}, λ > 0, γ > 0
Output: µ
Step 0. Initialize µ = X, w_l = 1/p for l = 1, …, p, and (optionally) φ_ij = exp{−‖x_{i·} − x_{j·}‖²} 1{j ∈ k-NN of i under ‖·‖₂} for all pairs i, j.
repeat
  Step 1. Update µ (coordinate-separable, can be parallelized) by
    µ_il ← [γ ∑_{j≠i} φ_ij µ_jl + γ ∑_{j≠i} φ_ji µ_jl + (w_l² + λw_l) x_il] / [γ ∑_{j≠i} φ_ij + γ ∑_{j≠i} φ_ji + w_l² + λw_l].
  Step 2. Update α* satisfying ∑_{l=1}^{p} S(α / ∑_{i=1}^{n} (x_il − µ_il)², λ) = 2 (e.g., via univariate bisection).
  Step 3. Update w (coordinate-separable, can be parallelized) by
    w_l ← (1/2) S(α* / ∑_{i=1}^{n} (x_il − µ_il)², λ).
  Step 4. (Optional) Update φ_ij = exp{−‖x_{i·} − x_{j·}‖²_w / p} 1{j ∈ k-NN of i under ‖·‖_w} for all pairs i, j.
until a convergence criterion based on objective (10) is reached

Before analyzing the properties and complexity of the proposed method, we pause to discuss notable differences from prior work. Recall that the sparse convex clustering objective (3) addresses high dimensionality by penalizing the columns of the centroid matrix µ directly. This approach is intuitive and nicely preserves convexity in the overall objective. However, the formulation can introduce significant shrinkage toward the origin, an effect that may lead to bias as well as spurious selection. Computationally, the solution method in sparse convex clustering modifies the same two approaches outlined in Chi and Lange (2015), namely sparse variants of the alternating minimization (AMA) and alternating directions (ADMM) methods. Computation becomes more complicated, however: one step within the alternating directions method requires an additional iterative algorithm of fitting p group lasso regressions in the simplest case when q = 2. Perhaps more troubling is that not only do the two implementations S-AMA and S-ADMM differ in speed, but they may lead to quite different clustering solutions (Wang et al., 2018). The stability properties that convex formulations originally aimed to provide thus do not carry over. In contrast, our biconvex formulation can be solved by alternating between the simple solutions to each subproblem given by (11) and (12). In a return to form reflecting the structure of classic methods such as Lloyd's algorithm, the proposed method marks a departure from the majority of existing work on convex clustering and its variants, which rely on variable splitting schemes that entail additional dual variables and step-size selection. This affords us a transparent and efficient block coordinate descent method, summarized in Algorithm 1.
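To complement the centroid update sketched earlier, the weight update of Theorem 1 reduces to soft-thresholding plus a univariate search for α*, solvable by bisection as in Step 2 of Algorithm 1. The sketch below uses our own naming, bracketing bound, and numerical guard; it is schematic rather than a reference implementation.

```python
import numpy as np

def soft_threshold(x, y):
    return np.sign(x) * np.maximum(np.abs(x) - y, 0.0)

def update_weights(X, mu, lam, tol=1e-10):
    """Solve Problem P2: w_l = (1/2) S(alpha / D_l, lam) with sum_l w_l = 1."""
    D = ((X - mu) ** 2).sum(axis=0)          # D_l = sum_i (x_il - mu_il)^2
    D = np.maximum(D, 1e-12)                 # guard against exact zeros

    def total(alpha):                        # sum_l (1/2) S(alpha / D_l, lam)
        return 0.5 * soft_threshold(alpha / D, lam).sum()

    # total(alpha) is nondecreasing in alpha >= 0; bracket the root of total(alpha) = 1
    lo, hi = 0.0, (lam + 2.0) * D.max()
    while total(hi) < 1.0:
        hi *= 2.0
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
    alpha = 0.5 * (lo + hi)
    return 0.5 * soft_threshold(alpha / D, lam)
```

Alternating update_centroids and update_weights, optionally recomputing affinities in the learned feature space as in Step 4, gives one full iteration of Algorithm 1.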
We establish convergence guarantees, finite-sample bounds that imply consistency, and desirable computational properties in the following section.

3 Properties of Biconvex Clustering

This section establishes the theoretical properties of the (global) optimal solutions of the proposed objective function by providing finite-sample bounds on the prediction error. We also analyze the computational cost of biconvex clustering and discuss its convergence properties.

3.1 Finite-sample prediction error bounds

We begin by analyzing statistical properties of the biconvex clustering technique by providing finite-sample error bounds for the prediction error. In particular, these bounds provide sufficient conditions for consistency of the centroid and weight estimates. Subject to the simplex constraint on w, recall the biconvex clustering objective

$$\min_{\boldsymbol{\mu}, \mathbf{w}} \left\{ \sum_{i=1}^{n} \|x_{i\cdot} - \mu_{i\cdot}\|_{\mathbf{w}}^2 + \gamma \sum_{i \neq j} \|\mu_{i\cdot} - \mu_{j\cdot}\|_2^2 \right\}. \tag{13}$$

Throughout this section, we will assume that the φ_ij are kept fixed, and for clarity of exposition we will proceed after vectorizing the problem setup. To this end, let x = vec(X) and u = vec(µ), where vec(·) denotes the function that flattens a matrix by appending its columns together. Now x, u ∈ R^{np}, with x_{(j−1)n+i} = X_ij and u_{(j−1)n+i} = µ_ij. Similarly, let W = diag(w_1² + λw_1, …, w_p² + λw_p) ⊗ I_n, where ⊗ denotes the Kronecker product. To deal with the penalty term, let D ∈ R^{[pn(n−1)/2] × np} be such that D_{C(i,j)} u = µ_{i·} − µ_{j·}, where C(i, j) is an index set selecting the rows of D associated with the pair (i, j). The optimization problem (13) can now be rewritten as

$$\min_{u \in \mathbb{R}^{np}} \left\{ (x - u)^\top W (x - u) + \gamma \sum_{i \neq j} \|D_{C(i,j)} u\|_2^2 \right\}. \tag{14}$$

Theorem 2. Suppose x = u₀ + ε, where ε ∈ R^{np} is a vector of independent sub-Gaussian random variables with mean 0 and variance σ². Let û and Ŵ be the solutions minimizing (14). Then for γ′ > σ²(1 + λ)√(log(pn²)/(np)), the bound

$$\frac{1}{np}\,\|\hat{u} - u_0\|_{\hat{W}}^2 \;\le\; \sigma^2(1+\lambda)\left[\frac{1}{n} + \sqrt{\frac{\log(np)}{n^2 p}}\right] + \gamma'\left(1 - \frac{1}{n}\right) + \frac{\gamma'}{n}\sum_{i\neq j}\|D_{C(i,j)}u_0\|_2 + \gamma'\sum_{i\neq j}\|D_{C(i,j)}u_0\|_2^2$$

holds with probability at least 1 − (pn²)^{−1} − exp(−min{c₁ log(np), c₂ √(p log(np))}).

The complete details of the proof are provided in the Appendix. We pause to examine the result and discuss its implications. We see from Theorem 2 that the average prediction error is bounded by the oracle terms

$$\gamma'\left(1 - \frac{1}{n}\right) + \frac{\gamma'}{n}\sum_{i\neq j}\|D_{C(i,j)}u_0\|_2 + \gamma'\sum_{i\neq j}\|D_{C(i,j)}u_0\|_2^2.$$

Note the first summand tends to 0 as n, p → ∞; thus, consistency of the estimates follows whenever these oracle terms are o(1). Regarding when this occurs: suppose p > n and there exists a fixed number of distinguishing features. In this setting, ‖D_{C(i,j)} u₀‖₂ = O(1), and so ∑_{i≠j} ‖D_{C(i,j)} u₀‖₂ as well as ∑_{i≠j} ‖D_{C(i,j)} u₀‖₂² are O(n²). This implies that our method is prediction consistent whenever √(n³ log(pn²)/p) = o(1). We see that up to a logarithmic factor, it is sufficient that n³ = o(p) for the condition to hold, which we may expect in sufficiently high-dimensional settings.
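As a small numerical check on the vectorized form used in (14), the following snippet builds the block-diagonal weight matrix W under the column-stacking convention above (with arbitrary small dimensions of our choosing) and confirms that the quadratic form matches the weighted data-fit term of the original objective.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 6, 4, 0.3
X = rng.normal(size=(n, p))
mu = rng.normal(size=(n, p))
w = rng.dirichlet(np.ones(p))                      # a point on the simplex

# column-stacking vectorization and the block-diagonal weight matrix W
x = X.flatten(order="F")                           # x_{(j-1)n + i} = X_ij
u = mu.flatten(order="F")
W = np.kron(np.diag(w ** 2 + lam * w), np.eye(n))  # diag(w_l^2 + lam*w_l) kron I_n

quad_form = (x - u) @ W @ (x - u)
direct = ((w ** 2 + lam * w) * (X - mu) ** 2).sum()
assert np.isclose(quad_form, direct)
```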
3.2 Algorithmic properties

Here we examine the algorithmic aspects of biconvex clustering, beginning with its convergence in a finite number of iterations.

Theorem 3. Within a finite number of steps, Algorithm 1 converges to a coordinate-wise minimum of (9).

Proof. Let g_t be the value of the objective function (9) at iteration t. From the update steps of Algorithm 1, it is clear that g_{t+1} ≤ g_t for all t ∈ N. Thus the sequence {g_t}_{t=1}^∞ forms a decreasing sequence, and moreover g_t ≥ 0 for all t ∈ N. Thus {g_t}_{t=1}^∞ converges by the monotone convergence theorem. Hence, for all ε > 0 there exists T ∈ N such that g_T − g_{T+1} ≤ ε, so that an absolute or relative convergence criterion based on (9) is satisfied in finitely many iterations. Because the objective is biconvex, applying Theorem 5.1 of Tseng (2001) immediately implies that the limit point is a coordinate-wise minimum of (9).

Complexity. We now examine the computational cost of the proposed method. Evaluating µ_il in Step 1 takes O(k) steps, where k is the number of nonzero summands, i.e., the number of nonzero φ_ij coefficients. Since there are n × p such computations, it takes O(npk) time to complete the centroid updates, which can be executed in parallel. Computing ∑_{i=1}^{n} (x_il − µ_il)² requires O(n) time for each l = 1, …, p, contributing O(np) in total, while solving for α* via bisection is O(p), as is the soft-threshold update of w. Therefore, the overall per-iteration complexity of the algorithm is O(npk + np + p) = O(npk). Note that for vanilla convex clustering, the computational complexity is O(n²p) for ADMM and O(npk) for AMA, where k is also quadratic in n in the worst case. Thus we are able to either match or improve upon its complexity despite handling the additional task of variable weighing and selection simultaneously.

Cluster assignments from µ. Convex clustering promotes centroids to merge at convergence, but one must "read off" the unique rows of a matrix V containing all difference pairs appearing in the penalty term to make cluster assignments. Doing so calls breadth-first search (BFS) to identify connected components of the graph induced by V (Chi and Lange, 2015), which has linear complexity in the cardinality of the edge set plus the vertex set, O(|E| + |V|) = O(n²). Recall that squaring the penalty term in our approach does not lead to exact coalescence of centroids; in place of BFS, we instead determine cluster assignments from µ̂ by a dynamic tree cut (Langfelder et al., 2008) on the resulting dendrogram. The bottleneck computation involves creating a distance matrix, which also has O(n²) complexity but is trivially parallelizable. Even though our algorithm does not rely on exact fusion of centroids, its assignment step does not entail higher computational cost. The efficacy of this approach is demonstrated in Section 4.
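To make the assignment step concrete, the sketch below extracts flat clusters from the nearly-merged rows of µ̂ using SciPy's hierarchical clustering tools. A fixed-height cut stands in for the dynamic tree cut of Langfelder et al. (2008) that we actually use, and the plain Euclidean distances between centroid rows, average linkage, and the cut height are simplifying choices made for illustration.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def assign_clusters(mu_hat, cut_height=0.5):
    """Cluster labels from the nearly-merged rows of the fitted centroid matrix.

    Builds a dendrogram on pairwise distances between rows of mu_hat and cuts
    it at a fixed height (a simple stand-in for a dynamic tree cut).
    """
    tree = linkage(pdist(mu_hat), method="average")   # average-linkage dendrogram
    return fcluster(tree, t=cut_height, criterion="distance")
```

The pdist call is the O(n²) distance computation identified as the bottleneck above.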
Nearest neighbor affinities. When users apply the optional Step 4, Algorithm 1 allows the affinities φ_ij to adapt with the learned feature space. Existing approaches to convex clustering have been criticized for their strong dependence on the choice of φ, which remains fixed throughout the algorithm and strongly influences performance. Under a dense choice of φ, clusters may not merge until the trivial solution at the origin, and convex clustering may perform worse than standard k-means (Tan and Witten, 2015). Further computational considerations arise, for instance in choosing step-sizes within splitting methods based on the eigenvalues of the Laplacian associated with the edge set E induced by φ. Our algorithm can take advantage of the same recommended heuristics for initializing φ, but allows these a priori choices to be corrected throughout the task by recomputing all distances under the induced norm ‖·‖_w as the estimate for w improves. The merits of this adaptivity, which occurs in Step 4 of Algorithm 1, are investigated empirically in the following section.

Figure 3: Dendrograms for solutions to the motivating example (Sec. 1). Identifying the two true clusters via a tree cut is straightforward, while competing methods are sensitive to the choice of cut height. Panels: (a) Sparse Hierarchical Clustering, (b) Biconvex Clustering, (c) Sparse Convex Clustering, (d) Sparse Convex with learned affinities.

4 Simulation Studies

In this section, we examine several performance aspects of the proposed method. Before comparing the feature selection and clustering accuracy of biconvex clustering and peer methods, we highlight the reduced reliance on, and sensitivity to, heuristic specification of φ_ij. To this end, we revisit the simple motivating example from Section 1 and examine the dendrograms produced under various settings. Figure 3 displays the resulting dendrograms obtained under peer algorithms from the motivating example, showing that the height of the branches corresponding to correct separation into the two true clusters is markedly larger under our approach than under the competitors. This clear separation suggests that any reasonable choice of dynamic tree cut provides a stable way to convert the solution µ̂ to accurate cluster assignments. We do not observe this to be the case when examining dendrograms of competing sparse clustering methods. When initializing the competing sparse convex clustering approach using our learned affinities φ_ij induced by ŵ (discussed further below), the dendrogram structure under the competing approach is largely corrected, though still visibly more sensitive to the choice of tree cut than our method. Under both settings for the choice of φ_ij, sparse convex clustering did not lead to exact merging despite the group lasso penalty on columns; thus its solutions are also visualized as dendrograms rather than applying breadth-first search to identify unique rows of the centroid matrix.

Next, we take a closer look at the effect of various choices of φ_ij on performance in a controlled setting. We consider a simulated dataset in a fairly low signal-to-noise regime, with ambient dimension p = 200 but only 5 features relevant to clustering, and n = 50 observations. The data are visualized using t-SNE in the first panel of Figure 4. Panel (b) shows that when updating affinities φ_ij with the choice of k = 5 nearest neighbors, biconvex clustering produces an estimate whose dendrogram clearly reveals the true structure. In contrast to convex clustering and its existing variants, this remains true even without sparsifying the affinities via k-nearest neighbors: that is, setting k = n so that the neighbor graph is dense, panel (c) reveals that the relative height of the dendrogram that corresponds to a perfect recovery of the four true clusters is again very pronounced.
Figure 4: Dendrograms produced by biconvex clustering with and without affinity updates, showing that the best result is obtained when the affinities φ_ij are updated using k-NN (with k = 5). Panels: (a) t-SNE plot of the relevant features, (b) affinity updates with k-NN, (c) affinity updates without k-NN, (d) no affinity updates.

Finally, we consider biconvex clustering without any affinity updates; the final plot, color coded with the ground truth, is shown in Figure 4(d), which now fails to perfectly recover the true partition. This study highlights the robustness of our method to heuristic choices in φ_ij: the algorithm produces dendrograms that are able to reveal the ground truth whether or not we sparsify the choice of affinities. At the same time, the example demonstrates the power of adapting within the learned feature space, without which it is not possible to cut the dendrogram and obtain the true clustering.

We next consider several simulation studies that more closely examine the merits of biconvex clustering empirically in terms of feature selection accuracy and clustering accuracy.

4.1 Feature selection

The following simulation examines the feature selection performance of our method compared to the widely used sparse k-means clustering method of Witten and Tibshirani (2010). Our simulation study begins with n = 1000, p = 100 and k = 5. The matrix Θ ∈ R^{k×p}, whose rows contain the true cluster centroids, is generated as follows:

1. Simulate θ_{j,l} ∼ Unif(0, 1) for all j = 1, …, k and l = 1, …, 5.
2. Set θ_{j,l} = 0 for all l ∉ {1, …, 5} and all j.

After obtaining Θ, we generate data X so that only the first 5 features are informative toward distinguishing clusters: if l ∈ {1, …, 5}, x_il is drawn from an equal-weight mixture of k Gaussians centered at θ_{1,l}, …, θ_{k,l}, while x_il ∼ N(0, 1) if l ∉ {1, …, 5}. We standardize all features and repeat the simulation to generate 30 simulated datasets. We compare results under biconvex clustering with tuning parameter γ = 100, and sparse k-means with parameter s* selected using the gap statistic (Tibshirani et al., 2001) over a fine grid of 100 values of s, using the R package sparcl.

Figure 5: Panels 5a and 5b show the boxplots of the feature weights for the 5 relevant features for biconvex clustering and sparse k-means, respectively. Panels 5c and 5d show the same for the 95 irrelevant features. Biconvex clustering consistently zeroes out the irrelevant features; sparse k-means fails to do so, frequently assigning nonzero weights to noise variables.

Boxplots of the resulting feature weights obtained under both algorithms, separated by relevant and irrelevant features, are displayed in Figure 5. From panels 5a and 5c, we see that biconvex clustering consistently assigns weight to all 5 relevant features, while correctly zeroing out the remaining irrelevant features.
On the other hand, while sparse k-means assigns a large chunk of weight to the relevant features (Figure 5b), there is much more variance across datasets. More to the point, Figure 5d shows that sparse k-means fails to consistently zero out irrelevant dimensions, often assigning them weight values on par with the relevant features. Finally, though our formulation loses convexity, we observe that in practice the method is quite insensitive to the initial guess. We extend this simulation study in the Appendix, demonstrating that solutions remain stable to random initializations of w.

4.2 Clustering quality with many clusters

Having seen that our method is more stable in terms of feature selection in the previous simulation study, we now shift attention to examining the quality of clusterings produced by the various peer algorithms more closely. Non-convex objective functions such as those based on k-means are known to become increasingly susceptible to poor local minima when the number of clusters grows (Lloyd, 1982; Xu and Lange, 2019). Here we show that our biconvex formulation enjoys robustness to this phenomenon compared to competing nonconvex approaches. On the other hand, though we can no longer guarantee that solutions are necessarily global minimizers, our method delivers more accurate solutions than convex clustering and its sparse counterpart. Our empirical studies indicate that this tradeoff is well worth it, and suggest that when the convex relaxation may be far from the original formulation for which it serves as a proxy, its global solution may not translate to desirable clustering performance.

Figure 6: t-SNE plots showing the performance of the peer algorithms when the number of clusters is high in a sparse clustering scenario. Panels: (a) Biconvex, (b) Sparse Convex Clustering, (c) Sparse Hierarchical, (d) Sparse k-means, (e) Convex Clustering, (f) Sparse Convex with learned affinities.

To illustrate this claim empirically, we simulate data as in Section 4.1 but increase the number of clusters to k = 20. The performance (best performance out of 20 independent runs for sparse k-means and sparse hierarchical clustering) of the peer algorithms is shown in Figure 6, plotted using t-Stochastic Neighbourhood Embedding (t-SNE) (Maaten and Hinton, 2008). Convex clustering, sparse convex clustering, and the sparse variants of k-means and hierarchical clustering are all implemented using their respective R packages. Sparse k-means tends to deteriorate when it is trapped by a poor local optimum of the objective function, which becomes highly non-convex when k is large. Convex clustering unsurprisingly fails because it is not designed for high-dimensional scenarios where many of the features contain no information about the cluster structure of the data.

The poor performance of sparse convex clustering (Wang et al., 2018) may arise largely due to the strong dependence on the choice of the affinity parameters, since the signal-to-noise ratio is quite low and the selection of nearest neighbours is significantly influenced by the irrelevant dimensions. To assess this suspicion, we try sparse convex clustering again under the learned affinities resulting from the estimate ŵ obtained by biconvex clustering. Indeed, Figure 6f shows that the performance of sparse convex clustering is rescued when provided with φ̂_ij defined with respect to the weighted feature space induced by ŵ. Finally, the solution under our biconvex clustering method is identical to the ground truth partition, plotted in Figure 6a. Not shown in the plot, biconvex clustering also assigns positive feature weights only to the true discriminative features. The resulting dendrograms are displayed in Figure 9 of the Appendix, which provides further insight validating this claim. Due to differences in implementation, we do not include a detailed runtime comparison. We note, however, that even a naive implementation of biconvex clustering runs significantly faster than convex clustering and sparse convex clustering.

4.3 Increasing numbers of uninformative features

This simulation study aims to assess the performance of biconvex clustering and competitors as the number of non-informative features increases. The first two features are drawn from four bivariate Gaussian distributions with means at (±1, ±1) with equal probability; these corners of the square with side length 2 can be thought of as the true centers along the first two dimensions. The remaining d features are uninformative, drawn from a standard normal distribution; we examine performance as d increases.

We run each algorithm 20 times as the number of noise variables d varies from 0 to 30. The average performance of each peer method is shown in Figure 7. Biconvex clustering perfectly recovers the ground truth clustering of the data for all values of d, which is perhaps unsurprising given the simple ground truth structure. However, the average performance of the other competing algorithms deteriorates noticeably as the number of noise variables increases, performing almost as poorly as random assignment at d = 30. It should be noted that biconvex clustering remains effective even as the signal-to-noise ratio continues to drop, yielding perfect clustering when d = 100 (not shown) in this simulation setting.
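In outline, this protocol is easy to reproduce. The sketch below generates the four-cluster data with d additional noise features and scores a partition against the ground truth via the adjusted Rand index; scikit-learn's k-means appears only as a placeholder for the methods actually compared in Figure 7, and the sample size and within-cluster spread are our own choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def simulate(n=200, d=10, rng=None):
    """First two features: four Gaussians at (+/-1, +/-1); remaining d are noise."""
    rng = np.random.default_rng() if rng is None else rng
    centers = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    labels = rng.integers(0, 4, size=n)
    signal = centers[labels] + 0.2 * rng.normal(size=(n, 2))  # 0.2 is our choice
    noise = rng.normal(size=(n, d))
    return np.hstack([signal, noise]), labels

rng = np.random.default_rng(0)
for d in (0, 10, 20, 30):
    X, truth = simulate(d=d, rng=rng)
    pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(d, round(adjusted_rand_score(truth, pred), 3))
```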
Figure 7: Performance of different peer algorithms in a low signal-to-noise ratio setting as the number of irrelevant features increases (Section 4.3), measured by average ARI as a function of the number of irrelevant features d. Here SCC, BCC, and CC denote Sparse Convex Clustering, Biconvex Clustering, and Convex Clustering, respectively.

5 Case Studies

We turn our attention to several case studies. On these real datasets, we evaluate the performance of the algorithms via the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) between the output partition and the ground truth. An ARI value of 1 indicates perfect clustering, whereas a value of 0 indicates complete mismatch between the ground truth and the partition obtained by the algorithm.

5.1 Libras movement data

We begin by analyzing the Libras movement data corpus, which consists of 15 classes each containing 24 observations. Each class represents a hand movement type, and observations are described via 90 features representing the coordinates of movement. The data are publicly available (Dua and Graff, 2017), and a subset of the full dataset was chosen to showcase the sparse convex clustering algorithm of Wang et al. (2018), enabling us to consider a conservative comparison to prior work.

We normalize all features before clustering and, following Wang et al. (2018), use 6 movement types for evaluation purposes, chosen to avoid excess overlap among some highly correlated classes. Each of the peer algorithms is tuned and run using their recommended settings, as implemented in their respective R packages.
We run all sparse k-means variants from 20 initial guesses and report the best performing trial. The hyperparameters of convex and sparse convex clustering are selected using the stability selection procedure outlined by Fang and Wang (2012). We apply biconvex clustering with γ = 100 and use the dynamicTreeCut package in R to assign cluster labels to the resulting centroid estimates (Langfelder et al., 2008). The partition we obtain is shown in Figure 8d. We also run the average linkage algorithm on the dataset, again applying dynamicTreeCut, as well as the sparse convex clustering algorithm using S-AMA (Wang et al., 2018) and sparse k-means clustering (Witten and Tibshirani, 2010). The solutions are visualized in Figure 8, and the ARI values obtained from the peer algorithms are displayed in Table 1. We see that the difference in performance between the proposed method and existing competitors is clear. Interestingly, we find that biconvex clustering does not zero out any features in this case, though proper feature weighing clearly helps it to uncover the true cluster structure in the dataset.

Table 1: Feature selection and clustering performance, Libras movement data.

Algorithm                              Selected features   Clusters found   ARI
k-means                                83                  6                0.46
Sparse Hierarchical Clustering         14                  6                0.11
Average Linkage                        90                  4                0.36
Convex Clustering                      90                  8                0.61
Sparse Convex Clustering (S-AMA)       63                  3                0.31
Sparse Convex Clustering (S-ADMM)      13                  3                0.31
Biconvex Clustering                    90                  5

Figure 8: Comparison of solutions in the t-SNE embedding, Libras dataset. Panels: (a) Ground Truth, (b) Convex Clustering, (c) Sparse Convex, (d) Biconvex, (e) Sparse Hierarchical, (f) Sparse k-means, (g) Average Linkage.

5.2 Leukemia data

We next revisit a classic DNA microarray dataset from a study of leukemia on human subjects. The data are collected and described by Golub et al. (1999) and, after standard preprocessing, consist of 3571 gene expression levels collected over 72 samples (comprised of 62 bone marrow samples and 10 from the peripheral blood). Out of the 72 samples, 47 correspond to acute lymphoblastic leukemia (ALL) and 25 to acute myeloblastic leukemia (AML). The observations are centered and scaled before use.

To distinguish between these two classes on the basis of gene expression, we apply biconvex clustering with hyperparameters λ = 0.01 and γ = 0.1. At convergence, our method selects 456 relevant genes among the original space of 3571 features, and is able to correctly classify all but one point. Prior work has suggested that this point is likely a potential outlier (Chakraborty et al., 2020a).
In this case study, we emphasize the success of gene selection in addition to recovering the ground truth partition almost perfectly. Among the 50 genes that were deemed potentially relevant to distinguishing between leukemias in the original study by Golub et al. (1999), 46 are also selected by our algorithm. A later study by Chow et al. (2001) revisits the data with a focus on identifying the discriminative genes, employing heuristic criteria based on a mix of three relevance measures: a naive Bayes score, a median vote, and mean aggregate relevance. These measures are combined in a supervised learning framework based on Support Vector Machines (SVMs) that requires true labels as inputs. They reported that the top 50 genes under each of the relevance measures had little overlap; we find that all genes in the intersection of the three criteria are selected by our method. In particular, the genes deemed most significant in this prior study, Adipsin (M84526) and Cystatin C (M27891), were assigned the 3rd and 7th largest feature weights among the 456 selected by biconvex clustering. While findings from neither of these previous studies should be taken as ground truth, it is reassuring that their results are consistent with our feature selection results. Though it is perhaps unsurprising that our method may outperform those applied to these data at the time (these studies predated the sparse clustering framework of Witten and Tibshirani (2010)), our simultaneous clustering and feature weighing method offers new insights in a reexamination of the data. A detailed list of the top 20 genes identified by our method is included in Table 2, and a visualization of solutions via t-SNE is provided in Figure 11 of the Appendix.
Table 2: Top 20 genes identified by biconvex clustering on the leukemia dataset, along with their gene IDs, annotations, and feature weights.

Gene ID          Gene Annotation                                                   Feature Weight
D88270_at        pre-B lymphocyte gene 1, VPREB1                                   0.008134539
D88422_at        Cystatin A                                                        0.010030157
M11722_at        Terminal transferase mRNA                                         0.007699765
M23197_at        CD33 antigen (differentiation antigen)                            0.010011479
M27891_at        CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage)      0.012265694
M28826_at        CD1B antigen (thymocyte antigen)                                  0.050758810
M63138_at        CTSD Cathepsin D (lysosomal aspartyl protease)                    0.007746540
M84526_at        DF D component of complement (adipsin)                            0.014803023
M89957_at        IGB Immunoglobulin-associated beta (B29)                          0.008915087
U05259_rna1_at   MB-1 gene                                                         0.010192411
U46499_at        Glutathione S-Transferase, Microsomal                             0.013318948
X04145_at        CD3G antigen, gamma polypeptide (TiT3 complex)                    0.010848832
X14975_at        GB DEF = CD1 R2 gene for MHC-related antigen                      0.011330957
X58529_at        Immunoglobulin mu, part of exon 8                                 0.009955146
X87241_at        HFat protein                                                      0.012668398
X95735_at        Zyxin                                                             0.009213922
Z15115_at        TOP2B Topoisomerase (DNA) II beta (180kD)                         0.009112349
M13560_s_at      Probable Protein Disulfide Isomerase ER-60 Precursor              0.008775355
X76223_s_at      GB DEF = MAL gene exon 4                                          0.015623870
M31523_at        TCF3 Transcription factor 3 (E2A immunoglobulin enhancer          0.013837447
                 binding factors E12/E47)

6 Discussion

The proposed methodology seeks to address three chief challenges in clustering: high dimensionality via feature weighing and selection, sensitivity to initialization via a biconvex formulation, and selection of cluster centers via a penalty formulation that encourages centroids to agglomerate along the solution path. Biconvex clustering can be seen as occupying a middle ground between the convex clustering objective and the original non-convex objective arising from hierarchical clustering. We build upon recent algorithmic developments for convex clustering and its variants, bridging these insights to good intuition established in the classical clustering literature.

Not only does feature selection reduce dimension while retaining interpretability of the features (compared to a generic dimension-reduction pre-processing step), but the feature weights obtained by the algorithm improve clustering quality while serving as informative and interpretable quantities themselves. Our algorithm returns simple alternating updates that resemble classical clustering algorithms, while conferring several advantages over traditional approaches. In particular, we show that sacrificing convexity and exact coalescence of centroids enables algorithmic and theoretical benefits that we consider well worth the tradeoff. Fruitful avenues for future work include extending the framework to tensor settings (Sun and Li, 2019), and exploring other optimization frameworks such as semi-smooth Newton (Yuan et al., 2018; Sun et al., 2018) and stochastic descent algorithms (Panahi et al., 2017) that may lead to further computational gains. Similar penalties involving k-NN affinities within a fusion penalty have recently been studied for non-parametric regression (Madrid Padilla et al., 2020). Similar to their benefits in our setting, the k-NN terms enable "manifold adaptivity", in addition to the fusion term, which provides local adaptivity. It will be fruitful to explore the extent to which our contributions are useful in related contexts such as regression and trend filtering, as well as to potentially import their theoretical insights into the clustering setting.

References

Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420–434. Springer.

Bachem, O., Lucic, M., Hassani, H., and Krause, A. (2016). Fast and provably good seedings for k-means. In Advances in Neural Information Processing Systems, pages 55–63.

Chakraborty, S. and Das, S. (2019). A strongly consistent sparse k-means clustering with direct ℓ1 penalization on variable weights. arXiv preprint arXiv:1903.10039.

Chakraborty, S., Paul, D., and Das, S. (2020a). Hierarchical clustering with optimal transport. Statistics & Probability Letters, page 108781.

Chakraborty, S., Paul, D., Das, S., and Xu, J.
(2020b). Entropy weighted power k-means clustering. In International Conference on Artificial Intelligence and Statistics, pages 691–701.

Chi, E. C., Gaines, B. R., Sun, W. W., Zhou, H., and Yang, J. (2018). Provable convex co-clustering of tensors. arXiv preprint arXiv:1803.06518.

Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013.

Chi, E. C. and Steinerberger, S. (2019). Recovering trees with convex clustering. SIAM Journal on Mathematics of Data Science, 1(3):383–407.

Chow, M., Moler, E., and Mian, I. (2001). Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiological Genomics, 5(2):99–111.

de Amorim, R. C. (2016). A survey on feature weighting based k-means algorithms. Journal of Classification, 33(2):210–242.

De Amorim, R. C. and Mirkin, B. (2012). Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering. Pattern Recognition, 45(3):1061–1075.

DeSarbo, W. S., Carroll, J. D., Clark, L. A., and Green, P. E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables. Psychometrika, 49(1):57–78.

Dua, D. and Graff, C. (2017). UCI Machine Learning Repository.

Fang, Y. and Wang, J. (2012). Selection of the number of clusters via the bootstrap method. Computational Statistics & Data Analysis, 56(3):468–477.

Friedman, J. H. and Meulman, J. J. (2004). Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4):815–849.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537.

Gorski, J., Pfeuffer, F., and Klamroth, K. (2007). Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407.

Hamerly, G. and Elkan, C. (2004). Learning the k in k-means. In Advances in Neural Information Processing Systems, pages 281–288.

Hanson, D. L. and Wright, F. T. (1971). A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083.

Hebiri, M. and van de Geer, S. (2011). The Smooth-Lasso and other ℓ1 + ℓ2-penalized methods. Electronic Journal of Statistics, 5:1184–1226.

Hocking, T. D., Joulin, A., Bach, F., and Vert, J.-P. (2011). Clusterpath: An algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pages 745–752, Madison, WI, USA. Omnipress.

Huang, J. Z., Ng, M. K., Rong, H., and Li, Z. (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):657–668.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666.

Jing, L., Ng, M. K., and Huang, J. Z. (2007). An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering, 19(8):1026–1041.

Lam, K. Y., Westrick, Z. M., Müller, C.
L., Christiaen, L., and Bonneau, R. (2016). Fused regression for multi-source gene regulatory network inference. PLoS Computational Biology, 12(12):e1005157.

Land, S. R. and Friedman, J. H. (1997). Variable fusion: A new adaptive signal regression method. Technical Report 656, Department of Statistics, Carnegie Mellon University.

Langfelder, P., Zhang, B., and Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics, 24(5):719–720.

Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562.

Lindsten, F., Ohlsson, H., and Ljung, L. (2011). Clustering using sum-of-norms regularization: With application to particle filter output computation. In IEEE Statistical Signal Processing Workshop, pages 201–204. IEEE.

Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137.

Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA.

Madrid Padilla, O. H., Sharpnack, J., Chen, Y., and Witten, D. M. (2020). Adaptive nonparametric regression with the k-nearest neighbour fused lasso. Biometrika, 107(2):293–310.

Modha, D. S. and Spangler, W. S. (2003). Feature weighting in k-means clustering. Machine Learning, 52(3):217–237.

Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. (2013). The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1–22.

Panahi, A., Dubhashi, D., Johansson, F. D., and Bhattacharyya, C. (2017). Clustering by sum of norms: Stochastic incremental algorithm, convergence and cluster recovery. In International Conference on Machine Learning, pages 2769–2777.

Park, C., Choi, H., Delcher, C., Wang, Y., and Yoon, Y. J. (2019). Convex clustering analysis for histogram-valued data. Biometrics, 75(2):603–612.

Pelckmans, K., De Brabanter, J., Suykens, J. A., and De Moor, B. (2005). Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering.

Price, B. S., Geyer, C. J., and Rothman, A. J. (2015). Ridge fusion in statistical learning. Journal of Computational and Graphical Statistics, 24(2):439–454.

Radchenko, P. and Mukherjee, G. (2017). Convex clustering via ℓ1 fusion penalization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1527–1546.

Sun, D., Toh, K.-C., and Yuan, Y. (2018). Convex clustering: Model, theoretical guarantee and efficient algorithm. arXiv preprint arXiv:1810.02677.

Sun, W. W. and Li, L. (2019). Dynamic tensor clustering. Journal of the American Statistical Association, pages 1–28.

Tan, K. M. and Witten, D. (2015). Statistical properties of convex clustering. Electronic Journal of Statistics, 9(2):2324.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108.

Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.

Tropp, J. A. (2006).
Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494.
Wang, B., Zhang, Y., Sun, W. W., and Fang, Y. (2018). Sparse convex clustering. Journal of Computational and Graphical Statistics, 27(2):393–403.
Wang, M. and Allen, G. I. (2019). Integrative generalized convex clustering optimization and feature selection for mixed multi-view data. arXiv preprint arXiv:1912.05449.
Witten, D. M. and Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726.
Wu, T., Benson, A. R., and Gleich, D. F. (2016). General tensor spectral co-clustering for higher-order data. In Advances in Neural Information Processing Systems, pages 2559–2567.
Xu, J. and Lange, K. (2019). Power k-means clustering. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6921–6931, Long Beach, California, USA. PMLR.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67.
Yuan, Y., Sun, D., and Toh, K.-C. (2018). An efficient semismooth Newton based algorithm for convex clustering. In International Conference on Machine Learning, pages 5718–5726.
Zhu, C., Xu, H., Leng, C., and Yan, S. (2014). Convex optimization procedure for clustering: theoretical revisit. In Advances in Neural Information Processing Systems, pages 1619–1627.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.

Appendix

A Proof of Theorem 1

We now discuss the solution of the optimization problem $P$ defined in Section 2.2. Since $\boldsymbol{\mu}$ is kept fixed in this subproblem, we define the constant $D_l = \sum_{i=1}^{n} (x_{il} - \mu_{il})^2$ for brevity. Thus $f(\boldsymbol{\mu}, \boldsymbol{w})$ (equation (10)) can be written as
$$ g(\boldsymbol{w}) = \sum_{l=1}^{p} (w_l^2 + \lambda w_l)\, D_l + \text{(constant in } \boldsymbol{w}\text{)}. \quad (15) $$
Equation (15) is to be minimized subject to the constraints $\sum_{l=1}^{p} w_l = 1$ and $w_l \geq 0$ for all $l = 1, \dots, p$. We write the Lagrangian for this problem as
$$ \mathcal{L} = \sum_{l=1}^{p} (w_l^2 + \lambda w_l)\, D_l - \alpha \Big( \sum_{l=1}^{p} w_l - 1 \Big) - \sum_{l=1}^{p} \xi_l w_l . $$
The Karush-Kuhn-Tucker (KKT) conditions of optimality read
$$ \frac{\partial \mathcal{L}}{\partial w_l} = 0 \quad \forall\, l = 1, \dots, p, \quad (16) $$
$$ \sum_{l=1}^{p} w_l = 1, \quad (17) $$
$$ w_l \geq 0 \quad \forall\, l = 1, \dots, p, \quad (18) $$
$$ \xi_l \geq 0, \quad (19) $$
$$ \xi_l w_l = 0 \quad \forall\, l = 1, \dots, p. \quad (20) $$
Equation (16) implies that
$$ w_l = \frac{1}{2} \left( \frac{\alpha + \xi_l}{D_l} - \lambda \right). \quad (21) $$
We now consider the following cases.

Case 1: $\alpha / D_l > \lambda$. Since $\xi_l \geq 0$, we conclude from equation (21) that $w_l > 0$. Thus, from equation (20), we deduce that $\xi_l = 0$. Plugging this value into equation (21), we get $w_l = \frac{1}{2} \left( \frac{\alpha}{D_l} - \lambda \right)$.

Case 2: $\alpha / D_l \leq \lambda$. Suppose $w_l > 0$; then equation (20) implies that $\xi_l = 0$, so equation (21) would give $w_l = \frac{1}{2} \left( \frac{\alpha}{D_l} - \lambda \right) \leq 0$, a contradiction. Thus, in this case, we must have $w_l = 0$.

Since $\lambda, D_l \geq 0$, we may conclude that $\alpha > 0$: otherwise Case 2 would occur for every $l$, implying $w_l = 0$ for all $l = 1, \dots, p$, which violates equation (17). Combining the two cases with equation (21), we obtain
$$ w_l = \frac{1}{2} S\!\left( \frac{\alpha}{D_l}, \lambda \right) = \frac{1}{2} \max\!\left( \frac{\alpha}{D_l} - \lambda,\; 0 \right), $$
where $S(\cdot,\cdot)$ denotes the soft-thresholding operator. Moreover, constraint (17) implies that $\alpha$ must satisfy the equation
$$ \frac{1}{2} \sum_{l=1}^{p} S\!\left( \frac{\alpha}{D_l}, \lambda \right) = 1 . $$
This concludes the proof.
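For concreteness, the sketch below (our illustration, not code from the paper; the function name, the bracketing constants, and the tolerance are arbitrary choices) implements the update that Theorem 1 prescribes: each weight is a halved soft-threshold of $\alpha / D_l$ at level $\lambda$, with $\alpha$ located by bisection so that the weights sum to one. It assumes every $D_l > 0$.

```python
import numpy as np


def update_weights(D, lam, tol=1e-10, max_iter=200):
    """Weight update implied by Theorem 1: w_l = 0.5 * max(alpha / D_l - lam, 0),
    with alpha chosen (by bisection) so that the weights sum to one.
    D is the vector of per-feature costs D_l, assumed strictly positive."""
    D = np.asarray(D, dtype=float)

    def weights(alpha):
        return 0.5 * np.maximum(alpha / D - lam, 0.0)

    # sum_l w_l(alpha) is continuous and nondecreasing in alpha, so the
    # normalizing alpha can be found by bisection on a bracketing interval.
    lo, hi = 0.0, (lam + 2.0) * float(D.max())
    while weights(hi).sum() < 1.0:   # enlarge the bracket if necessary
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if weights(mid).sum() < 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return weights(0.5 * (lo + hi))


# Features with small within-cluster dispersion D_l receive the largest weights.
D_example = np.array([0.5, 0.8, 5.0, 6.0, 7.0])
w = update_weights(D_example, lam=0.1)
print(np.round(w, 4), w.sum())
```

Because $\alpha \mapsto \sum_l w_l(\alpha)$ is continuous and nondecreasing, starting from zero and growing without bound, bisection reliably locates the value of $\alpha$ that satisfies the normalization constraint.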
B Proof of Theorem 2

Theorem 2. Suppose $x = u + \epsilon$, where $\epsilon \in \mathbb{R}^{np}$ is a vector of independent sub-Gaussian random variables with mean $0$ and variance $\sigma^2$. Suppose that $\hat{u}$ and $\hat{W}$ are obtained from minimizing (14). Then if
$$ \gamma' > 2\sigma(1+\lambda)\sqrt{\frac{\log\big(p\binom{n}{2}\big)}{np}}, $$
the bound
$$ \frac{1}{2np}\, \|\hat{u} - u\|^2_W \;\leq\; \sigma^2(1+\lambda)\left[\frac{1}{n} + \sqrt{\frac{\log(np)}{n^2 p}}\right] + \gamma'\Big(1 - \frac{1}{n}\Big) + \frac{\gamma'}{n}\sum_{i \neq j} \|D_{C(i,j)}\, u\|_2 + \gamma' \sum_{i \neq j} \|D_{C(i,j)}\, u\|_2 $$
holds with probability at least $1 - \big(p\binom{n}{2}\big)^{-1} - \exp\Big(-\min\Big\{c_1 \log(np),\; c_2\sqrt{p\log(np)}\Big\}\Big)$.

Proof. Let $D = A \Lambda V_\beta^\top$ be the singular value decomposition (SVD) of $D$. Here $V_\beta \in \mathbb{R}^{np \times p(n-1)}$. We can construct $V_\alpha \in \mathbb{R}^{np \times p}$ such that $V = [V_\alpha, V_\beta]$ is an $np \times np$ orthonormal matrix. Thus $V^\top V = V V^\top = I$ and $V_\alpha^\top V_\beta = 0$. Let $\beta = V_\beta^\top u$ and $\alpha = V_\alpha^\top u$, and set $\gamma' = \gamma / (np)$. Writing $Z = A\Lambda$, the optimization problem (14) becomes
$$ \min_{\alpha, \beta, W} \; \frac{1}{2np}\, (x - V_\alpha \alpha - V_\beta \beta)^\top W (x - V_\alpha \alpha - V_\beta \beta) + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \beta\|_2 . \quad (22) $$
Let $\hat{\alpha}, \hat{\beta}, \hat{W}$ be the minimizers of (22). We first observe that $\hat{u} = V_\alpha \hat{\alpha} + V_\beta \hat{\beta}$. To simplify notation, we write $\|y\|^2_W = y^\top W y$. By definition of the minimizers, we have
$$ \frac{1}{2np}\,\|x - V_\alpha \hat{\alpha} - V_\beta \hat{\beta}\|^2_W + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \hat{\beta}\|_2 \;\leq\; \frac{1}{2np}\,\|x - V_\alpha \alpha - V_\beta \beta\|^2_W + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \beta\|_2 . \quad (23) $$
Substituting $x = u + \epsilon$, we get
$$ \frac{1}{2np}\,\|V_\alpha(\hat{\alpha} - \alpha) + V_\beta(\hat{\beta} - \beta)\|^2_W + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \hat{\beta}\|_2 \;\leq\; \frac{1}{np}\, G(\hat{\alpha}, \hat{\beta}, \hat{W}) + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \beta\|_2 , \quad (24) $$
where $G(\hat{\alpha}, \hat{\beta}, \hat{W}) = \epsilon^\top \hat{W} \big(V_\alpha(\hat{\alpha} - \alpha) + V_\beta(\hat{\beta} - \beta)\big)$. Since $\hat{\alpha}$ is a minimizer of (22), we can choose $\hat{\alpha}$ such that $x - V_\beta \hat{\beta} - V_\alpha \hat{\alpha} = 0$, that is,
$$ \hat{\alpha} = V_\alpha^\top (x - V_\beta \hat{\beta}) = V_\alpha^\top (u + \epsilon - V_\beta \hat{\beta}) = \alpha + V_\alpha^\top \epsilon . $$
Thus, we have
$$\begin{aligned}
\frac{1}{np}\,\big|G(\hat{\alpha}, \hat{\beta}, \hat{W})\big| &= \frac{1}{np}\,\big|\epsilon^\top \hat{W}\,[V_\alpha(\hat{\alpha} - \alpha) + V_\beta(\hat{\beta} - \beta)]\big| = \frac{1}{np}\,\big|\epsilon^\top \hat{W}\,[V_\alpha V_\alpha^\top \epsilon + V_\beta(\hat{\beta} - \beta)]\big| \\
&\leq \frac{1}{np}\,\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon + \frac{1}{np}\,\big|\epsilon^\top \hat{W} V_\beta(\hat{\beta} - \beta)\big| \\
&= \frac{1}{np}\,\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon + \frac{1}{np}\,\big|\epsilon^\top \hat{W} V_\beta Z^\dagger Z(\hat{\beta} - \beta)\big| \\
&\leq \frac{1}{np}\,\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon + \frac{1}{np} \sum_{i \neq j} \big|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)} Z_{C(i,j)}(\hat{\beta} - \beta)\big| \\
&\leq \frac{1}{np}\,\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon + \frac{1}{np} \sum_{i \neq j} \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_2 \, \|Z_{C(i,j)}(\hat{\beta} - \beta)\|_2 \\
&\leq \frac{1}{np}\,\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon + \frac{1}{np} \max_{i \neq j} \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_2 \sum_{i \neq j} \|Z_{C(i,j)}(\hat{\beta} - \beta)\|_2 .
\end{aligned}$$
Next, we derive high-probability bounds for the terms $\frac{1}{np}\,\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon$ and $\max_{i \neq j} \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_2$.

Bound for $\frac{1}{np}\,\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon$: First, $\|\hat{W} V_\alpha V_\alpha^\top\|_{\mathrm{sp}} \leq \|\hat{W}\|_{\mathrm{sp}} \|V_\alpha V_\alpha^\top\|_{\mathrm{sp}} \leq (1+\lambda)$. Also note that
$$ \|\hat{W} V_\alpha V_\alpha^\top\|^2_F = \mathrm{tr}(\hat{W} V_\alpha V_\alpha^\top V_\alpha V_\alpha^\top \hat{W}) = \mathrm{tr}(\hat{W}^2 V_\alpha V_\alpha^\top) = \mathrm{tr}\Big(\hat{W}^2 \sum_{i=1}^{p} v_i v_i^\top\Big) = \sum_{i=1}^{p} v_i^\top \hat{W}^2 v_i \leq \sum_{i=1}^{p} \Lambda_{\max}(\hat{W}^2) \leq p(1+\lambda)^2 , $$
where $v_1, \dots, v_p$ are the columns of $V_\alpha$. By Lemma 1, writing $M = \hat{W} V_\alpha V_\alpha^\top$, we have
$$ \mathbb{P}\big(\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon \geq t + (1+\lambda)\sigma^2 p\big) \leq \mathbb{P}\big(\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon \geq t + \sigma^2 \mathrm{tr}(\hat{W} V_\alpha V_\alpha^\top)\big) \leq \exp\Big\{-\min\Big(\frac{c_1 t^2}{\sigma^4 \|M\|^2_F},\; \frac{c_2 t}{\sigma^2 \|M\|_{\mathrm{sp}}}\Big)\Big\} \leq \exp\Big\{-\min\Big(\frac{c_1 t^2}{\sigma^4 p(1+\lambda)^2},\; \frac{c_2 t}{(1+\lambda)\sigma^2}\Big)\Big\} . $$
Now, taking $t = \sigma^2(1+\lambda)\sqrt{p\log(np)}$, we obtain
$$ \mathbb{P}\left(\frac{1}{np}\,\epsilon^\top \hat{W} V_\alpha V_\alpha^\top \epsilon \geq \sigma^2(1+\lambda)\left[\frac{1}{n} + \sqrt{\frac{\log(np)}{n^2 p}}\right]\right) \leq \exp\Big(-\min\Big\{c_1 \log(np),\; c_2\sqrt{p\log(np)}\Big\}\Big) . \quad (25) $$
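The tail bound invoked as Lemma 1 above has the form of the Hanson–Wright inequality (Hanson and Wright, 1971). For reference, a standard statement — with absolute constants $c_1, c_2$ that we do not track — reads: for a vector $\epsilon$ of independent, mean-zero, sub-Gaussian entries with parameter $\sigma$ and a fixed matrix $M$,
$$ \mathbb{P}\Big(\big|\epsilon^\top M \epsilon - \mathbb{E}\,\epsilon^\top M \epsilon\big| > t\Big) \;\leq\; 2\exp\Big\{-\min\Big(\frac{c_1 t^2}{\sigma^4\|M\|^2_F},\; \frac{c_2 t}{\sigma^2\|M\|_{\mathrm{sp}}}\Big)\Big\}, \qquad t > 0 ; $$
the one-sided bound used in the display above then follows whenever $\mathbb{E}\,\epsilon^\top M \epsilon \leq \sigma^2\,\mathrm{tr}(M)$.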
Bound for $\max_{i \neq j} \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_2$: Let $e_j$ be the $j$-th coordinate vector of length $p\binom{n}{2}$ and let $y_j = e_j^\top (Z^\dagger)^\top V_\beta^\top \hat{W} \epsilon$. Note that $\Lambda_{\max}(\hat{W}) \leq (1+\lambda)$, $\Lambda_{\max}(V_\beta) = 1$, and $\Lambda_{\max}(Z^\dagger) = 1/\sqrt{n}$. Thus $y_j$ is a one-dimensional sub-Gaussian random variable with mean $0$ and variance at most $\sigma^2(1+\lambda)^2 / n$. Thus,
$$ \mathbb{P}\Big(\max_j |y_j| \geq z\Big) \leq p\binom{n}{2}\, \mathbb{P}(|y_j| \geq z) \leq p\binom{n}{2} \exp\Big\{-\frac{n z^2}{2\sigma^2(1+\lambda)^2}\Big\} . \quad (26) $$
Now, choosing $z = 2\sigma(1+\lambda)\sqrt{\log\big(p\binom{n}{2}\big)/n}$, we get
$$ \mathbb{P}\left(\|\epsilon^\top \hat{W} V_\beta Z^\dagger\|_\infty \geq 2\sigma(1+\lambda)\sqrt{\frac{\log\big(p\binom{n}{2}\big)}{n}}\right) = \mathbb{P}\left(\max_j |y_j| \geq 2\sigma(1+\lambda)\sqrt{\frac{\log\big(p\binom{n}{2}\big)}{n}}\right) \leq \frac{1}{p\binom{n}{2}} . \quad (27) $$
Since there are only $p$ indices in $C(i,j)$, we have $\|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_2 \leq \sqrt{p}\, \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_\infty$. Observe that
$$ \frac{1}{np} \max_{i \neq j} \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_2 \leq \frac{\sqrt{p}}{np} \max_{i \neq j} \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_\infty \leq \frac{1}{n\sqrt{p}}\, \|\epsilon^\top \hat{W} V_\beta Z^\dagger\|_\infty . $$
Thus,
$$ \mathbb{P}\left(\frac{1}{np} \max_{i \neq j} \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_2 \geq 2\sigma(1+\lambda)\sqrt{\frac{\log\big(p\binom{n}{2}\big)}{n^3 p}}\right) \leq \mathbb{P}\left(\|\epsilon^\top \hat{W} V_\beta Z^\dagger\|_\infty \geq 2\sigma(1+\lambda)\sqrt{\frac{\log\big(p\binom{n}{2}\big)}{n}}\right) \leq \frac{1}{p\binom{n}{2}} . $$
We see that when $\gamma' > 2\sigma(1+\lambda)\sqrt{\log\big(p\binom{n}{2}\big)/(np)}$, we have
$$ \mathbb{P}\left(\frac{\gamma'}{n} < \frac{1}{np} \max_{i \neq j} \|\epsilon^\top \hat{W} V_\beta Z^\dagger_{C(i,j)}\|_2\right) \leq \frac{1}{p\binom{n}{2}} . \quad (28) $$
Combining equations (25) and (28) implies that
$$ \frac{1}{np}\,\big|G(\hat{\alpha}, \hat{\beta}, \hat{W})\big| \leq \sigma^2(1+\lambda)\left[\frac{1}{n} + \sqrt{\frac{\log(np)}{n^2 p}}\right] + \frac{\gamma'}{n} \sum_{i \neq j} \|Z_{C(i,j)}(\hat{\beta} - \beta)\|_2 \quad (29) $$
holds with probability at least $1 - \big(p\binom{n}{2}\big)^{-1} - \exp\big(-\min\big\{c_1 \log(np),\; c_2\sqrt{p\log(np)}\big\}\big)$. Thus, from equations (24) and (29), we arrive at the inequality
$$ \frac{1}{2np}\,\|V_\alpha(\hat{\alpha} - \alpha) + V_\beta(\hat{\beta} - \beta)\|^2_W + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \hat{\beta}\|_2 \leq \sigma^2(1+\lambda)\left[\frac{1}{n} + \sqrt{\frac{\log(np)}{n^2 p}}\right] + \frac{\gamma'}{n} \sum_{i \neq j} \|Z_{C(i,j)}(\hat{\beta} - \beta)\|_2 + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \beta\|_2 , $$
so that
$$\begin{aligned}
\frac{1}{2np}\,\|V_\alpha(\hat{\alpha} - \alpha) + V_\beta(\hat{\beta} - \beta)\|^2_W &\leq \sigma^2(1+\lambda)\left[\frac{1}{n} + \sqrt{\frac{\log(np)}{n^2 p}}\right] + \gamma' \sum_{i \neq j}\left[\frac{\|Z_{C(i,j)}\, \hat{\beta}\|_2}{n} - \|Z_{C(i,j)}\, \hat{\beta}\|_2\right] + \frac{\gamma'}{n} \sum_{i \neq j} \|Z_{C(i,j)}\, \beta\|_2 + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \beta\|_2 \\
&\leq \sigma^2(1+\lambda)\left[\frac{1}{n} + \sqrt{\frac{\log(np)}{n^2 p}}\right] + \gamma'\,\frac{n(n-1)}{n^2} + \frac{\gamma'}{n} \sum_{i \neq j} \|Z_{C(i,j)}\, \beta\|_2 + \gamma' \sum_{i \neq j} \|Z_{C(i,j)}\, \beta\|_2 \\
&\leq \sigma^2(1+\lambda)\left[\frac{1}{n} + \sqrt{\frac{\log(np)}{n^2 p}}\right] + \gamma'\Big(1 - \frac{1}{n}\Big) + \frac{\gamma'}{n} \sum_{i \neq j} \|D_{C(i,j)}\, u\|_2 + \gamma' \sum_{i \neq j} \|D_{C(i,j)}\, u\|_2 .
\end{aligned}$$
Since $\hat{u} - u = V_\alpha(\hat{\alpha} - \alpha) + V_\beta(\hat{\beta} - \beta)$, this is the claimed bound, which completes the proof.

C Additional figures and tables

C.1 Dendrograms, simulation study 4.2

[Figure 9: two dendrogram panels, (a) Biconvex and (b) Sparse Hierarchical Clustering, with observations on the horizontal axis and height on the vertical axis.]

Figure 9: Dendrograms, color coded with the ground truth, produced by the biconvex clustering and sparse hierarchical clustering algorithms on the simulated data in Section 4.2.
Biconvex clustering perfectly recovers the ground truth using a dynamic tree cut, and it is clear that this can be achieved for a generous range of heights. In contrast, sparse hierarchical clustering fails to recover the true clusters, and the dendrogram structure in panel (b) provides insight into why this is the case.

C.2 Sensitivity to Random Restarts

In this section, we study the effect of random initialization of the feature weights. We generate three datasets, in each of which only the first five features reveal the cluster structure. We then append d additional features, each generated from a standard normal distribution. The datasets are generated using the procedures described in Section 4.1. For each dataset, we start from a randomly chosen initialization of the feature weights, drawing each initial weight from a uniform distribution, and iterate until convergence. A sketch of this protocol appears after this section.

[Figure 10: three box plot panels, (a), (b), and (c), with features on the horizontal axis and feature weights on the vertical axis.]

Figure 10: Box plots of the top-ranked features selected by the BCC algorithm under random initialization of w, as the number of irrelevant features d varies (detailed in Section C.2). The performance of the algorithm appears to be fairly stable even in high dimensions.

C.3 Case Study on Leukemia Data

[Figure: t-SNE embeddings of the leukemia data, with panels (a) Ground Truth, (b) Biconvex, and (c) Sparse k-means; the axes are t-SNE dimensions 1 and 2.]
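The random-restart protocol of Section C.2 can be mimicked with a short simulation. Everything below is an illustrative sketch rather than the paper's exact procedure: the function names, the two-cluster construction, the separation constant, the uniform upper bound, and the rescaling of the random weights onto the simplex are assumptions on our part, since Section 4.1 and the exact initialization bound are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_dataset(n_per_cluster=100, n_informative=5, n_noise=50, sep=3.0):
    """Toy data in the spirit of Section C.2: only the first `n_informative`
    features carry cluster structure; the remaining `n_noise` features are
    standard normal noise. The two-cluster design and `sep` are illustrative."""
    signal_a = rng.normal(0.0, 1.0, size=(n_per_cluster, n_informative))
    signal_b = rng.normal(sep, 1.0, size=(n_per_cluster, n_informative))
    signal = np.vstack([signal_a, signal_b])
    noise = rng.normal(0.0, 1.0, size=(2 * n_per_cluster, n_noise))
    labels = np.repeat([0, 1], n_per_cluster)
    return np.hstack([signal, noise]), labels


def random_weight_init(p, w_max=1.0):
    """Draw initial feature weights uniformly on (0, w_max) and rescale them
    onto the simplex so they satisfy the sum-to-one constraint of the weight
    subproblem; `w_max` is a placeholder for the unspecified upper bound."""
    w = rng.uniform(0.0, w_max, size=p)
    return w / w.sum()


X, y = make_dataset(n_noise=50)   # d = 50 irrelevant features
w0 = random_weight_init(X.shape[1])
print(X.shape, y[:5], round(float(w0.sum()), 6))
```

From such a random start, one would run the biconvex clustering iterations to convergence and record which features receive the largest weights, repeating over many restarts to produce box plots analogous to Figure 10.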