Communication-efficient k-Means for Edge-based Machine Learning
Hanlin Lu, Ting He, Shiqiang Wang, Changchang Liu, Mehrdad Mahdavi, Vijaykrishnan Narayanan, Kevin S. Chan, Stephen Pasteris
Abstract
We consider the problem of computing the k-means centers for a large high-dimensional dataset in the context of edge-based machine learning, where data sources offload machine learning computation to nearby edge servers. k-Means computation is fundamental to many data analytics, and the capability of computing provably accurate k-means centers by leveraging the computation power of the edge servers, at a low communication and computation cost to the data sources, will greatly improve the performance of these analytics. We propose to let the data sources send small summaries, generated by joint dimensionality reduction (DR) and cardinality reduction (CR), to support approximate k-means computation at reduced complexity and communication cost. By analyzing the complexity, the communication cost, and the approximation error of k-means algorithms based on state-of-the-art DR/CR methods, we show that: (i) it is possible to achieve a near-optimal approximation at a near-linear complexity and a constant or logarithmic communication cost, (ii) the order of applying DR and CR significantly affects the complexity and the communication cost, and (iii) combining DR/CR methods with a properly configured quantizer can further reduce the communication cost without compromising the other performance metrics. Our findings are validated through experiments based on real datasets.

Index Terms
Coreset, dimensionality reduction, random projection, k-means, edge-based machine learning.
H. Lu, T. He, M. Mahdavi, and V. Narayanan are with Pennsylvania State University, University Park, PA 16802, USA (email: {hzl263, tzh58, mzm616, vxn9}@psu.edu). S. Wang and C. Liu are with IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA (email: {wangshiq@us., Changchang.Liu33@}ibm.com). K. Chan is with Army Research Laboratory, Adelphi, MD 20783, USA (email: [email protected]). S. Pasteris is with University College London, London WC1E 6EA, UK (email: [email protected]).

A preliminary version of this work was presented at ICDCS'20 [1].

This research was partly sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

I. INTRODUCTION
Given a dataset P ⊂ R^d with cardinality n, where both n ≫ 1 and d ≫ 1, consider the problem of finding k points X = {x_i}_{i=1}^k to minimize the following cost function:

cost(P, X) := Σ_{p∈P} min_{x_i∈X} ‖p − x_i‖².   (1)

This is the k-means clustering problem, and the points in X are called centers. Equivalently, the k-means clustering problem can be considered as the problem of finding the partition P = {P_1, …, P_k} of P into k clusters that minimizes the following cost function:

cost(P) := Σ_{i=1}^k min_{x_i∈R^d} Σ_{p∈P_i} ‖p − x_i‖².   (2)

(The norms in (1) and (2) refer to the ℓ-2 norm.)

k-Means clustering is one of the most widely-used machine learning techniques. Algorithms for k-means are used in many areas of data science, e.g., for data compression, quantization, and hashing; see the survey in [2] for more details. Recently, it was shown in [3], [4] that the centers of k-means can be used as a proxy (called a coreset) of the original dataset in computing a broader set of machine learning models with sufficiently continuous cost functions. Thus, efficient and accurate computation of k-means can bring broad benefits to machine learning applications. However, solving k-means is non-trivial. The problem is known to be NP-hard, even for two centers [5] or in the plane [6]. Due to its fundamental importance, how to speed up the k-means computation for large datasets has received significant attention. Existing solutions can be classified into two approaches: dimensionality reduction (DR) techniques that aim at running k-means on a "thinner" dataset with a reduced dimension [7], and cardinality reduction (CR) techniques that aim at running k-means on a "smaller" dataset with a reduced number of points [8]. However, these solutions only considered the computation cost, while implicitly assuming that the data is at the same location where the k-means computation is performed.

To our knowledge, we are the first to explicitly analyze the communication cost in computing k-means over remote (and possibly distributed) data. The need for communications arises in the emerging application scenario of edge-based machine learning [9], where mobile/wireless devices collect the raw data and transmit them to nearby edge servers for processing. Compared to alternative approaches, e.g., transmitting the locally learned model parameters as in federated learning [10], transmitting data (summaries) has the advantage that: (i) only one round of communications is required, (ii) the transmitted data can potentially be used to compute other machine learning models [3], [4], and (iii) the edge server can solve the machine learning problem closer to the optimality more easily than the data-collecting devices (within the same time). (In cases where the raw data are spread over multiple nodes, another round of communications is needed to decide the sizes of the data summaries to collect from each node [11]; however, each node only sends one scalar in this round, and hence the communication cost of this round is negligible.)

In this work, we consider the problem of solving k-means for a large high-dimensional dataset, i.e., n, d ≫ 1 (n: cardinality, d: dimension), residing at a data source (or sources) at the network edge. An obvious solution of solving k-means at the data source and sending the centers to the server will incur a high complexity at the data source, while another obvious solution of sending the data to the server and solving k-means there will incur a high communication cost. We seek to combine the best of the two by letting the data source send a small data summary generated by efficient DR/CR methods and leaving the k-means computation to the server.
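To make the objective in (1) concrete, the following minimal NumPy sketch (ours, not from the paper; the dataset and centers are made up) evaluates the k-means cost of a candidate set of centers:

```python
import numpy as np

def kmeans_cost(P, X):
    """k-means cost of centers X on dataset P, per Eq. (1):
    each point contributes its squared distance to the nearest center."""
    d2 = ((P[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # n x k distances
    return float(d2.min(axis=1).sum())

# toy example: two well-separated blobs and their true means as centers
rng = np.random.default_rng(0)
P = np.vstack([rng.normal(0.0, 0.1, (50, 3)), rng.normal(5.0, 0.1, (50, 3))])
X = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
print(kmeans_cost(P, X))  # small, since X matches the blob means
```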
A. Related Work

Our work belongs to the studies on data reduction for approximate k-means. Most existing solutions can be classified into (i) dimensionality reduction and (ii) cardinality reduction.

Dimensionality reduction (DR) for k-means, initiated by [12], aims at speeding up k-means by reducing the number of features (i.e., the dimension). Two approaches have been proposed: 1) feature selection, which selects a subset of the original features, and 2) feature extraction, which constructs a smaller set of new features. For feature selection, the best known algorithm in [13] achieves a (1 + ε)-approximation using a random sampling algorithm. For feature extraction, there are two methods with guaranteed approximation, based on singular value decomposition (SVD) [13] and random projections [7].

Cardinality reduction (CR) for k-means, initiated by [14], aims at using a small weighted set of points in the same space, referred to as a coreset, to replace the original dataset. A coreset is called an ε-coreset (for k-means) if it can approximate the k-means cost of the original dataset for every candidate set of centers up to a factor of 1 ± ε. Many coreset construction algorithms have been proposed for k-means. Most state-of-the-art coreset construction algorithms are based on the sensitivity sampling framework [15], which needs a coreset cardinality of Õ(kdε⁻²). (We use Õ(x) to denote a value that is at most linear in x times a factor that is polylogarithmic in x.) The best known solution is the one in [8, Theorem 36], which showed that the cardinality of an ε-coreset can be reduced to Õ(k²ε⁻⁴). Note that [8] is the full version of [16], and thus we will refer to [8] instead of [16].

In the distributed setting, [11] proposed a distributed version of sensitivity sampling to construct an ε-coreset over a distributed dataset, and [17] further combined this algorithm with a distributed PCA algorithm from [8]. Besides these theoretical results, there are also works on adapting centralized k-means algorithms for distributed settings, e.g., MapReduce [18], sensor networks [19], and Peer-to-Peer networks [20]. However, these algorithms are only heuristics, while we focus on algorithms with guaranteed approximation errors.

Among the above works, only [8], [17] considered joint DR and CR for k-means. However, they blindly assumed that DR should be applied before CR, leaving open several important questions: 1) Is it possible to achieve the same approximation error at a lower complexity or communication cost? 2) Does the order of applying DR and CR matter? 3) Will repeated DR and CR help? We will address all these questions.

Besides DR and CR, quantization [21] can also reduce the communication cost by representing each data point by a smaller number of bits. Existing quantizers can be classified into scalar quantizers and vector quantizers, where scalar quantizers apply quantization operations to each attribute of each data point, and vector quantizers [22] apply quantization to each data point as a whole. While k-means has been used to design quantizers, we will show that simpler quantizers can be combined with DR/CR methods to compute approximate k-means at a lower cost.
Instead of designing new quantizers, our focus is on how to properly configure an approximate k-means algorithm involving DR, CR, and quantization, which has not been addressed in the literature.

B. Summary of Contributions
Our contribution is four-fold:

1) If the data reside at a single data source, we show that (i) it is possible to solve k-means arbitrarily close to the optimal with a constant communication cost and a near-linear complexity at the data source, (ii) the order of applying DR and CR methods will not affect the approximation error, but will lead to different tradeoffs between the communication cost and the complexity, and (iii) repeating DR both before and after CR can further improve the performance.

2) If the data are distributed over multiple data sources, we show that it is possible to solve k-means arbitrarily close to the optimal with a near-linear complexity at the data sources and a total communication cost that is logarithmic in the data size, achieved by combining a data-oblivious DR method with a state-of-the-art CR method in a particular order.

3) We further extend our solution to include quantization. Using the rounding-based quantizer as a concrete example, we demonstrate how to configure the quantizer to minimize the communication cost while guaranteeing the k-means approximation error.

4) Through experiments on real datasets, we verify that (i) joint DR and CR can effectively reduce the communication cost without incurring a high complexity at the data sources or significantly degrading the solution quality, (ii) the proposed joint DR-CR algorithms can achieve a solution quality similar to state-of-the-art algorithms while significantly reducing the communication cost and the complexity, and (iii) combining DR/CR with quantization can further reduce the communication cost without compromising the other performance metrics.
Roadmap. Section II reviews the background on DR/CR methods. Section III presents our results for the centralized setting. Section IV addresses the distributed setting. Section V evaluates our solutions via experiments based on real datasets. Section VI presents further improvement via joint DR, CR, and quantization. Finally, Section VII concludes the paper.

II. BACKGROUND AND FORMULATION
We start with a brief overview of existing results on DR and CR in support of k-means.

A. Notations
We will use ‖x‖ to denote the ℓ-2 norm if x is a vector, or the Frobenius norm if x is a matrix. We will use A_P ∈ R^{n×d} to denote the matrix representation of a dataset P ⊂ R^d, where each row corresponds to a data point. Let μ(P) denote the optimal 1-means center of P, which is well-known to be the sample mean, i.e., μ(P) = (1/|P|) Σ_{p∈P} p. Let P_{P,X} denote the partition of dataset P induced by centers X, i.e., P_{P,X} = {P_1, …, P_{|X|}} for P_i := {p ∈ P : ‖p − x_i‖ ≤ ‖p − x_j‖, ∀x_j ∈ X \ {x_i}} (ties broken arbitrarily). Given scalars x, y, and ε (ε > 0), we will use x ≈_ε y to denote (1+ε)⁻¹x ≤ y ≤ (1+ε)x. In our analysis, we use O(x) to denote a value that is at most linear in x, Ω(x) to denote a value that is at least linear in x, and Õ(x) to denote a value that is at most linear in x times a factor that is polylogarithmic in x.

Given a dimensionality reduction map π: R^d → R^{d′} (d′ < d), we use π(P) := {π(p) : p ∈ P} to denote the output dataset for an input dataset P, and π(P) := {π(P_1), …, π(P_k)} to denote the partition of π(P) corresponding to a partition P = {P_1, …, P_k} of P. Moreover, given a partition P′ of π(P), we use π⁻¹(P′) to denote the corresponding partition of P, which puts p, q ∈ P into the same cluster if and only if π(p), π(q) belong to the same cluster under P′. Finally, given P′ = π(P), we use π⁻¹(P′) := {π⁻¹(p′) : p′ ∈ P′} to denote a set of points in R^d that is mapped to P′ by π. Note that there is no guarantee that π⁻¹(P′) = P. However, solutions to π(P̃) = P′ must exist (P is a feasible solution), and π⁻¹(P′) denotes an arbitrary such solution. If π is a linear map, i.e., π(P) := A_P Π for a matrix Π ∈ R^{d×d′}, then the Moore-Penrose inverse Π⁺ [23] of Π gives a feasible solution π⁻¹(P′) := A_{P′} Π⁺.
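To make the pre-image notation concrete for a linear map, here is a minimal NumPy sketch (ours, with made-up dimensions; Π is an arbitrary Gaussian matrix) of π(P) = A_P Π and the feasible pre-image given by the Moore-Penrose inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime = 1000, 256, 16

A_P = rng.normal(size=(n, d))                          # dataset as a matrix
Pi = rng.normal(size=(d, d_prime)) / np.sqrt(d_prime)  # linear DR map pi
A_P_proj = A_P @ Pi                                    # pi(P) = A_P Pi

Pi_pinv = np.linalg.pinv(Pi)          # Moore-Penrose inverse Pi^+
A_back = A_P_proj @ Pi_pinv           # one feasible pre-image pi^{-1}(P')

# pi(pi^{-1}(P')) = P' holds, although A_back generally differs from A_P
assert np.allclose(A_back @ Pi, A_P_proj)
```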
B. Dimensionality Reduction for k-Means

Definition II.1. We say that a DR map π: R^d → R^{d′} (d′ < d) is an ε-projection if it preserves the cost of any partition up to a factor of ε, i.e., cost(P) ≈_ε cost(π(P)) for every partition P = {P_1, …, P_k} of a finite set P ⊂ R^d.

One commonly used method to construct an ε-projection is random projection, where the cornerstone result is the JL Lemma:

Lemma II.1 ([24]). There exists a family of random linear maps π: R^d → R^{d′} with the following properties: for every ε, δ ∈ (0, 1/2), there exists d′ = O(log(1/δ)/ε²) such that for every d ≥ 1 and all x ∈ R^d, we have

Pr{‖π(x)‖ ≈_ε ‖x‖} ≥ 1 − δ.

Based on this lemma, the best known result achieved by random projection is the following:
Theorem II.1 ([7]). Consider any family of random linear maps π: R^d → R^{d′} that (i) satisfies Lemma II.1, and (ii) is sub-Gaussian-tailed (i.e., the probability that the norm after mapping exceeds the norm before mapping by a factor of at least t is bounded by e^{−Ω(d′t²)}). Then for every ε, δ ∈ (0, 1/2), there exists d′ = O(ε⁻² log(k/(εδ))), such that π is an ε-projection with probability at least 1 − δ.

There are many known methods to construct a random linear map that satisfies the conditions (i)-(ii) in Theorem II.1, e.g., maps defined by matrices with i.i.d. Gaussian and sub-Gaussian entries [25]-[27]. We will refer to such a random projection as a JL projection.

Remark:
Compared with PCA-based DR methods, a JL projection has the advantage that the projection matrix is data-oblivious, and can hence be pre-generated and distributed, or generated independently by different nodes using a shared random number generation seed, both incurring negligible communication cost at runtime. As is shown later, this can lead to significant savings in the communication cost.
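As an illustration of this remark, the sketch below (ours; the seed value and dimensions are arbitrary) lets a data source and a server regenerate the same Gaussian projection matrix from a shared seed, so the matrix itself never needs to be transmitted:

```python
import numpy as np

def jl_matrix(seed: int, d: int, d_prime: int) -> np.ndarray:
    """Data-oblivious JL projection matrix, reproducible from a seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(d, d_prime)) / np.sqrt(d_prime)

SHARED_SEED = 42  # assumed to be agreed upon offline
Pi_at_source = jl_matrix(SHARED_SEED, d=784, d_prime=32)
Pi_at_server = jl_matrix(SHARED_SEED, d=784, d_prime=32)
assert np.array_equal(Pi_at_source, Pi_at_server)  # zero runtime cost
```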
C. Cardinality Reduction for k-Means

CR methods, also known as coreset construction algorithms, aim at constructing a smaller weighted dataset (coreset) with a bounded approximation error, as follows.
Definition II.2 ([8]). We say that a tuple (S, Δ, w), where S ⊂ R^d, w: S → R, and Δ ∈ R, is an ε-coreset of P ⊂ R^d if it preserves the cost for every set of k centers up to a factor of 1 ± ε, i.e.,

(1 − ε) cost(P, X) ≤ cost(S, X) ≤ (1 + ε) cost(P, X)   (3)

for any X ⊂ R^d with |X| = k, where

cost(S, X) := Σ_{q∈S} w(q) · min_{x_i∈X} ‖q − x_i‖² + Δ   (4)

denotes the k-means cost for a coreset S := (S, Δ, w) and a set of centers X.

We note that the above definition generalizes most of the existing definitions of ε-coreset, which typically ignore Δ.

The best known coreset construction algorithm for k-means was given in [8], which first reduces the intrinsic dimension of the dataset by PCA, and then applies sensitivity sampling to the dimension-reduced dataset to obtain an ε-coreset of the original dataset with a size that is constant in n and d.

Theorem II.2 ([8]). For any ε, δ ∈ (0, 1), with probability at least 1 − δ, an ε-coreset (S, Δ, w) of size |S| = O(k² log k · ε⁻⁴ · log(1/δ)) can be computed in time O(min(nd², n²d) + nkε⁻²(d + k log(1/δ))).

However, [8] only focused on minimizing the cardinality of the coreset, ignoring the cost of transmitting the coreset. As is shown later (Section IV-C), its proposed algorithm can be severely suboptimal in the communication cost.
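For concreteness, the following small function (ours, not from [8]) evaluates the generalized coreset cost (4), and checks it against the trivial exact coreset given by the data itself with unit weights and Δ = 0:

```python
import numpy as np

def coreset_cost(S, w, Delta, X):
    """k-means cost of a coreset (S, Delta, w) for centers X, per Eq. (4)."""
    d2 = ((S[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # |S| x |X|
    return float(w @ d2.min(axis=1) + Delta)

rng = np.random.default_rng(1)
P = rng.normal(size=(100, 5))
X = rng.normal(size=(2, 5))
exact = ((P[:, None, :] - X[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()
assert np.isclose(coreset_cost(P, np.ones(len(P)), 0.0, X), exact)
```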
D. Problem Statement
The motivation of most existing DR/CR methods designed for k-means is to speed up k-means computation in a setting where the node holding the data is also the node computing k-means. In contrast, we want to develop efficient k-means algorithms in scenarios where the data generation and the k-means computation occur at different locations, such as in the case of edge-based learning. We will refer to the node(s) holding the original data as the data source(s), and the node running the k-means computation as the server.

We will perform a holistic performance analysis by the following metrics:

• Approximation error: We say that a set of k-means centers X is an α-approximation (α > 1) for k-means clustering of P if cost(P, X) ≤ α · cost(P, X*), where X* is the optimal set of k-means centers for P.

• Communication cost: We say that an algorithm incurs a communication cost of y if a data source employing the algorithm needs to send y scalars to the server.

• Complexity: We say that an algorithm incurs a complexity of z at the data source if a data source employing the algorithm needs to perform z elementary operations.

III. JOINT DR AND CR AT A SINGLE DATA SOURCE
We will first focus on the scenario where all the data are at a single data source (the centralized setting). We will show that: 1) treating given DR/CR methods as black boxes, fixed qualities of these methods will yield a fixed quality of the k-means solution computed from the reduced dataset, regardless of the order of applying these methods; 2) using suitably selected DR/CR methods and a sufficiently powerful server, it is possible to solve k-means arbitrarily close to the optimal, while incurring a constant communication cost and a near-linear complexity at the data source; 3) repeated application of DR/CR can lead to a better communication-computation tradeoff than applying DR and CR only once.

A. DR+CR
We first consider the approach of applying DR and then CR.
1) A Black-box Approach:
Given DR/CR methods as black boxes, our analysis needs the following lemma to show the relationship between Equation (1) and Equation (2).
Lemma III.1.
The two cost functions defined in (1) and (2) are related as follows: for P = {P_i}_{i=1}^k,

1) cost(P) ≥ cost(P, X) if X = {μ(P_i)}_{i=1}^k;
2) cost(P, X) ≥ cost(P_{P,X}).

Proof. For 1), since x_i := μ(P_i) is the optimal center of P_i,

cost(P) = Σ_{i=1}^k Σ_{p∈P_i} ‖p − x_i‖² = Σ_{p∈P} ‖p − x_{i_p}‖² ≥ Σ_{p∈P} min_{x_i∈X} ‖p − x_i‖² = cost(P, X),

where i_p is the index of the cluster P_i such that p ∈ P_i.

For 2), by the definition of P_{P,X} = {P_1, …, P_k} (k := |X|),

cost(P, X) = Σ_{i=1}^k Σ_{p∈P_i} ‖p − x_i‖² ≥ Σ_{i=1}^k min_{q∈R^d} Σ_{p∈P_i} ‖p − q‖² = cost(P_{P,X}),   (5)

where (5) is by the definition in (2).

Next we show that the approximation error of first applying a DR method π₁, followed by a CR method π₂, is bounded as follows.
Theorem III.1. In Algorithm 1, let X′ be the optimal k-means centers of S′ := π₂(π₁(P)), where π₁ is an ε-projection and π₂ generates an ε-coreset, for ε ∈ (0, 1). Then X := {μ(P_i)}_{i=1}^k, where {P_1, …, P_k} = π₁⁻¹(P_{π₁(P),X′}), is a (1+ε)³/(1−ε)-approximation for k-means clustering of P.

Proof. Let X* denote the optimal set of k-means centers for P and X̃* denote the set of means for each cluster under the partition π₁(P_{P,X*}) (X̃* = π₁(X*) if π₁ is a linear map). Let P′ denote π₁(P) and S′ := (S′, Δ, w) denote π₂(P′). Then

(1+ε) cost(P, X*) = (1+ε) cost(P_{P,X*})   (6)
 ≥ cost(π₁(P_{P,X*}))   (7)
 ≥ cost(P′, X̃*)   (8)
 ≥ (1/(1+ε)) cost(S′, X̃*)   (9)
 ≥ (1/(1+ε)) cost(S′, X′)   (10)
 ≥ ((1−ε)/(1+ε)) cost(P′, X′)   (11)
 ≥ ((1−ε)/(1+ε)) cost(P_{P′,X′})   (12)
 ≥ ((1−ε)/(1+ε)²) cost(π₁⁻¹(P_{P′,X′}))   (13)
 ≥ ((1−ε)/(1+ε)²) cost(P, X),   (14)

where (6) is due to the optimality of X*, (7) is because π₁ is an ε-projection, (8) is by Lemma III.1.1), (9) is because S′ is an ε-coreset of P′, (10) is because X′ is optimal for S′, (11) is because S′ is an ε-coreset of P′, (12) is by Lemma III.1.2), (13) is because π₁ is an ε-projection, and (14) is by Lemma III.1.1). The last inequality (14) gives the desired upper bound on cost(P, X).

(To run kmeans(S′, w, k) on a weighted coreset S′ = (S′, Δ, w), one can ignore Δ and apply a weighted k-means algorithm to minimize Σ_{q∈S′} w(q) · min_{x_i∈X} ‖q − x_i‖², or convert it into an unweighted dataset by duplicating each q ∈ S′ for w(q) times and apply an unweighted k-means algorithm.)

Algorithm 1: k-Means under Generic DR+CR
input: Original dataset P, number of centers k, DR method π₁, CR method π₂
output: Centers for k-means clustering of P
1: P′ ← π₁(P);
2: (S′, Δ, w) ← π₂(P′);
3: X′ ← kmeans(S′, w, k);
4: P′ ← P_{P′,X′};
5: P ← π₁⁻¹(P′)  (P = {P_1, …, P_k});
6: foreach i = 1, …, k do x_i ← μ(P_i);
7: return {x_i}_{i=1}^k;

Discussion: If P resides on a data source and the k-means computation is performed by a server, then we have to transmit the entire dataset P in order to implement Algorithm 1, as its solution is an explicit function of a partition of P. Thus, Algorithm 1 is only useful for reducing the complexity in solving k-means, but is not useful for reducing the communication cost. In contrast, for certain DR/CR methods, it is possible to achieve a similar approximation error at a reduced communication cost, as shown in the following subsections.
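The following sketch (ours) instantiates the Algorithm 1 pipeline with stand-ins for the black boxes: a Gaussian random projection for π₁ and uniform subsampling for π₂ (a real implementation would use an ε-projection and an ε-coreset construction such as FSS). It assumes every recovered cluster is non-empty:

```python
import numpy as np
from sklearn.cluster import KMeans

def dr_cr_kmeans(P, k, d_prime, m, seed=0):
    """Generic DR+CR k-means (Algorithm 1 flavor) with stand-in reductions."""
    rng = np.random.default_rng(seed)
    n, d = P.shape
    Pi = rng.normal(size=(d, d_prime)) / np.sqrt(d_prime)
    P_low = P @ Pi                                            # line 1: pi_1(P)
    S_low = P_low[rng.choice(n, size=min(m, n), replace=False)]  # line 2 stand-in
    X_low = KMeans(n_clusters=k, n_init=10,
                   random_state=seed).fit(S_low).cluster_centers_  # line 3
    # lines 4-6: partition the original points by their low-dim labels,
    # then return the mean of each recovered cluster
    labels = ((P_low[:, None, :] - X_low[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return np.stack([P[labels == i].mean(axis=0) for i in range(k)])
```

Note that the final centers are computed from the original points, which is exactly why Algorithm 1 needs the full partition of P and hence does not save communication.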
2) An Existing DR+CR Algorithm:
The state-of-the-art joint DR and CR algorithm, referred to as FSS after the authors' last names, was implicitly presented in [8, Theorem 36]. The approximation error and the communication cost of FSS (for transmitting its output) were not given in [8]. Thus, we provide them here to facilitate later comparison.
Corollary III.1.1.
Suppose that the data source reports the coreset S := (S, Δ, w) computed by FSS [8] and the server computes the optimal k-means centers X of S. Then:

1) X is a (1+ε)/(1−ε)-approximation for k-means clustering of P with probability ≥ 1 − δ;
2) the communication cost is O(kd/ε²),

assuming min(n, d) ≫ k, 1/ε, and 1/δ.

Proof. For 1), let X* denote the optimal k-means centers of P. Since S is an ε-coreset with probability ≥ 1 − δ (Theorem II.2), by Definition II.2, the following holds with probability ≥ 1 − δ:

cost(P, X) ≤ (1/(1−ε)) cost(S, X) ≤ (1/(1−ε)) cost(S, X*) ≤ ((1+ε)/(1−ε)) cost(P, X*).   (15)

For 2), the cost of transferring (S, Δ, w) is dominated by the cost of transferring S. Since S lies in a d′-dimensional subspace spanned by the columns of V^{(d′)}, it suffices to transmit the coordinates of each point in S in this subspace together with V^{(d′)}. The former incurs a cost of O(|S| · d′), and the latter incurs a cost of O(dd′). Plugging in d′ = O(k/ε²) and |S| = Õ(k²/ε⁴) from Theorem II.2 yields the overall communication cost O(dk/ε²).
3) Communication-efficient DR+CR:
Now the question is: can we further improve the communication cost without hurting the approximation error and the complexity?

Our key observation is that, for FSS, the communication cost linear in d is due to the transmission of the projected subspace. In contrast, JL projections are data-oblivious. Thus, we can circumvent the linear communication cost by employing a JL projection as the DR method. The following is directly implied by the JL Lemma (Lemma II.1).

Lemma III.2.
Let π: R^d → R^{d′} be a JL projection. Then there exists d′ = O(ε⁻² log(nk/δ)) such that for any P ⊂ R^d with |P| = n and X, X* ⊂ R^d with |X| = |X*| = k, the following holds with probability at least 1 − δ:

(1+ε)⁻² cost(P, X) ≤ cost(π(P), π(X)) ≤ (1+ε)² cost(P, X),   (16)
(1+ε)⁻² cost(P, X*) ≤ cost(π(P), π(X*)) ≤ (1+ε)² cost(P, X*).   (17)

Proof. Let δ′ = δ/(2nk). By the JL Lemma (Lemma II.1), there exists d′ = O(ε⁻² log(1/δ′)) = O(ε⁻² log(nk/δ)), such that each fixed x ∈ R^d satisfies ‖π(x)‖ ≈_ε ‖x‖ with probability ≥ 1 − δ′. By the union bound, this implies that with probability ≥ 1 − δ, every p − x_i for p ∈ P and x_i ∈ X ∪ X* satisfies ‖π(p) − π(x_i)‖ ≈_ε ‖p − x_i‖. Therefore, with probability ≥ 1 − δ,

cost(π(P), π(X)) = Σ_{p∈P} min_{x_i∈X} ‖π(p) − π(x_i)‖²   (18)
 ≤ Σ_{p∈P} min_{x_i∈X} (1+ε)² ‖p − x_i‖² = (1+ε)² cost(P, X),   (19)
(18) ≥ Σ_{p∈P} min_{x_i∈X} (1+ε)⁻² ‖p − x_i‖² = (1+ε)⁻² cost(P, X).   (20)

Combining (19) and (20) proves (16). A similar argument proves (17).

Using a JL projection for DR and FSS for CR, we propose Algorithm 2, which differs from Algorithm 1 in that it directly computes the centers in the original space from the optimal centers in the low-dimensional space (line 7). Thus the communication cost is reduced, as only (S′, Δ, w) is transmitted.

Algorithm 2: k-Means under Communication-efficient DR+CR
input: Original dataset P, number of centers k, JL projection π₁, FSS-based CR method π₂
output: Centers for k-means clustering of P
1: data source:
2:   P′ ← π₁(P);
3:   (S′, Δ, w) ← π₂(P′);
4:   report (S′, Δ, w) to the server;
5: server:
6:   X′ ← kmeans(S′, w, k);
7:   X ← π₁⁻¹(X′);
8: return X;

A sketch of the two sides of Algorithm 2 is given below, followed by its performance analysis.
Theorem III.2. For any ε, δ ∈ (0, 1), if in Algorithm 2, π₁ satisfies Lemma III.2, π₂ generates an ε-coreset with probability at least 1 − δ, and kmeans(S′, w, k) returns the optimal k-means centers of the dataset S′ with weights w, then:

1) the output X is a (1+ε)⁵/(1−ε)-approximation for k-means clustering of P with probability at least (1−δ)²,
2) the communication cost is O(k³ε⁻⁶ log n), and
3) the complexity at the data source is Õ(ndε⁻²),

assuming min(n, d) ≫ k, 1/ε, and 1/δ.

Proof. For 1), let X* be the optimal k-means centers of P and S′ := (S′, Δ, w) be generated in line 3 of Algorithm 2. With probability at least (1−δ)², π₁ satisfies (16)-(17) and π₂ generates an ε-coreset. Thus, with probability at least (1−δ)²,

cost(P, X) ≤ (1+ε)² cost(π₁(P), X′)   (21)
 ≤ ((1+ε)²/(1−ε)) cost(S′, X′)   (22)
 ≤ ((1+ε)²/(1−ε)) cost(S′, π₁(X*))   (23)
 ≤ ((1+ε)³/(1−ε)) cost(π₁(P), π₁(X*))   (24)
 ≤ ((1+ε)⁵/(1−ε)) cost(P, X*),   (25)

where (21) is by (16) and the fact that π₁(X) = X′, (22) is because S′ is an ε-coreset of π₁(P), (23) is because X′ minimizes cost(S′, ·), (24) is again because S′ is an ε-coreset of π₁(P), and (25) is by (17).

For 2), the communication cost is dominated by transmitting S′. By Lemma III.2, the dimension of P′ is d′ = O(ε⁻² log(nk/δ)) = O(ε⁻² log n). By Theorem II.2, the cardinality of S′ is |S′| = O(k² log k · ε⁻⁴ · log(1/δ)). Moreover, the points in S′ lie in a d̃-dimensional subspace for d̃ = O(k/ε²). Thus, it suffices to transmit the coordinates of the points in S′ in this d̃-dimensional subspace and a basis of the subspace, so the total communication cost is

O((|S′| + d′) d̃) = O(k³ε⁻⁶ log k · log(1/δ) + kε⁻⁴ log n) = O(k³ε⁻⁶ log n).   (26)

For 3), note that for a given projection matrix Π ∈ R^{d×d′} such that π₁(P) := A_P Π, line 2 takes O(ndd′) = O(ndε⁻² log n) time, where we have plugged in d′ = O(ε⁻² log n). By Theorem II.2, line 3 takes time

O(min(nd′², n²d′) + nkε⁻²(d′ + k log(1/δ))) = O(nε⁻²(ε⁻² log² n + kε⁻² log n + k² log(1/δ))).   (27)

Thus, the total complexity at the data source is

O(nε⁻²(ε⁻² log² n + kε⁻² log n + d log n + k² log(1/δ))) = O(ndε⁻² log n) = Õ(ndε⁻²).   (28)
Remark: We only focus on the complexity at the data source, as the server is usually much more powerful. Theorem III.2 shows that Algorithm 2 can solve k-means arbitrarily close to the optimal with an arbitrarily high probability, while incurring a complexity at the data source that is roughly linear in the data size (i.e., nd) and a communication cost that is roughly logarithmic in the data cardinality (i.e., n).
B. CR+DR

While Algorithm 2 can drastically reduce the communication cost without incurring much computation, it remains unclear whether its order of applying DR and CR is optimal. To this end, we consider the approach of applying CR first.
1) A Black-box Approach:
Now suppose that we first apply a generic CR method π₂ and then apply a generic DR method π₁. It turns out that the quality of the resulting k-means solution is the same as that in Theorem III.1, as stated below; the proof is similar to that of Theorem III.1.

Theorem III.3.
Let X′ be the optimal k-means centers of S′ := (π₁(S), Δ, w) for (S, Δ, w) := π₂(P), where π₁ is an ε-projection and π₂ generates an ε-coreset, for ε ∈ (0, 1). Then X := {Σ_{q∈S_i} w(q)q / Σ_{p∈S_i} w(p)}_{i=1}^k, where {S_1, …, S_k} = π₁⁻¹(P_{π₁(S),X′}), is a (1+ε)³/(1−ε)-approximation for k-means clustering of P.

Proof. Let X* be the set of optimal k-means centers for P. Let S = (S, Δ, w) := π₂(P). As the cost functions (1) and (2) are defined for unweighted datasets, we use a trick in [8] to convert the weighted dataset S into an unweighted dataset S̃ by duplicating each q ∈ S for w(q) times. (Strictly speaking, we need to replace each q ∈ S by ⌈w(q)/δ⌉ identical points with weight δ each; the total weight of these points approximates w(q) as δ → 0.) For any set of centers Q and cost(S, Q) defined in Definition II.2, we have

cost(S, Q) = cost(S̃, Q) + Δ,   (29)

where cost(S̃, Q) is defined as in (1). Similarly,

cost(S′, Q) = cost(π₁(S̃), Q) + Δ.   (30)

We then have

(1+ε) cost(P, X*) ≥ cost(S, X*)   (31)
 ≥ cost(P_{S̃,X*}) + Δ   (32)
 ≥ (1/(1+ε)) (cost(π₁(P_{S̃,X*})) + Δ)   (33)
 ≥ (1/(1+ε)) (cost(P_{π₁(S̃),X′}) + Δ)   (34)
 ≥ (1/(1+ε)²) (cost(π₁⁻¹(P_{π₁(S̃),X′})) + Δ)   (35)
 ≥ (1/(1+ε)²) (cost(S̃, X) + Δ)   (36)
 ≥ ((1−ε)/(1+ε)²) cost(P, X),   (37)

where (31) is because S is an ε-coreset of P, (32) is because of (29) and Lemma III.1.2), (33) is because π₁ is an ε-projection, (34) is because X′ minimizes cost(S′, Q), and by (30) it minimizes cost(π₁(S̃), Q) (and hence cost(P_{π₁(S̃),Q})), (35) is because π₁ is an ε-projection, (36) is by noticing that the X defined in the theorem is the set of means of the clusters in π₁⁻¹(P_{π₁(S̃),X′}) and then applying Lemma III.1.1), and (37) is because of (29) and the fact that S is an ε-coreset of P. The last inequality (37) gives the desired upper bound on cost(P, X).

Algorithm 3: k-Means under Generic CR+DR
input: Original dataset P, number of centers k, DR method π₁, CR method π₂
output: Centers for k-means clustering of P
1: (S, Δ, w) ← π₂(P);
2: S′ ← π₁(S);
3: X′ ← kmeans(S′, w, k);
4: S′ ← P_{S′,X′};
5: S ← π₁⁻¹(S′)  (S = {S_1, …, S_k});
6: foreach i = 1, …, k do x_i ← Σ_{q∈S_i} w(q)q / Σ_{p∈S_i} w(p);
7: return {x_i}_{i=1}^k;

Discussion: If P resides on a data source and k-means is performed by a server, then we only need to transmit the coreset (S, Δ, w) to implement Algorithm 3. Compared to the approach of applying DR first as in Algorithm 1, applying CR first allows us to substantially reduce the communication cost for datasets with large cardinalities. However, as shown below, utilizing suitably selected CR/DR methods can reduce the communication cost even further.
2) Communication-efficient CR+DR:
Recall that applying DR first can reduce the communication cost to be logarithmic in the size of the original dataset (Theorem III.2). Below we will show that when applying CR first, this cost can be reduced to a constant, i.e., independent of both the cardinality n and the dimension d of the original dataset.

We again choose JL projection as the DR method, to avoid transmitting the projection matrix at runtime, and FSS as the CR method, which generates an ε-coreset with the minimum cardinality among the existing CR methods for k-means. The algorithm, shown in Algorithm 4, differs from Algorithm 2 in that the order of applying DR and CR is reversed.

Algorithm 4: k-Means under Communication-efficient CR+DR
input: Original dataset P, number of centers k, JL projection π₁, FSS-based CR method π₂
output: Centers for k-means clustering of P
1: data source:
2:   (S, Δ, w) ← π₂(P);
3:   S′ ← π₁(S);
4:   report (S′, Δ, w) to the server;
5: server:
6:   X′ ← kmeans(S′, w, k);
7:   X ← π₁⁻¹(X′);
8: return X;

We now analyze the performance of Algorithm 4, starting with a counterpart of Lemma III.2.
Lemma III.3.
Let π: R^d → R^{d′} be a JL projection. Then there exists d′ = O(ε⁻² log(n′k/δ)) such that for any coreset S := (S, Δ, w), where S ⊂ R^d with |S| = n′, w: S → R, and Δ ∈ R, and any X, X* ⊂ R^d with |X| = |X*| = k, the following holds with probability at least 1 − δ:

(1+ε)⁻² cost(S, X) ≤ cost((π(S), Δ, w), π(X)) ≤ (1+ε)² cost(S, X),   (38)
(1+ε)⁻² cost(S, X*) ≤ cost((π(S), Δ, w), π(X*)) ≤ (1+ε)² cost(S, X*).   (39)

Proof. The proof is analogous to that of Lemma III.2. Let δ′ = δ/(2n′k). Then there exists d′ = O(ε⁻² log(1/δ′)) = O(ε⁻² log(n′k/δ)), such that each fixed x ∈ R^d satisfies ‖π(x)‖ ≈_ε ‖x‖ with probability ≥ 1 − δ′. By the union bound, this implies that with probability ≥ 1 − δ, every p ∈ S and x_i ∈ X ∪ X* satisfy ‖π(p) − π(x_i)‖ ≈_ε ‖p − x_i‖. Therefore,

cost((π(S), Δ, w), π(X)) = Σ_{p∈S} w(p) · min_{x_i∈X} ‖π(p) − π(x_i)‖² + Δ   (40)
 ≤ (1+ε)² (Σ_{p∈S} w(p) · min_{x_i∈X} ‖p − x_i‖² + Δ) = (1+ε)² cost(S, X),   (41)
(40) ≥ (1+ε)⁻² (Σ_{p∈S} w(p) · min_{x_i∈X} ‖p − x_i‖² + Δ) = (1+ε)⁻² cost(S, X),   (42)

which proves (38). A similar argument proves (39).

Below, we will show that Algorithm 4 achieves the same approximation as Algorithm 2, but at a different communication cost and a different complexity.
Theorem III.4. For any ε, δ ∈ (0, 1), if in Algorithm 4, π₁ satisfies Lemma III.3, π₂ generates an ε-coreset with probability at least 1 − δ, and kmeans(S′, w, k) returns the optimal k-means centers of the dataset S′ with weights w, then:

1) the output X is a (1+ε)⁵/(1−ε)-approximation for k-means clustering of P with probability ≥ (1−δ)²,
2) the communication cost is Õ(k²/ε⁶), and
3) the complexity at the data source is O(nd · min(n, d)),

assuming min(n, d) ≫ k, 1/ε, and 1/δ.

Proof. For 1), let X* be the optimal k-means centers of P and S := (S, Δ, w) the coreset generated in line 2 of Algorithm 4. With probability at least (1−δ)², S is an ε-coreset of P, and π₁ satisfies (38)-(39). Thus, with this probability,

cost(P, X) ≤ (1/(1−ε)) cost(S, X)   (43)
 ≤ ((1+ε)²/(1−ε)) cost((S′, Δ, w), X′)   (44)
 ≤ ((1+ε)²/(1−ε)) cost((S′, Δ, w), π₁(X*))   (45)
 ≤ ((1+ε)⁴/(1−ε)) cost(S, X*)   (46)
 ≤ ((1+ε)⁵/(1−ε)) cost(P, X*),   (47)

where (43) is because S is an ε-coreset of P, (44) is due to (38) (note that π₁(S) = S′ and π₁(X) = X′), (45) is because X′ minimizes cost((S′, Δ, w), ·), (46) is due to (39), and (47) is again because S is an ε-coreset of P.

For 2), note that by Theorem II.2, the cardinality of S is n′ = O(k² log k · ε⁻⁴ · log(1/δ)). By Lemma III.3, the dimension of S′ needs to be d′ = O(ε⁻² log(n′k/δ)). Thus, the cost of transmitting (S′, Δ, w), dominated by the cost of transmitting S′, is

O(n′d′) = O(k² log k · ε⁻⁶ · log(1/δ) · (log k + log(1/ε) + log(1/δ))) = Õ(k²/ε⁶).   (48)

For 3), we know from Theorem II.2 that line 2 of Algorithm 4 takes time O(min(nd², n²d) + nkε⁻²(d + k log(1/δ))). Given a projection matrix Π ∈ R^{d×d′} such that π₁(S) := A_S Π, line 3 takes time O(n′dd′). Thus, the total complexity at the data source is

O(min(nd², n²d) + nkε⁻²(d + k log(1/δ)) + n′dd′) = O(nd · min(n, d)).   (49)
C. Repeated DR/CR

Theorems III.2 and III.4 state that, to achieve the same approximation error with the same probability, DR+CR (Algorithm 2) incurs a communication cost of O(k³ε⁻⁶ log n) and a complexity of Õ(ndε⁻²), while CR+DR (Algorithm 4) incurs a communication cost of Õ(k²ε⁻⁶) and a complexity of O(nd · min(n, d)). This shows a communication-computation tradeoff: the approach of DR+CR incurs a linear complexity and a logarithmic communication cost, whereas the approach of CR+DR incurs a super-linear complexity (which is still less than quadratic) and a constant communication cost. One may wonder whether it is possible to combine the strengths of both algorithms. Below we give an affirmative answer by applying some of these methods repeatedly.

Specifically, we know from Theorem II.2 that applying FSS once already reduces the cardinality to a constant (in n and d), and hence there is no need to repeat FSS. The same theorem also implies that if we apply FSS first, we will incur a super-linear complexity, and hence we need to apply JL projection before FSS. Meanwhile, we see from Lemmas III.2 and III.3 that applying JL projection on a dataset of cardinality n′ can reduce its dimension to O(ε⁻² log(n′k/δ)) while achieving a (1 + O(ε))-approximation with high probability. Thus, we can further reduce the dimension by applying JL projection again after reducing the cardinality by FSS. The above reasoning suggests a three-step procedure: JL → FSS → JL, presented in Algorithm 5, where π₁⁽¹⁾ projects from R^d to R^{O(ε⁻² log n)}, and π₁⁽²⁾ projects from R^{O(ε⁻² log n)} to R^{O(ε⁻² log |S|)}. Note that, by convention, π₁⁽²⁾ ∘ π₁⁽¹⁾(X) means π₁⁽²⁾(π₁⁽¹⁾(X)).

Below, we will show that this seemingly small change is able to combine the low communication cost of Algorithm 4 and the low complexity of Algorithm 2, at a small increase in the approximation error.

Algorithm 5: k-Means under Communication-efficient DR+CR+DR
input: Original dataset P, number of centers k, JL projection π₁⁽¹⁾ for P, FSS-based CR method π₂, JL projection π₁⁽²⁾ for the output of π₂
output: Centers for k-means clustering of P
1: data source:
2:   P′ ← π₁⁽¹⁾(P);
3:   (S, Δ, w) ← π₂(P′);
4:   S′ ← π₁⁽²⁾(S);
5:   report (S′, Δ, w) to the server;
6: server:
7:   X′ ← kmeans(S′, w, k);
8:   X ← (π₁⁽²⁾ ∘ π₁⁽¹⁾)⁻¹(X′);
9: return X;
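A sketch of the data-source side of Algorithm 5 (ours; the coreset step is again a uniform-sampling stand-in for FSS, and all sizes are illustrative):

```python
import numpy as np

def gaussian_map(seed, d_in, d_out):
    rng = np.random.default_rng(seed)
    return rng.normal(size=(d_in, d_out)) / np.sqrt(d_out)

def summarize(P, m, d1, d2, seeds=(0, 1)):
    """JL -> coreset (stand-in) -> JL, reporting (S', Delta, w)."""
    Pi1 = gaussian_map(seeds[0], P.shape[1], d1)
    P_low = P @ Pi1                                   # line 2: first JL
    idx = np.random.default_rng(2).choice(len(P_low), size=m, replace=False)
    S, w = P_low[idx], np.full(m, len(P) / m)         # line 3 stand-in
    Pi2 = gaussian_map(seeds[1], d1, d2)
    return S @ Pi2, 0.0, w                            # lines 4-5

# The server regenerates Pi1 and Pi2 from the shared seeds and maps the
# low-dimensional centers back via np.linalg.pinv(Pi1 @ Pi2) (line 8).
```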
Theorem III.5. For any ε, δ ∈ (0, 1), if in Algorithm 5, π₁⁽¹⁾ satisfies Lemma III.2, π₁⁽²⁾ satisfies Lemma III.3, π₂ generates an ε-coreset of its input dataset with probability at least 1 − δ, and kmeans(S′, w, k) returns the optimal k-means centers of the dataset S′ with weights w, then:

1) the output X is a (1+ε)⁹/(1−ε)-approximation for k-means clustering of P with probability ≥ (1−δ)³,
2) the communication cost is Õ(k²/ε⁶), and
3) the complexity at the data source is Õ(nd/ε²),

assuming min(n, d) ≫ k, 1/ε, and 1/δ.

Proof. Let n′ := |S|, d′ be the dimension after π₁⁽¹⁾, and d′′ be the dimension after π₁⁽²⁾. Let X* be the optimal k-means centers for P.

For 1), note that with probability ≥ (1−δ)³, π₁⁽¹⁾ and π₁⁽²⁾ will each preserve the k-means cost up to a multiplicative factor of (1+ε)², and π₂ will generate an ε-coreset of P′. Thus, with this probability, we have

cost(P, X) ≤ (1+ε)² cost(P′, π₁⁽¹⁾(X))   (50)
 ≤ ((1+ε)²/(1−ε)) cost((S, Δ, w), π₁⁽¹⁾(X))   (51)
 ≤ ((1+ε)⁴/(1−ε)) cost((S′, Δ, w), π₁⁽²⁾ ∘ π₁⁽¹⁾(X))   (52)
 ≤ ((1+ε)⁴/(1−ε)) cost((S′, Δ, w), π₁⁽²⁾ ∘ π₁⁽¹⁾(X*))   (53)
 ≤ ((1+ε)⁶/(1−ε)) cost((S, Δ, w), π₁⁽¹⁾(X*))   (54)
 ≤ ((1+ε)⁷/(1−ε)) cost(P′, π₁⁽¹⁾(X*))   (55)
 ≤ ((1+ε)⁹/(1−ε)) cost(P, X*),   (56)

where (50) is by Lemma III.2, (51) is because (S, Δ, w) is an ε-coreset of P′, (52) is by Lemma III.3, (53) is because π₁⁽²⁾ ∘ π₁⁽¹⁾(X) = X′, which is optimal in minimizing cost((S′, Δ, w), ·), (54) is by Lemma III.3, (55) is because (S, Δ, w) is an ε-coreset of P′, and (56) is by Lemma III.2.

For 2), note that by Theorem II.2, the cardinality of the coreset constructed by FSS is n′ = O(k² log k · ε⁻⁴ · log(1/δ)), which is independent of the dimension of the input dataset. Thus, the communication cost remains the same as that of Algorithm 4, which is Õ(k²/ε⁶).

For 3), note that the first JL projection π₁⁽¹⁾ takes O(ndd′) time, where d′ = O(ε⁻² log n) by Lemma III.2, and the second JL projection π₁⁽²⁾ takes O(n′d′d′′) time, where n′ is as specified above and d′′ = O(ε⁻² log(n′k/δ)) by Lemma III.3. Moreover, from the proof of Theorem III.2, we know that applying FSS after a JL projection takes O(nε⁻²(ε⁻² log² n + kε⁻² log n + k² log(1/δ))) time. Thus, the total complexity at the data source is

O(ndd′ + nε⁻²(ε⁻² log² n + kε⁻² log n + k² log(1/δ)) + n′d′d′′) = O(ndε⁻² log n + nε⁻⁴ log² n) = Õ(nd/ε²).   (57)

Remark: Theorem III.5 implies that Algorithm 5 is essentially "optimal", in the sense that it achieves a (1 + O(ε))-approximation with an arbitrarily high probability, at a near-linear complexity and a constant communication cost at the data source. Thus, no qualitative improvement will be achieved by applying further DR/CR methods.
IV. JOINT DR AND CR ACROSS MULTIPLE DATA SOURCES

Consider the scenario where the entire dataset P is split across m data sources (m ≥ 2). Let P_i denote the dataset at data source i and n_i its cardinality. As shown below, the previous algorithms can be adapted to the distributed setting.

A. Distributed Version of FSS
It turns out that the state-of-the-art distributed DR and CR algorithm, proposed in [17, Algorithm 1], is exactly a distributed version of FSS, referred to as BKLW after the authors' last names. As in FSS, BKLW first uses PCA to reduce the intrinsic dimension of the dataset and then applies sensitivity sampling. However, it uses distributed algorithms to perform these steps.

For distributed PCA, BKLW applies an algorithm disPCA from [8] (formalized in Algorithm 1 in [28]), where:
1) each data source i (i = 1, …, m) computes a local SVD A_{P_i} = U_i Σ_i V_iᵀ, and sends Σ_i^{(t₁)} and V_i^{(t₁)} to the server (Σ_i^{(t₁)} and V_i^{(t₁)} contain the first t₁ columns of Σ_i and V_i, respectively);
2) the server constructs Yᵀ = [Y₁ᵀ, …, Y_mᵀ], with Y_i = Σ_i^{(t₁)} (V_i^{(t₁)})ᵀ, and computes a global SVD Y = UΣVᵀ;
3) the first t₂ columns of V are returned as an approximate solution to the PCA of ∪_{i=1}^m P_i.

For distributed sensitivity sampling, BKLW applies an algorithm disSS from [11] (Algorithm 1 therein), where:
1) each data source i (i = 1, …, m) computes a bicriteria approximation X_i for P_i and reports cost(P_i, X_i);
2) the server allocates a global sample size s to each data source proportionally to its cost, i.e., s_i = s · cost(P_i, X_i) / (Σ_{j=1}^m cost(P_j, X_j));
3) each data source i draws s_i i.i.d. samples S_i from P_i with probability proportional to cost({p}, X_i), and reports S_i ∪ X_i with their weights w: S_i ∪ X_i → R, which are set to match the number of points per cluster;
4) the union of the reported sets (∪_{i=1}^m (S_i ∪ X_i), 0, w) is returned as a coreset of ∪_{i=1}^m P_i.

BKLW first applies disPCA, followed by disSS with s = O(ε⁻⁴(k²/ε² + log(1/δ)) + mk log(mk/δ)), to the dimension-reduced dataset {A_{P_i} V^{(t₂)} (V^{(t₂)})ᵀ}_{i=1}^m to compute a coreset (S, 0, w) at the server. Finally, the server computes the optimal k-means centers X on (S, 0, w) and returns it as an approximation to the optimal k-means centers of ∪_{i=1}^m P_i.

Although a theorem was given in [17] without proof on the performance of BKLW, the result is imprecise and incomplete. Hence, we provide a complete analysis below to facilitate later comparison. We will leverage the following results.

Theorem IV.1 ([28]). For any ε ∈ (0, 1/2), let t₁ = t₂ ≥ k + ⌈k/ε²⌉ − 1 in disPCA and let P̃_i be the projected dataset at data source i (i.e., the set of rows of A_{P_i} V^{(t₂)} (V^{(t₂)})ᵀ). Then there exists a constant Δ ≥ 0 such that for any set X ⊂ R^d with |X| = k,

(1 − ε) cost(P, X) ≤ cost(P̃, X) + Δ ≤ (1 + ε) cost(P, X),   (58)

where P := ∪_{i=1}^m P_i and P̃ := ∪_{i=1}^m P̃_i.

Theorem IV.2 ([11]). For a distributed dataset {P_i}_{i=1}^m with P_i ⊂ R^d and any ε, δ ∈ (0, 1), with probability at least 1 − δ, the output (S, 0, w) of disSS is an ε-coreset of ∪_{i=1}^m P_i of size

|S| = O(ε⁻⁴(kd + log(1/δ)) + mk log(mk/δ)).   (59)

Theorems IV.1 and IV.2 bound the performance of disPCA and disSS, respectively, based on which we have the following results for BKLW.
Theorem IV.3. For any ε ∈ (0, 1/2) and δ ∈ (0, 1), suppose that in BKLW, disPCA satisfies Theorem IV.1 for the input dataset {P_i}_{i=1}^m and disSS satisfies Theorem IV.2 for the input dataset {P̃_i}_{i=1}^m. Then:

1) the output X is a (1+ε)²/(1−ε)²-approximation for k-means clustering of ∪_{i=1}^m P_i with probability ≥ 1 − δ,
2) the total communication cost over all the data sources is O(mkd/ε²), and
3) the complexity at each data source is O(nd · min(n, d)),

assuming min(n, d) ≫ m, k, 1/ε, and 1/δ.

Proof. For 1), let P := ∪_{i=1}^m P_i, P̃ := ∪_{i=1}^m P̃_i, and S := (S, 0, w) be the output of disSS. Let X* be the optimal k-means centers of P. By Theorem IV.2, we know that with probability ≥ 1 − δ,

cost(P̃, X) ≤ (1/(1−ε)) cost(S, X) ≤ (1/(1−ε)) cost(S, X*) ≤ ((1+ε)/(1−ε)) cost(P̃, X*),   (60)

where the second inequality is because X is optimal for S. Moreover, by Theorem IV.1, we have

(1 − ε) cost(P, X) − Δ ≤ cost(P̃, X),   (61)
cost(P̃, X*) ≤ (1 + ε) cost(P, X*) − Δ.   (62)

Combining (60), (61), and (62) yields

(1 − ε) cost(P, X) − Δ ≤ ((1+ε)/(1−ε)) · ((1 + ε) cost(P, X*) − Δ) ≤ ((1+ε)²/(1−ε)) cost(P, X*) − Δ,   (63)

which gives the desired approximation factor.

For 2), note that disPCA incurs a cost of O(m · (k/ε²) · d) for transmitting O(k/ε²) vectors in R^d from each of the m data sources, and disSS incurs a cost of O((k/ε²) · (ε⁻⁴(k²/ε² + log(1/δ)) + mk log(mk/δ))) for transmitting O(ε⁻⁴(k²/ε² + log(1/δ)) + mk log(mk/δ)) vectors in R^{O(k/ε²)}. For d ≫ m, k, 1/ε, and 1/δ, the total communication cost is dominated by the cost of disPCA.

For 3), as computing the local SVD at data source i takes O(n_i d · min(n_i, d)) time, the complexity of disPCA at the data sources is O(nd · min(n, d)). The complexity of disSS at data source i is dominated by the computation of the bicriteria approximation of P̃_i, which takes O(n_i t₁ k log(1/δ)) = O(nk²ε⁻² log(1/δ)) time according to [29]. For min(n, d) ≫ m, k, 1/ε, and 1/δ, the overall complexity is dominated by that of disPCA.
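For illustration, here is a minimal sketch (ours, not the implementation of [17] or [28]) of the two steps of disPCA:

```python
import numpy as np

def dispca_local(A_Pi, t):
    """Local step at one data source: thin SVD; only the t leading singular
    values and right singular vectors are sent to the server."""
    _, s, Vt = np.linalg.svd(A_Pi, full_matrices=False)
    return s[:t], Vt[:t]                    # Sigma_i^{(t1)}, (V_i^{(t1)})^T

def dispca_server(messages, t):
    """Server step: stack Y_i = Sigma_i^{(t1)} (V_i^{(t1)})^T, global SVD."""
    Y = np.vstack([np.diag(s) @ Vt for s, Vt in messages])
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:t].T                         # first t columns of V
```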
B. Enhancements

It is easy to see that each data source can apply a JL projection independently at no additional communication cost at runtime. Following the ideas in Algorithms 2 and 4, we wonder: (i) Can we improve BKLW by combining it with JL projection? (ii) Is there an optimal order of applying BKLW and JL projection?

To this end, we first consider applying JL projection before invoking BKLW. For consistency with Algorithm 2, we only use the first two steps of BKLW, i.e., disPCA and disSS, which construct a coreset; we refer to this as a BKLW-based CR method. The algorithm, shown in Algorithm 6, is essentially the distributed counterpart of Algorithm 2.

We now analyze the performance of Algorithm 6, starting from a coreset-like property of π₂.
Lemma IV.1. Let P := ∪_{i=1}^m P_i be the union of the input datasets for the BKLW-based CR method π₂ and S := (S, 0, w) be the resulting coreset reported to the server. For any ε ∈ (0, 1/2) and δ ∈ (0, 1), there exist t₁ = t₂ = O(k/ε²), s = O(ε⁻⁴(k²/ε² + log(1/δ)) + mk log(mk/δ)), and Δ ≥ 0, such that with probability at least 1 − δ, π₂ with parameters t₁, t₂, and s satisfies

(1 − ε)² cost(P, X) ≤ cost(S, X) + Δ ≤ (1 + ε)² cost(P, X)   (64)

for any set X of k points in the same space as P.

Proof. Let P̃ be the projection of P using the principal components computed by disPCA. Then by Theorem IV.1, there exists Δ ≥ 0 such that

(1 − ε) cost(P, X) ≤ cost(P̃, X) + Δ ≤ (1 + ε) cost(P, X).   (65)

Moreover, by Theorem IV.2, S is an ε-coreset of P̃ with probability at least 1 − δ. Multiplying (65) by 1 − ε, we have

(1 − ε)² cost(P, X) ≤ (1 − ε) cost(P̃, X) + (1 − ε)Δ   (66)
 ≤ cost(S, X) + Δ,   (67)

where we can obtain (67) from (66) because S is an ε-coreset of P̃. Similarly, multiplying (65) by 1 + ε, we have

(1 + ε)² cost(P, X) ≥ (1 + ε) cost(P̃, X) + (1 + ε)Δ ≥ cost(S, X) + Δ.   (68)

Combining (67) and (68) yields the desired bound.

Algorithm 6: Distributed k-Means under DR+CR
input: Distributed dataset {P_i}_{i=1}^m, number of centers k, JL projection π₁, BKLW-based CR method π₂
output: Centers for k-means clustering of P
1: each data source i (i = 1, …, m):
2:   P′_i ← π₁(P_i);
3: run π₂ on the distributed dataset {P′_i}_{i=1}^m, which results in each data source i reporting a local coreset (S′_i, 0, w) to the server;
4: server:
5:   X′ ← kmeans(∪_{i=1}^m S′_i, w, k);
6:   X ← π₁⁻¹(X′);
7: return X;
Remark: Comparing Lemma IV.1 with Definition II.2, we see that π₂ does not construct an O(ε)-coreset of its input dataset. Nevertheless, its output can approximate the k-means cost of the input dataset up to a constant shift, which is sufficient for computing approximate k-means.
Theorem IV.4. For any ε ∈ (0, 1/2) and δ ∈ (0, 1), suppose that in Algorithm 6, π₁ satisfies Lemma III.2, π₂ satisfies Lemma IV.1, and kmeans(∪_{i=1}^m S′_i, w, k) returns the optimal k-means centers of the dataset ∪_{i=1}^m S′_i with weights w. Then:

1) the output X is a (1+ε)⁶/(1−ε)²-approximation for k-means clustering of ∪_{i=1}^m P_i with probability ≥ (1−δ)²,
2) the total communication cost over all the data sources is O(mkε⁻⁴ log n), and
3) the complexity at each data source is Õ(ndε⁻²),

assuming min(n, d) ≫ m, k, 1/ε, and 1/δ.

Proof. For 1), let S′ := (∪_{i=1}^m S′_i, Δ, w), where (∪_{i=1}^m S′_i, 0, w) is the overall coreset constructed by line 3 of Algorithm 6, and Δ is a constant satisfying Lemma IV.1 for the input dataset {P′_i}_{i=1}^m in line 3 of Algorithm 6. Let P := ∪_{i=1}^m P_i, and let X* be the optimal k-means centers for P. Then with probability ≥ (1−δ)², we have

cost(P, X) ≤ (1+ε)² cost(π₁(P), X′)   (69)
 ≤ ((1+ε)²/(1−ε)²) cost(S′, X′)   (70)
 ≤ ((1+ε)²/(1−ε)²) cost(S′, π₁(X*))   (71)
 ≤ ((1+ε)⁴/(1−ε)²) cost(π₁(P), π₁(X*))   (72)
 ≤ ((1+ε)⁶/(1−ε)²) cost(P, X*),   (73)

where (69) is by Lemma III.2 (note that π₁(X) = X′), (70) is by Lemma IV.1 (note that cost(S′, X′) = cost((∪_{i=1}^m S′_i, 0, w), X′) + Δ), (71) is because X′ is optimal in minimizing cost(S′, ·), (72) is again by Lemma IV.1, and (73) is again by Lemma III.2.

For 2), only line 3 incurs communication cost. By Theorem IV.3, we know that applying BKLW to a distributed dataset {P′_i}_{i=1}^m with dimension d′ incurs a cost of O(mkd′/ε²), and by Lemma III.2, we know that d′ = O(ε⁻² log n), which yields the desired result.

For 3), the JL projection at each data source incurs a complexity of O(ndd′) = O(ndε⁻² log n). By Theorem IV.3, applying BKLW incurs a complexity of O(nd′ · min(n, d′)) = O(nε⁻⁴ log² n) at each data source. Together, the complexity is O(ndε⁻² log n + nε⁻⁴ log² n) = Õ(nd/ε²).
Discussion: Comparing Theorems IV.4 and IV.3, we see that for d ≫ log n (e.g., d = Ω(n)), Algorithm 6 can significantly reduce the communication cost and the complexity at the data sources, while achieving a similar (1 + O(ε))-approximation as BKLW. Note that although the possibility of applying another DR method before BKLW was mentioned in [17], no result was given there.

Meanwhile, although one could develop a distributed counterpart of Algorithm 4 that applies JL projection after BKLW, its performance would not be competitive. Specifically, by a similar analysis, this approach incurs the same order of communication cost and complexity as BKLW, while the JL projection introduces additional error, causing its overall approximation error to be larger. It is thus unnecessary to consider this algorithm.

Furthermore, we note that repeated application of DR/CR is unnecessary in the distributed setting. This is because after one round of BKLW (with or without applying JL projection beforehand), we already reduce the cardinality to O(ε⁻⁴(k²/ε² + log(1/δ)) + mk log(mk/δ)) and the dimension to O(k/ε²), both constant in the size (n, d) of the original dataset. Meanwhile, this round incurs a communication cost that scales with (n, d) as O(log n) (with JL projection) or O(d) (without JL projection), and a complexity that scales as Õ(nd) (with JL projection) or O(nd · min(n, d)) (without JL projection). Therefore, any possible reduction in the cost (or the complexity) achieved by further reducing the cardinality or dimension will be dominated by the cost (or the complexity) of the first round. Hence, repeated application of DR/CR will not improve the order of the communication cost or the complexity.
C. Summary of Comparison

We are now ready to compare the performances of all the proposed algorithms and their best existing counterparts in both the centralized (i.e., single data source) and the distributed (i.e., multiple data sources) settings.

To ensure the same approximation error for all the algorithms, we set the error parameter 'ε' to ε₁ for Algorithms 2 and 4, ε₂ for FSS, ε₃ for Algorithm 5, ε₄ for BKLW, and ε₅ for Algorithm 6, where for any ε ∈ (0, 1): ε₁ satisfies (1+ε₁)⁵/(1−ε₁) = 1+ε, ε₂ satisfies (1+ε₂)/(1−ε₂) = 1+ε, ε₃ satisfies (1+ε₃)⁹/(1−ε₃) = 1+ε, ε₄ satisfies (1+ε₄)²/(1−ε₄)² = 1+ε, and ε₅ satisfies (1+ε₅)⁶/(1−ε₅)² = 1+ε. (A numerical sketch of this configuration is given after Table I.)

The comparison, summarized in Table I, is in terms of the communication cost and the complexity at the data source(s) for achieving a (1+ε)-approximation for k-means clustering of an input dataset of cardinality n and dimension d, where the first four rows are for the centralized setting and the last two rows are for the distributed setting. Clearly, for high-dimensional datasets satisfying d ≫ log n, the best proposed algorithm significantly outperforms the best existing algorithm in both the centralized and the distributed settings.
TABLE I
SUMMARY OF COMPARISON

Algorithm            Communication cost         Computation complexity
FSS [8]              $O(kd/\epsilon^2)$         $O(nd\cdot\min(n,d))$
JL+FSS (Alg. 2)      $O(k\log n/\epsilon^4)$    $\tilde{O}(nd/\epsilon^2)$
FSS+JL (Alg. 4)      $\tilde{O}(k/\epsilon^4)$  $O(nd\cdot\min(n,d))$
JL+FSS+JL (Alg. 5)   $\tilde{O}(k/\epsilon^4)$  $\tilde{O}(nd/\epsilon^2)$
BKLW [17]            $O(mkd/\epsilon^2)$        $O(nd\cdot\min(n,d))$
JL+BKLW (Alg. 6)     $O(mk\log n/\epsilon^4)$   $\tilde{O}(nd/\epsilon^2)$

V. PERFORMANCE EVALUATION
In this section, we use experiments on real datasets to validate our analysis of the proposed joint DR and CR algorithms, in comparison with the existing algorithms, in both the single-source and multiple-source cases.
A. Datasets and Metrics
We use two datasets in our experiments: (1) the MNIST training dataset [30], a handwritten-digit dataset with 60,000 images in 784-dimensional space; (2) the NeurIPS Conference Papers 1987-2015 dataset [31], a word-count dataset of the NeurIPS conference papers published from 1987 to 2015, with 11,463 instances (words) and 5,811 attributes (papers). Both datasets are normalized to $[-1, 1]$ with zero mean. In the case of multiple data sources, we randomly partition each dataset among the data sources.

We measure the performance by (i) the approximation error, measured by the normalized $k$-means cost $\mathrm{cost}(P, X)/\mathrm{cost}(P, X^*)$, where $X$ is the set of centers returned by the evaluated algorithm and $X^*$ is the set of centers computed from $P$, (ii) the normalized communication cost, measured by the ratio between the number of scalars transmitted by the data source(s) and the size of $P$, and (iii) the complexity at the data source(s), measured by the running time of the evaluated DR/CR algorithm. We set $k = 2$ in all the experiments. Because of the randomness of the algorithms, we repeat each test for multiple Monte Carlo runs.
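For reference, a minimal sketch of how the normalized $k$-means cost can be computed (the helper names are ours; since exact $k$-means is NP-hard, sklearn's well-seeded KMeans is assumed as a stand-in solver for the baseline centers $X^*$):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(P, centers):
    # cost(P, X): sum of squared distances to the nearest center, as in (1)
    d2 = np.stack([((P - c) ** 2).sum(axis=1) for c in centers])
    return d2.min(axis=0).sum()

def normalized_cost(P, X, k=2, seed=0):
    # Baseline centers stand in for the (NP-hard) optimum X*.
    ref = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(P)
    return kmeans_cost(P, X) / kmeans_cost(P, ref.cluster_centers_)
```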
B. Evaluated Algorithms
In the case of a single data source, we evaluate the following algorithms:
1) “FSS”: the benchmark algorithm introduced in [8];
2) “JL+FSS”: Algorithm 2, where we use the JL projection before applying FSS;
3) “FSS+JL”: Algorithm 4, where we use the JL projection after applying FSS; and
4) “JL+FSS+JL”: Algorithm 5, where we apply the JL projection both before and after FSS.

In the case of multiple data sources, we evaluate the following algorithms:
1) “BKLW”: the benchmark algorithm from [17]; and
2) “JL+BKLW”: Algorithm 6, where we apply the JL projection before BKLW.

In both cases, we have tuned the parameters of both the benchmark and the proposed algorithms so that all the algorithms achieve a similar empirical approximation error. As a baseline, we also include the naive method of “no reduction (NR)”, i.e., transmitting the raw data.
C. Results

1) Single data source:
The results for the case of a single data source (which computes and transmits a data summary to a remote server) are given in Figure 1 and Table II. Note that by definition, the baseline (NR) has a normalized $k$-means cost of 1, a normalized communication cost of 1, and no computation at the data source. We observe the following: (i) Compared to the naive method of transmitting the raw data (NR), the proposed algorithms can dramatically reduce the communication cost (by more than 99%; see Table II) with only a moderate increase in the $k$-means cost. (ii) Compared to the benchmark (FSS), the proposed algorithms can achieve a similar or smaller $k$-means cost while significantly reducing the communication cost and/or the complexity. (iii) Between the approaches of DR+CR (JL+FSS) and CR+DR (FSS+JL), we see that the DR+CR approach yields a better performance for the NeurIPS dataset, where JL+FSS has a substantially smaller running time than FSS+JL but a similar $k$-means cost and communication cost. This is because $\log n \ll \min(n, d)$ for this dataset, allowing JL+FSS to significantly reduce the complexity compared with FSS+JL without blowing up the communication cost. (iv) Repeated application of DR/CR (JL+FSS+JL) can further reduce the communication cost for high-dimensional datasets (e.g., NeurIPS) while obtaining comparable performance in the other metrics.
TABLE II
SINGLE-SOURCE CASE: NORMALIZED COMMUNICATION COST

Dataset   NR   FSS       JL+FSS    FSS+JL    JL+FSS+JL
MNIST     1    8.95e-3   5.82e-3   5.82e-3   5.97e-3
NeurIPS   1    5.87e-3   3.60e-3   3.59e-3   2.84e-3
Fig. 1. Single-source case: normalized k-means cost and running time. (a) MNIST; (b) NeurIPS.

Fig. 2. Multiple-source case: normalized k-means cost and running time. (a) MNIST; (b) NeurIPS.
2) Multiple data sources:
Figure 2 and Table III show the results for the case of multiple data sources. We see that for both datasets, the proposed algorithm (JL+BKLW) achieves a $k$-means cost comparable to the benchmark (BKLW), while incurring a lower complexity and a lower communication cost. This demonstrates the benefit of suitably combining the JL projection with other data reduction methods; recall that applying the JL projection after BKLW will not reduce the communication cost or the complexity, as explained after Theorem IV.4.

D. Summary of Observations
Our experimental results verified the following:
• Solving k-means based on data summaries generated by DR/CR methods can effectively reduce the communication cost without incurring a high complexity at the data sources, while providing a solution of reasonable quality.
• A suitable combination of DR and CR methods can further improve the communication cost and the complexity while maintaining a similar solution quality.

VI. EXTENSION TO JOINT DR, CR, AND QUANTIZATION
TABLE III
MULTIPLE-SOURCE CASE: NORMALIZED COMMUNICATION COST

Dataset   NR   BKLW      JL+BKLW
MNIST     1    1.97e-2   1.69e-2
NeurIPS   1    1.28e-2   1.05e-2

Besides cardinality and dimensionality, the volume of a dataset also depends on its precision, defined as the number of bits used to represent each attribute in the dataset. While DR and CR methods can be used to reduce the dimensionality and the cardinality, quantization techniques [21] can be used to reduce the precision and hence further reduce the communication cost. While the optimal quantization in support of k-means is worth a separate study, our focus here is on how to properly combine a given quantizer with the proposed DR/CR methods. To this end, we will use a simple rounding-based quantizer as a concrete example.

A. Rounding-based Quantization
Given a scalar $x \in \mathbb{R}$, we denote the $b$-bit binary floating-point representation of $x$ by
\[
x = (-1)^{\mathrm{sign}(x)} \times 2^{e_x} \times \left(a(0) + a(1)\times 2^{-1} + \ldots + a(b-1-m_e)\times 2^{-(b-1-m_e)}\right), \tag{74}
\]
where $\mathrm{sign}(x) = 0$ if $x \geq 0$ and $\mathrm{sign}(x) = 1$ if $x < 0$, $m_e$ is the number of exponent bits, $e_x$ is the $m_e$-bit exponent of $x$, and $a(\cdot) \in \{0,1\}$ are the significant bits ($a(0) \equiv 1$). The rounding-based quantizer $\Gamma$ with $s$ significant bits is defined as
\[
\Gamma(x) := (-1)^{\mathrm{sign}(x)} \times 2^{e_x} \times \left(a(0) + a(1)\times 2^{-1} + \ldots + a(s)\times 2^{-s} + a'(s)\times 2^{-s}\right), \tag{75}
\]
where $a'(s)$ is the result of rounding the remaining bits (0: rounding down; 1: rounding up). For simplicity of notation, we also use $p' := \Gamma(p)$ to denote the element-wise rounding-based quantization of a data point $p = (p_i)_{i=1}^d \in \mathbb{R}^d$. Defining the maximum quantization error as $\Delta_{QT} := \max_{p \in P} \|p - p'\|$, we know that by the definition of the rounding-based quantizer, the quantization error in each element satisfies $|p_i - p'_i| \leq 2^{e_{p_i}} \cdot 2^{-s} \leq |p_i| \cdot 2^{-s}$, since $|p_i| \geq 2^{e_{p_i}}$. Therefore, the maximum quantization error is bounded as
\[
\Delta_{QT} = \max_{p\in P}\sqrt{\sum_{i=1}^d (p_i - p'_i)^2} \leq \max_{p\in P}\sqrt{\sum_{i=1}^d 2^{-2s} p_i^2} = 2^{-s}\max_{p\in P}\|p\|. \tag{76}
\]
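To make (74)–(76) concrete, the following is a minimal numpy sketch of a rounding-based quantizer. Note that numpy's frexp normalizes the mantissa to $[0.5, 1)$ rather than $[1, 2)$ as in (74), so the bookkeeping differs by one bit position, but the per-element guarantee $|x - \Gamma(x)| \leq 2^{-s}|x|$ behind (76) still holds; the function name and vectorized form are our own:

```python
import numpy as np

def rounding_quantizer(x, s):
    """Keep roughly s significant bits of each entry of x (element-wise),
    rounding the remaining bits to the nearest value, as in (75)."""
    # Decompose x = mantissa * 2**exponent with |mantissa| in [0.5, 1).
    mantissa, exponent = np.frexp(np.asarray(x, dtype=float))
    step = 2.0 ** (s + 1)  # grid of spacing 2**-(s+1) on the mantissa
    return np.ldexp(np.round(mantissa * step) / step, exponent)

# Example: quantize a dataset P (n x d) and bound the error via (76):
# P_qt = rounding_quantizer(P, s=8)
# delta_qt <= 2.0**(-8) * np.linalg.norm(P, axis=1).max()
```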
B. Approximation Error Analysis

We now analyze the $k$-means approximation error after adding quantization to the proposed communication-efficient $k$-means algorithms. As DR and CR can generate data points of arbitrary values that may not be representable with $s$ significant bits, we add quantization after all the DR/CR steps. That is, right before a data source reports its dimensionality-reduced coreset $(S, \Delta, w)$ to the server, it applies the rounding-based quantizer $\Gamma$ and reports $(S_{QT}, \Delta, w)$ instead, where $S_{QT} := \{\Gamma(p) : p \in S\}$. (Here we only apply quantization to the coreset points in $S$, as their transfer dominates the communication cost, but our approach can be extended to other cases.) The results are summarized in the following theorem:

Theorem VI.1.
Let $X$ denote the optimal $k$-means centers computed by the server based on the received coreset, $X^*$ denote the optimal $k$-means centers based on the original dataset, and $\Delta_D$ denote the diameter of the input space.

1) In Algorithm 2, suppose that $\pi_1$ satisfies Lemma III.2 with $\epsilon_1$, $\pi_2$ generates an $\epsilon_2$-coreset with probability $\geq 1-\delta_2$, and $\pi_{QT}$ is a quantizer with maximum error $\Delta_{QT}$. If we update Line 4 to: $S'_{QT} \leftarrow \pi_{QT}(S')$ and report $(S'_{QT}, \Delta, w)$ to the server, then the approximation error will be
\[
\mathrm{cost}(P,X) \leq \frac{(1+\epsilon_1)^2(1+\epsilon_2)}{1-\epsilon_2}\,\mathrm{cost}(P,X^*) + \frac{1+\epsilon_1}{1-\epsilon_2}\cdot 4n\Delta_D\Delta_{QT} \tag{77}
\]
with probability at least $1-\delta_2$.

2) In Algorithm 4, suppose that $\pi_1$ satisfies Lemma III.3 with $\epsilon_1$, $\pi_2$ generates an $\epsilon_2$-coreset with probability $\geq 1-\delta_2$, and $\pi_{QT}$ is a quantizer with maximum error $\Delta_{QT}$. If we update Line 4 to: $S'_{QT} \leftarrow \pi_{QT}(S')$ and report $(S'_{QT}, \Delta, w)$ to the server, then the approximation error will be
\[
\mathrm{cost}(P,X) \leq \frac{(1+\epsilon_1)^2(1+\epsilon_2)}{1-\epsilon_2}\,\mathrm{cost}(P,X^*) + \frac{1+\epsilon_1}{1-\epsilon_2}\cdot 4n\Delta_D\Delta_{QT} \tag{78}
\]
with probability at least $1-\delta_2$.

3) In Algorithm 5, suppose that $\pi_1^{(1)}$ satisfies Lemma III.2 with $\epsilon_1^{(1)}$, $\pi_1^{(2)}$ satisfies Lemma III.3 with $\epsilon_1^{(2)}$, $\pi_2$ generates an $\epsilon_2$-coreset with probability $\geq 1-\delta_2$, and $\pi_{QT}$ is a quantizer with maximum error $\Delta_{QT}$. If we update Line 5 to: $S'_{QT} \leftarrow \pi_{QT}(S')$ and report $(S'_{QT}, \Delta, w)$ to the server, then the approximation error will be
\[
\mathrm{cost}(P,X) \leq \frac{(1+\epsilon_1^{(1)})^2(1+\epsilon_2)(1+\epsilon_1^{(2)})^2}{1-\epsilon_2}\,\mathrm{cost}(P,X^*) + \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\cdot 4n\Delta_D\Delta_{QT} \tag{79}
\]
with probability at least $1-\delta_2$.

4) In Algorithm 6, suppose that $\pi_1$ satisfies Lemma III.2, $\pi_2$ satisfies Lemma IV.1, and $\pi_{QT}$ is a quantizer with maximum error $\Delta_{QT}$. If we update Line 3 to: run $\pi_2$ on the distributed dataset $\{P'_i\}_{i=1}^m$ to compute a local coreset $(S'_i, 0, w)$ at each data source $i$ and report $(\pi_{QT}(S'_i), 0, w)$ to the server, then the approximation error will be
\[
\mathrm{cost}(P,X) \leq \frac{(1+\epsilon)^3}{1-\epsilon}\,\mathrm{cost}(P,X^*) + \frac{1+\epsilon}{1-\epsilon}\cdot 4n\Delta_D\Delta_{QT} \tag{80}
\]
with probability at least $1-\delta$.

Proof. We only present the proof for Algorithm 5 with the incorporation of quantization, as the proofs for the other algorithms are similar. Consider a coreset $(S, \Delta, w)$ and a set of $k$-means centers $X$. If we quantize $S$ into $S_{QT}$ with a maximum quantization error of $\Delta_{QT}$, then for each coreset point $q \in S$ and its quantized version $q' \in S_{QT}$, we have $\|q - q'\| \leq \Delta_{QT}$. On the other hand, from [4], the $k$-means cost function is $2\Delta_D$-Lipschitz-continuous, which yields $|\mathrm{cost}(q, X) - \mathrm{cost}(q', X)| \leq 2\Delta_D\Delta_{QT}$.
Thus, the difference in the $k$-means cost between the original and the quantized coresets is bounded by
\[
|\mathrm{cost}((S,\Delta,w), X) - \mathrm{cost}((S_{QT},\Delta,w), X)| \leq 2\Delta_D\Delta_{QT}\sum_{q\in S} w(q), \tag{81}
\]
as $\mathrm{cost}((S,\Delta,w), X) = \sum_{q\in S} w(q)\,\mathrm{cost}(q, X) + \Delta$.

Following the arguments in the proof of Theorem III.5, we see that with probability $\geq 1-\delta_2$:
\begin{align}
\mathrm{cost}(P,X) &\leq (1+\epsilon_1^{(1)})\,\mathrm{cost}(P', \pi_1^{(1)}(X)) \tag{82}\\
&\leq \frac{1+\epsilon_1^{(1)}}{1-\epsilon_2}\,\mathrm{cost}((S,\Delta,w), \pi_1^{(1)}(X)) \tag{83}\\
&\leq \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\,\mathrm{cost}((S',\Delta,w), \pi_1^{(2)}\circ\pi_1^{(1)}(X)) \tag{84}\\
&\leq \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\left(\mathrm{cost}((S'_{QT},\Delta,w), \pi_1^{(2)}\circ\pi_1^{(1)}(X)) + 2n\Delta_D\Delta_{QT}\right) \tag{85}\\
&\leq \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\left(\mathrm{cost}((S'_{QT},\Delta,w), \pi_1^{(2)}\circ\pi_1^{(1)}(X^*)) + 2n\Delta_D\Delta_{QT}\right) \tag{86}\\
&\leq \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\left(\mathrm{cost}((S',\Delta,w), \pi_1^{(2)}\circ\pi_1^{(1)}(X^*)) + 4n\Delta_D\Delta_{QT}\right) \tag{87}\\
&\leq \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})^2}{1-\epsilon_2}\,\mathrm{cost}((S,\Delta,w), \pi_1^{(1)}(X^*)) + \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\cdot 4n\Delta_D\Delta_{QT} \tag{88}\\
&\leq \frac{(1+\epsilon_1^{(1)})(1+\epsilon_2)(1+\epsilon_1^{(2)})^2}{1-\epsilon_2}\,\mathrm{cost}(P', \pi_1^{(1)}(X^*)) + \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\cdot 4n\Delta_D\Delta_{QT} \tag{89}\\
&\leq \frac{(1+\epsilon_1^{(1)})^2(1+\epsilon_2)(1+\epsilon_1^{(2)})^2}{1-\epsilon_2}\,\mathrm{cost}(P,X^*) + \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\cdot 4n\Delta_D\Delta_{QT}, \tag{90}
\end{align}
where (85) and (87) are by (81) and the property that the coreset $(S,\Delta,w)$ constructed by sensitivity sampling satisfies $\sum_{q\in S} w(q) = n$ (the cardinality of $P$). (While the sensitivity sampling procedure in [8] only guarantees that $\mathbb{E}[\sum_{q\in S} w(q)] = n$, where the expectation is over $S$, a variation of this procedure proposed in [11] guarantees $\sum_{q\in S} w(q) = n$ deterministically; FSS based on the sampling procedure in [11] still generates an $\epsilon_2$-coreset with probability $\geq 1-\delta_2$ and a constant cardinality.)
C. Configuration of Joint DR, CR, and Quantization

Based on the analysis of the approximation error under given DR, CR, and QT (quantization) methods, we aim to answer the following question: how can we configure the DR, CR, and QT methods to minimize the communication cost while keeping the approximation error within a given bound? We will present detailed results for the four-step procedure JL+FSS+JL+QT, as the results for the other procedures are similar.
1) Problem Formulation:
Let $Y$ denote a desired bound on the approximation error and $1-\delta$ the desired confidence level, i.e., $\mathrm{cost}(P,X) \leq Y\,\mathrm{cost}(P,X^*)$ with probability $\geq 1-\delta$, where $X$ is the computed $k$-means solution and $X^*$ the optimal solution. By Theorem VI.1, the QT step introduces an additive error. We now convert it into a multiplicative error to enforce the bound $Y$. To this end, suppose that we are given a lower bound $E$ on the optimal $k$-means cost $\mathrm{cost}(P,X^*)$. For example, by [32], we can estimate $E$ by selecting $O(k)$ points from $P$ according to a certain probability distribution, repeating this process for $\log(1/\delta)$ times, and outputting the set $X_0$ of selected points with the minimum $\mathrm{cost}(P,X_0)$. This result is proven to be at most 20 times worse than the optimal solution, i.e., $E := \mathrm{cost}(P,X_0)/20 \leq \mathrm{cost}(P,X^*)$, with probability at least $1-\delta$. Define $\epsilon_{QT} := 4n\Delta_D\Delta_{QT}/E$. Then based on (79), we have:
\begin{align}
\mathrm{cost}(P,X) &\leq \frac{(1+\epsilon_1^{(1)})^2(1+\epsilon_2)(1+\epsilon_1^{(2)})^2}{1-\epsilon_2}\,\mathrm{cost}(P,X^*) + \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\cdot 4n\Delta_D\Delta_{QT} \nonumber\\
&\leq \frac{(1+\epsilon_1^{(1)})^2(1+\epsilon_2)(1+\epsilon_1^{(2)})^2}{1-\epsilon_2}\,\mathrm{cost}(P,X^*) + \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\cdot\frac{4n\Delta_D\Delta_{QT}}{E}\,\mathrm{cost}(P,X^*) \nonumber\\
&= \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\left((1+\epsilon_1^{(1)})(1+\epsilon_2)(1+\epsilon_1^{(2)}) + \epsilon_{QT}\right)\mathrm{cost}(P,X^*). \tag{91}
\end{align}
Let $f(\epsilon_1^{(1)}, \epsilon_2, \epsilon_1^{(2)}, \epsilon_{QT})$ denote the communication cost as a function of the configuration parameters $\epsilon_1^{(1)}$, $\epsilon_2$, $\epsilon_1^{(2)}$, and $\epsilon_{QT}$. Our goal is to find the optimal configuration that minimizes the communication cost while satisfying the given bound on the approximation error:
\begin{align}
\min_{\epsilon_1^{(1)},\, \epsilon_2,\, \epsilon_1^{(2)},\, \epsilon_{QT}} \quad & X := f(\epsilon_1^{(1)}, \epsilon_2, \epsilon_1^{(2)}, \epsilon_{QT}) \tag{92a}\\
\text{s.t.} \quad & Y' := \frac{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})}{1-\epsilon_2}\left((1+\epsilon_1^{(1)})(1+\epsilon_2)(1+\epsilon_1^{(2)}) + \epsilon_{QT}\right) \leq Y, \tag{92b}
\end{align}
where $X$ denotes the communication cost and $Y'$ denotes (an upper bound on) the approximation error. The parameter $\delta_2$ is set to $1-(1-\delta)^{1/2}$ to satisfy the desired confidence level.
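A minimal sketch of this estimation of $E$ is given below. It follows the adaptive-sampling idea of [32]; the exact number of points drawn per round (here $3k$) and the function names are our own assumptions:

```python
import numpy as np

def lower_bound_E(P, k, delta, rng=None):
    """Estimate a lower bound E on cost(P, X*): repeatedly draw O(k)
    points with probability proportional to squared distance from the
    current sample, keep the cheapest run, and divide its cost by 20
    (the approximation factor quoted from [32])."""
    rng = rng or np.random.default_rng(0)
    def run_once():
        X = [P[rng.integers(len(P))]]
        d2 = ((P - X[0]) ** 2).sum(axis=1)
        for _ in range(3 * k - 1):  # O(k) points per round; factor 3 assumed
            X.append(P[rng.choice(len(P), p=d2 / d2.sum())])
            d2 = np.minimum(d2, ((P - X[-1]) ** 2).sum(axis=1))
        return d2.sum()             # cost(P, X) for this run
    rounds = max(1, int(np.ceil(np.log(1 / delta))))
    return min(run_once() for _ in range(rounds)) / 20.0
```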
2) Analysis:
The communication cost $X$ is dominated by the transfer of the dimensionality-reduced, quantized coreset $S'_{QT}$. Let its cardinality, dimensionality, and precision be $n'$, $d'$, and $b'$. By [8], the cardinality of an $\epsilon_2$-coreset generated by FSS is $n' = O(k\log(k)\log(1/\delta)/\epsilon_2^2)$. To satisfy Lemma III.2 with $\epsilon_1^{(2)}$, the dimensionality needs to satisfy $d' = O(\log(n'k/\delta)/(\epsilon_1^{(2)})^2)$. By the analysis of the quantization error in Section VI-A, $b' = O(\log(n\sqrt{d}/\epsilon_{QT}))$. Denoting the constant factors in these big-$O$ terms by $C_1$, $C_2$, and $C_3$, we have
\begin{align}
X \approx n'\cdot d'\cdot b' &= \frac{C_1 k\log(k)\log(1/\delta)}{\epsilon_2^2}\cdot\frac{C_2\log(n'k/\delta)}{(\epsilon_1^{(2)})^2}\cdot C_3\log\left(\frac{n\sqrt{d}}{\epsilon_{QT}}\right) \tag{93}\\
&= \tilde{C}_1\cdot\frac{\log(\tilde{C}_2/\epsilon_2^2)}{\epsilon_2^2(\epsilon_1^{(2)})^2}\cdot\log\left(\frac{\tilde{C}_3}{\frac{Y(1-\epsilon_2)}{(1+\epsilon_1^{(1)})(1+\epsilon_1^{(2)})} - (1+\epsilon_1^{(1)})(1+\epsilon_2)(1+\epsilon_1^{(2)})}\right), \tag{94}
\end{align}
where $\tilde{C}_1 = k\log(k)\log(1/\delta)\,C_1C_2C_3$, $\tilde{C}_2 = C_1 k^2\log(k)\log(1/\delta)/\delta$, and $\tilde{C}_3 = n\sqrt{d}$. Assuming $k \geq 2$, by plugging the constant factors from [33]–[35] into Theorem 36 of [8], we obtain $C_1 = 54912(1+\log_2(3))(1+\log_2(26))$. By the JL projection, $d' \leq \lceil 8\log(n'k/\delta)/(\epsilon_1^{(2)})^2\rceil$, and therefore $C_2$ could be 8. Under mild assumptions on $n$ and $E$ (e.g., $E \geq 1$), $C_3$ could be 2. Equation (94) is obtained by replacing $\epsilon_{QT}$ with its maximum value derived from (92b).

While solving (92) in the general case is nontrivial, we note that for the rounding-based quantizer defined in Section VI-A, $\epsilon_{QT}$ only has a finite number of possible values, corresponding to the number of significant bits $s = 1, \ldots, b-1-m_e$. Thus, under the simplifying constraint of $\epsilon_1^{(1)} = \epsilon_2 = \epsilon_1^{(2)} =: \epsilon$, we can enumerate each possible value of $\epsilon_{QT}$, compute the maximum $\epsilon$ under this $\epsilon_{QT}$ from (92b), and plug it into (94) to evaluate $X$. We can then select the configuration that yields the minimum $X$.
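The enumeration just described is straightforward to implement. Below is a minimal sketch (the helper names are ours; $C_1$–$C_3$ are placeholders for the constants discussed above, and the range of $s$ follows the enumeration used in the experiments):

```python
import numpy as np

def max_eps(Y, eps_qt, tol=1e-9):
    """Largest eps with (1+eps)^2 ((1+eps)^3 + eps_qt) / (1-eps) <= Y,
    i.e., (92b) under eps1_1 = eps2 = eps1_2 = eps; the left-hand side
    is increasing in eps on (0,1), so bisection suffices."""
    lhs = lambda e: (1 + e) ** 2 * ((1 + e) ** 3 + eps_qt) / (1 - e)
    if lhs(0.0) > Y:
        return None  # infeasible: quantization error alone exceeds the budget
    lo, hi = 0.0, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if lhs(mid) <= Y else (lo, mid)
    return lo

def best_config(Y, n, d, k, delta, E, diam, max_norm, s_max=53):
    """Enumerate the number of significant bits s and pick the
    configuration minimizing the cost proxy (93); C1-C3 are assumed."""
    C1, C2, C3 = 1.0, 8.0, 2.0
    best = None
    for s in range(1, s_max + 1):
        delta_qt = 2.0 ** (-s) * max_norm     # quantization error bound (76)
        eps_qt = 4 * n * diam * delta_qt / E  # definition of eps_QT
        eps = max_eps(Y, eps_qt)
        if eps is None or eps <= 0:
            continue
        n_p = C1 * k * np.log(k) * np.log(1 / delta) / eps**2
        d_p = C2 * np.log(n_p * k / delta) / eps**2
        b_p = C3 * np.log(n * np.sqrt(d) / eps_qt)
        X = n_p * d_p * b_p                   # communication-cost proxy (93)
        if best is None or X < best[0]:
            best = (X, s, eps)
    return best  # (estimated communication cost, significant bits s, eps)
```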
D. Experiments for Joint DR, CR, and Quantization

We now use the datasets from Section V to evaluate the impact of adding quantization to the previously proposed algorithms. We also use the same performance metrics, except that the normalized communication cost becomes the ratio between the number of transmitted bits and the total number of bits in the original dataset (in double precision). For tractability, in the experiments we set all the $\epsilon$ values except $\epsilon_{QT}$ to be equal when solving (92). All the results are averaged over 10 Monte Carlo runs. (The number of Monte Carlo runs is limited by the running time of the experiment, which takes around 30 hours to complete one Monte Carlo run for all the algorithms and all the settings.)

Fig. 3. Single-source case with quantization: MNIST. (a) Normalized k-means cost; (b) normalized communication cost; (c) running time (s).

Fig. 4. Single-source case with quantization: NeurIPS. (a) Normalized k-means cost; (b) normalized communication cost; (c) running time (s).
1) Evaluated Algorithms:
In the case of a single data source, we evaluate the following:
• “FSS+QT”: the quantization-added version of FSS [8],
• “JL+FSS+QT”: the quantization-added version of Algorithm 2,
• “FSS+JL+QT”: the quantization-added version of Algorithm 4, and
• “JL+FSS+JL+QT”: the quantization-added version of Algorithm 5.

In the case of multiple data sources, we evaluate the following algorithms:
• “BKLW+QT”: the quantization-added version of BKLW [17], and
• “JL+BKLW+QT”: the quantization-added version of Algorithm 6.

For each algorithm, we construct a data summary under each configuration that corresponds to a possible number of significant bits $s$, and then solve $k$-means based on the data summary. Specifically, since the IEEE Standard 754 floating-point representation [36] consists of 53 significant bits, we enumerate $s = 1, \ldots, 53$.
2) Results:
The results for the single data source scenario are given in Figures 3–4. We have the following key observations: (i) Compared with the methods without quantization (the right-most points, under $s = 53$), adding quantization can further reduce the communication cost without increasing the $k$-means cost or the running time. (ii) However, it is not trivial to find the optimal configuration that achieves a comparable $k$-means cost with the least communication cost and running time; e.g., very small and very large values of $s$ both lead to high communication costs, justifying the need of solving (92). (iii) Overall, the four-step procedure JL+FSS+JL+QT achieves the best communication cost and running time with a comparable $k$-means cost.

Fig. 5. Multiple-source case with quantization: MNIST. (a) Normalized k-means cost; (b) normalized communication cost; (c) running time (s).

The results for the multiple data source scenario are given in Figures 5–6. We have similar observations as in the case of a single data source: (i) Adding quantization to DR/CR methods can further reduce the communication cost without increasing the $k$-means cost or the running time. (ii) Choosing a proper configuration (by selecting the optimal number of significant bits to retain in quantization) is nontrivial. (iii) Compared with BKLW+QT, JL+BKLW+QT can reduce both the communication cost and the computational complexity at the data sources, which is consistent with the comparison between BKLW and JL+BKLW (Section IV-B). We also note that in both scenarios, the proposed algorithms show better performance relative to the benchmark algorithms (i.e., FSS+QT, BKLW+QT) for NeurIPS than for MNIST. This is because NeurIPS has considerably more attributes than MNIST, leaving more room for improvement in the design of data reduction algorithms.

VII. CONCLUSION
Fig. 6. Multiple-source case with quantization: NeurIPS. (a) Normalized k-means cost; (b) normalized communication cost; (c) running time (s).

In this paper, we considered the problem of jointly applying DR and CR methods to efficiently compute the $k$-means centers for a large high-dimensional dataset located at remote data source(s). Through a comprehensive analysis of the approximation error, the communication cost, and the complexity of various combinations of state-of-the-art DR/CR methods, we proved that it is possible to achieve a near-optimal approximation of $k$-means at a near-linear complexity at the data source(s) and a very low (constant or logarithmic) communication cost. In the process, we developed algorithms based on novel combinations of existing DR/CR methods that outperformed two state-of-the-art algorithms in the cases of a single data source and multiple data sources, respectively. We also demonstrated how to combine DR/CR methods with quantization methods to further reduce the communication cost without compromising the other performance metrics. Our findings were validated through experiments on two real datasets.

REFERENCES
[1] H. Lu, T. He, S. Wang, C. Liu, M. Mahdavi, V. Narayanan, K. S. Chan, and S. Pasteris, "Communication-efficient k-means for edge-based machine learning," in ICDCS, November 2020.
[2] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[3] H. Lu, M.-J. Li, T. He, S. Wang, V. Narayanan, and K. S. Chan, "Robust coreset construction for distributed machine learning," in IEEE Globecom, December 2019.
[4] ——, "Robust coreset construction for distributed machine learning," IEEE Journal on Selected Areas in Communications, vol. 38, no. 10, pp. 2400–2417, October 2020.
[5] D. Aloise, A. Deshpande, P. Hansen, and P. Popat, "NP-hardness of Euclidean sum-of-squares clustering," Machine Learning, vol. 75, no. 2, pp. 245–248, 2009.
[6] M. Mahajan, P. Nimbhorkar, and K. R. Varadarajan, "The planar k-means problem is NP-hard," Theoretical Computer Science, vol. 442, pp. 13–21, 2012.
[7] K. Makarychev, Y. Makarychev, and I. Razenshteyn, "Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering," in STOC, June 2019.
[8] D. Feldman, M. Schmidt, and C. Sohler, "Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering," 2018. [Online]. Available: https://arxiv.org/abs/1807.04518
[9] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, "Wireless network intelligence at the edge," Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
[10] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019.
[11] M. F. Balcan, S. Ehrlich, and Y. Liang, "Distributed k-means and k-median clustering on general topologies," in NIPS, December 2013.
[12] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in NIPS, December 2010.
[13] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu, "Dimensionality reduction for k-means clustering and low rank approximation," in STOC, June 2015.
[14] S. Har-Peled and S. Mazumdar, "On coresets for k-means and k-median clustering," in STOC, 2004.
[15] D. Feldman and M. Langberg, "A unified framework for approximating and clustering data," in STOC, June 2011.
[16] D. Feldman, M. Schmidt, and C. Sohler, "Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering," in SODA, 2013.
[17] M. F. Balcan, V. Kanchanapally, Y. Liang, and D. Woodruff, "Improved distributed principal component analysis," in NIPS, December 2014.
[18] Y. Mao, Z. Xu, P. Ping, and L. Wang, "An optimal distributed k-means clustering algorithm based on CloudStack," in International Conference on Frontier of Computer Science and Technology, August 2015.
[19] M. C. Naldi and R. J. G. B. Campello, "Distributed k-means clustering with low transmission cost," in Brazilian Conference on Intelligent Systems, October 2013.
[20] C. R. Giannella, H. Kargupta, and S. Datta, "Approximate distributed k-means clustering over a peer-to-peer network," IEEE Transactions on Knowledge and Data Engineering, vol. 21, pp. 1372–1388, October 2009.
[21] K. Sayood, Introduction to Data Compression, 4th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2012.
[22] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Springer Science & Business Media, 2012, vol. 159.
[23] A. Ben-Israel and T. N. E. Greville, Generalized Inverses: Theory and Applications. Springer, 2003.
[24] W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," in Conference in Modern Analysis and Probability, 1982.
[25] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in ACM STOC, 1998.
[26] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[27] B. Klartag and S. Mendelson, "Empirical processes and random projections," Journal of Functional Analysis, vol. 225, no. 1, pp. 229–245, August 2005.
[28] M. F. Balcan, V. Kanchanapally, Y. Liang, and D. Woodruff, "Improved distributed principal component analysis," 2014. [Online]. Available: https://arxiv.org/abs/1408.5823
[29] A. Aggarwal, A. Deshpande, and R. Kannan, "Adaptive sampling for k-means clustering," in APPROX, August 2009.
[30] Y. LeCun, C. Cortes, and C. Burges, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[31] V. Perrone, P. A. Jenkins, D. Spano, and Y. W. Teh, "Poisson random fields for dynamic feature models," arXiv preprint arXiv:1611.07460, 2016.
[32] A. Aggarwal, A. Deshpande, and R. Kannan, "Adaptive sampling for k-means clustering," in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, 2009, pp. 15–28.
[33] M. Langberg and L. J. Schulman, "Universal ε-approximators for integrals," in SODA, 2010.
[34] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," Journal of the ACM, vol. 36, no. 4, pp. 929–965, 1989.
[35] Y. Li, P. M. Long, and A. Srinivasan, "Improved bounds on the sample complexity of learning," Journal of Computer and System Sciences, vol. 62, no. 3, pp. 516–527, 2001.
[36] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019, 2019.