Adapting k -means algorithms for outliers
AAdapting k -means algorithms for outliers Christoph GrunauETH Z¨urich V´aclav RozhoˇnETH Z¨urichMarch 2020
Abstract
This paper shows how to adapt several simple and classical sampling-based algorithms for the k -means problem to the setting with outliers.Recently, Bhaskara et al. (NeurIPS 2019) showed how to adapt theclassical k -means++ algorithm to the setting with outliers. However,their algorithm needs to output O (log( k ) · z ) outliers, where z is thenumber of true outliers, to match the O (log k )-approximation guaranteeof k -means++. In this paper, we build on their ideas and show how toadapt several sequential and distributed k -means algorithms to thesetting with outliers, but with substantially stronger theoreticalguarantees: our algorithms output (1 + ε ) z outliers while achieving an O (1 /ε )-approximation to the objective function. In the sequential world,we achieve this by adapting a recent algorithm of Lattanzi and Sohler(ICML 2019). In the distributed setting, we adapt a simple algorithm ofGuha et al. (IEEE Trans. Know. and Data Engineering 2003) and thepopular k -means (cid:107) of Bahmani et al. (PVLDB 2012).A theoretical application of our techniques is an algorithm with runningtime ˜ O ( nk /z ) that achieves an O (1)-approximation to the objectivefunction while outputting O ( z ) outliers, assuming k (cid:28) z (cid:28) n . This iscomplemented with a matching lower bound of Ω( nk /z ) for this problemin the oracle model. Clustering is a fundamental tool in machine learning and data analysis. It aimsto partition a given set of objects into clusters in such a way that similar objectsend up in the same cluster. The classical way of approaching the clusteringproblem is via the k -means formulation. In this formulation, one works with aset X consisting of n points in the d -dimensional Euclidean space R d and theobjective is to output a set C consisting of k centers so as to minimize the sumof squared distances of points in X to C , i.e., ϕ ( X, C ) = (cid:88) x ∈ X min c ∈ C (cid:107) x − c (cid:107) . ˜ O ( . ) hides logarithmic factors in n . We assume that the minimum distance between anytwo points is 1 and the maximum distance ∆ between any two points is bounded by poly( n ). a r X i v : . [ c s . D S ] J u l ssigning each point to its closest cluster center naturally induces a partition ofthe points where nearby points tend to end up in the same partition. k -means with Outliers One major drawback of k -means in practice is itssensitivity to outliers [GKL + k -means objective. Probably the simplest such formulation isthe k -means with outliers formulation [CKMN01]. In this formulation, oneadditionally receives a number z ∈ { , , . . . , n } . The aim is to find a set C consisting of k cluster centers and additionally a set X out ⊆ X consisting of z outliers so as to optimize the cost of inliers X in := X \ X out . That is, weoptimize min X out ,C ϕ ( X \ X out , C ) . Known Results for k -means with Outliers It is known that the problemcan be approximated up to a constant factor in polynomial time by rounding acertain linear program [Che08, KLS18]. However, the complexity of these knownalgorithms is a large polynomial. Moreover, as we observe in Section 6, everyconstant factor approximation algorithm that outputs exactly z outliers needsto perform Ω( z ) queries in the query model.This motivates to weaken our requirements and look for fast algorithms thatare allowed to output slightly more than z outliers. There is a line of work thatmakes progress toward such algorithms [CKMN01, MOP04, GKL +
17, GLZ17,BVX19, IQM + Az outliers, for some large constant A (cid:29) z ) runningtime and, hence, to be truly applicable even for large z they need to be sped upby coreset constructions [FL11, HJLW18]. Our Contribution
In this work we aim to further close the gap betweentheory and practice. We show how to adapt a number of classical sampling-based k -means algorithms to the setting with outliers. The main idea of theadaptations is a reduction to the variant of the problem with penalties describedbelow. This is a known approach [CKMN01], but previous reductions of samplingbased algorithms [BVX19] necessarily need to output Az outliers for some largeconstant A (cid:29)
1, while we obtain algorithms that need to output only (1 + ε ) z outliers. More concretely, our contribution is threefold.First, building on ideas of [LS19], we design a simple sampling-based algorithmwith running time ˜ O ( nk/ε ). It outputs an O (1 /ε )-approximate solution whiledeclaring at most (1 + ε ) z points as outliers. This shows that sampling basedalgorithms that are known to be fast and have good practical performance canalso achieve strong theoretical guarantees.Second, we devise distributed algorithms for k -means with outliers whereeach machine needs to send only ˜ O ( k/ε ) bits. Our construction achieves an O (1)-approximation while outputting (1 + ε ) z outliers. Moreover, each machineonly needs to perform polynomial-time computation. This improves on [LG18]2ho achieve an (1 + ε )-approximation while outputting the same number ofoutliers as our algorithm, but their computation time is exponential.Third, we show that one can achieve an O (1)-approximation guaranteewhile discarding O ( z ) outliers in time ˜ O ( k · n/z ). This is done by speeding-up sampling with the Metropolis-Hastings algorithm as done in [BLHK16b,BLHK16a] together with additional ideas. This result is complemented by amatching lower bound of Ω( k · ( n/z )) for k -means/ k -median/ k -center algorithmsthat work for an arbitrary metric space accessed by distance queries. Thisimproves on [MOP04, HJLW18] who give algorithms in this setting with arunning time of ˜ O (cid:16) k · ( n/z ) (cid:17) . This is a significant improvement for z (cid:28) n . Roadmap.
We overview the related work in Section 2 and our approachin Section 3. Then, we present our contributions in Sections 4 to 6 and ourexperiments in Section 7. Technical details are mostly deferred to appendices.
Notation
We define ϕ ( x, C ) = min c ∈ C (cid:107) x − c (cid:107) and set ϕ ( X, C ) = (cid:80) x ∈ X ϕ ( x, C ). Similarly, we define τ Θ ( x, C ) = min(Θ , ϕ ( x, C )) and τ Θ ( X, C ) = (cid:80) x ∈ X τ Θ ( x, C ). We call an algorithm an ( α, β )-approximation if itoutputs a set of k centers C and a set of z outliers X out such that ϕ ( X \ X out , C ) ≤ α OPT = αϕ ( X \ X ∗ out , C ∗ ), where OPT is the cost of a fixedoptimal clustering C ∗ with a set of outliers X ∗ out , | X ∗ out | = z . Moreover, wedefine X in := X \ X out and X ∗ in := X \ X ∗ out . k -means Problem It is well-known that finding the optimal solution is NP-hard [ADHP09, MNV09]. Currently the best known approximation ratio isroughly 6.36 [ANFSW19] and an approximation ratio of (1 + ε ) can be achievedfor fixed dimension d [FRS19] or fixed k [KSS04]. From the practical perspective,Lloyd’s heuristic [Llo82, BLSS16] is the algorithm of choice. As Lloyd’s algorithmconverges only to a local optimum, it requires a careful seeding to achievegood performance. The most popular seeding choice is the k -means ++ seeding[AV07, ORSS13]. In k -means ++ seeding (Algorithm 1) one chooses the firstcenter uniformly at random from the set of inputs points X . In each of thefollowing k − x ∈ X as a new center withprobability proportional to its current cost ϕ ( x, C ). The seeding works wellin practice and a theoretical analysis shows that even without running Lloyd’salgorithm subsequently it provides an expected O (log k )-approximation to the k -means objective [AV07]. k -means with Outliers There is a growing body of research related to the k -means with outliers problem. On the practical side, Lloyd’s algorithm isreadily adapted to the noisy setting [CG13], but the output quality still remainsdependent on the initial seeding. On the theoretical side, constant approximation3 lgorithm 1 k -means ++ seedingInput: X , k Uniformly sample c ∈ X and set C = { c } . for i ← , , . . . , k do Sample c ∈ X with probability ϕ ( c, C ) /ϕ ( X, C ) and add it to C . end for return C algorithms based on the method of successive local search [Che08, KLS18] areknown to provide a constant approximation guarantee. However, their runningtime is a large polynomial in n . We are interested in fast algorithms that areallowed to output slightly more than z outliers. Several algorithms have beenproposed [CKMN01, MOP04, GKL +
17, GLZ17, BVX19, IQM + z ) time or they need to output at least Cz outliers forsome C (cid:29)
1. The algorithms can in general be sped up by coreset constructions[FL11, GKL +
17, HJLW18, IQM + k -means with (Uniform) Penalties k -means with penalties is a different wayof handling outliers introduced in the seminal paper of Charikar et al. [CKMN01].In the version of the problem with uniform penalties, we are given some positivenumber Θ and the goal is to output a set of centers C = { c , c , . . . , c k } so as tominimize the expression τ Θ ( X, C ) = (cid:80) x ∈ X min (cid:0) Θ , min c ∈ C (cid:107) x − c (cid:107) (cid:1) . That is,the cost of any point is bounded by a threshold Θ.It turns out that it is usually much simpler to work with k -means withpenalties than k -means with outliers. This is quite helpful because results for k -means with penalties can be turned into results for k -means with outliers[CKMN01, LG18, BVX19, BR20]. We also take this approach in this paper. Weformalize the reduction in the next section and also present our improvement ofit for sampling based algorithms. k -means with Outliers to k -means with Penalties Our approach is based on reducing the problem of k -means with outliers to theproblem of k -means with penalties. In this section, we first review how one canreduce the problem of k -means with outliers to the problem of k -means withpenalties. Then we review an instance of this reduction by [BVX19] and providea refined result which is the starting point of our work. We note that this can beseen as an instantiation of a general principle in optimization where one replacesa constraint with a penalty function known as Lagrangian relaxation [BBV04].4 .1 Review of the Previously Known Reduction We start by giving the following lemma that formalizes how an α -approximatesolution to the k -means with penalties objective can be used to obtain analgorithm providing an ( O ( α ) , O ( α ))-approximate solution to the k -means withoutliers objective. Lemma 1 ([CKMN01]) . Let C be an α -approximate solution to the k -means withpenalties objective with penalty OPT / (2 z ) ≤ Θ ≤ OPT /z . Let X out denote theset of points x for which τ Θ ( x, C ) = Θ . Then, it holds that ϕ ( X \ X out , C ) = O ( α OPT ) and | X out | = O ( αz ) .Proof. Note that the optimal solution for k -means with penalties has a cost upperbounded by ϕ ( X \ X ∗ out , C ∗ ) + | X ∗ out | Θ = OPT + z Θ. As C is an α -approximatesolution to the k -means with penalties objective, we have ϕ ( X \ X out , C ) ≤ τ Θ ( X, C ) ≤ α (OPT + z Θ) = O ( α OPT) . Moreover, paying Θ for every x ∈ X out implies | X out | ≤ τ Θ ( X,C )Θ ≤ α (OPT+ z Θ)Θ = O ( αz ) . We remark that the penalty Θ depends on OPT, so the algorithm for k -meanswith outliers needs to try this reduction for all O (log( n ∆ )) = O (log n ) powersof 2 between 1 and n ∆ .This reduction is very helpful as it allows us to easily adapt sampling-basedalgorithms like k -means ++ to the setting with penalties since their analysisgeneralises to this setting. For the case of k -means ++ , this was shown in[BVX19, Li20] in the following theorem. Algorithm 2 k -means ++ (over)seeding with penaltiesInput: A set of points X , k , (cid:96) , threshold Θ Uniformly sample c ∈ X and set C = { c } . for i ← , , . . . , (cid:96) do Sample c ∈ X with probability τ Θ ( c, C ) /τ Θ ( X, C ) and add it to C . end for return C Theorem 1. [[AV07, BVX19, Li20]] Suppose we run Algorithm 2 for (cid:96) = k steps. Then, the output set C is an O (log k ) -approximation to the k -means withpenalty Θ objective, in expectation.Proof sketch. 
The analysis of [AV07] proves this guarantee for Algorithm 1for any ϕ ( a, b ) := d ( a, b ) such that d is a metric. As the distance function d (cid:48) ( a, b ) = min( d ( a, b ) , √ Θ) still defines a metric, one can thus directly use theiranalysis to prove Theorem 1.More details are given in Appendix A. By Theorem 1, plugging inAlgorithm 2 into Lemma 1 gives an ( O (log k ) , O (log k ))-approximate algorithmfor k -means ++ with outliers. Moreover, running Algorithm 2 for O ( k ) stepsresults in an O (1)-approximation to the k -means objective in metric spaces5ADK09, BVX19]. This gives the following tri-criteria approximation viaLemma 1: we get an ( O (1) , O (1))-approximation algorithm that needs to use O ( k ) centers. The starting point of our work is the following improvement of the tri-criteriaresult from the previous subsection that enables us to get a constant factorapproximation algorithm that outputs only (1+ ε ) z outliers, which is not possibleby using Lemma 1 as a black box. The catch is that we need to use O ( k/ε )centers. Theorem 2.
Running Algorithm 2 for (cid:96) = O ( k/ε ) iterations and OPT / (2 εz ) ≤ Θ ≤ OPT / ( εz ) results in a set C with τ Θ ( X ∗ in , C ) = 20 OPT, with positiveconstant probability.Proof sketch (full proof in Appendix B).
For an optimal set of centers C ∗ = { c ∗ , . . . , c ∗ k } , we define X ∗ i ⊆ X ∗ in as the subset of points x ∈ X ∗ in with c ∗ i =arg min c ∗ ∈ C ∗ ϕ ( x, c ∗ ), where ties are broken arbitrarily. Fix one iteration ofAlgorithm 2 and let C be its current set of centers. We refer to a cluster X ∗ i as unsettled if τ Θ ( X ∗ i , C ) ≥ τ Θ ( X ∗ i , C ∗ ).Suppose that τ Θ ( X ∗ in , C ) ≥ c from X ∗ in with probability τ Θ ( X ∗ in , C ) τ Θ ( X ∗ in , C ) + τ Θ ( X ∗ out , C ) ≥ τ Θ ( X ∗ out , C ) ≥ /ε ≥ ε , where we used τ Θ ( X ∗ in , C ) ≥ τ Θ ( X ∗ out , C ) ≤ | X ∗ out | Θ ≤ OPT /ε , andthat ε is small enough. Moreover, the cost of all settled clusters is boundedby 10OPT, hence given that c is sampled from X ∗ in , the probability that it issampled from an unsettled cluster is at least τ Θ ( X ∗ in , C ) − τ Θ ( X ∗ in , C ) ≥ − , where we again used τ Θ ( X ∗ in , C ) ≥ c is sampled from anunsettled cluster according to the τ Θ -distribution, Corollary 1 in Appendix Atells us that the cluster becomes settled with probability at least . Hence, wemake a new cluster settled with probability at least ( ε/ · · = ε/ O ( k/ε ) stepsof the algorithm, either all clusters are settled with positive constant probability,and hence we have τ Θ ( X ∗ in , C ) ≤ τ Θ ( X ∗ in , C ) ≥ ε = 1 results in the same tri-criteria result we discussedin the previous subsection: This holds as τ Θ ( X ∗ in , C ) = O (OPT) implies that τ Θ ( X, C ) ≤ τ Θ ( X ∗ in , C ) + z Θ = O (OPT /ε ).6owever, we can get more out of Theorem 2: note that as τ Θ ( X ∗ in , C ) = O (OPT), for the set X out defined as those x ∈ X with τ Θ ( x, C ) = Θ, we havethat | X out ∩ X ∗ in | = O (OPT)Θ = O ( εz ). Hence, setting X out as our output set ofoutliers gives on one hand | X out | ≤ | X ∗ out | + | X out ∩ X ∗ in | = z + O ( εz ) . On the other hand, we can bound ϕ ( X \ X out , C ) = τ Θ ( X \ X out , C ) ≤ τ Θ ( X, C ) = O (OPT + z Θ) = O (OPT /ε ) . Hence, we obtain an O (1 /ε ) approximation guarantee while outputting just(1 + ε ) z outliers. By itself, Theorem 2 is still not satisfactory, as it requires usto oversample the number of centers by a factor of O (1 /ε ). However, we nextshow three directions of improvement that lead to more interesting results. In this section, we present a simple sequential sampling-based algorithm for k -means with outliers that achieves an O (1 /ε ) approximation and outputs (1 + ε ) z outliers: Theorem 3.
For every < ε < , there exists an ( O (1 /ε ) , ε ) -approximationalgorithm for k -means with outliers with running time ˜ O ( nk/ε ) . The algorithm is based on ideas of a recent paper of Lattanzi and Sohler [LS19]who proposed augmenting k -means ++ with ˜ O ( k ) local search steps [KMN + k -means ++ to obtain an initial set of k centers. Afterwards, in each local search step, the algorithm samples a ( k + 1)-thpoint from the same distribution as k -means ++ . After sampling that point, thealgorithm iterates over all k + 1 current centers and takes out the one whosedeletion raises the cost the least (see Algorithm 3). Running k -means ++ followedby O ( k ) steps of Local-search++ is known to yield an O (1)-approximation forthe cost function ϕ = τ ∞ [CGPR20], with a positive constant probability. Algorithm 3
One step of
Local-search++
Input: X , C , threshold Θ Sample c ∈ X with probability τ Θ ( c, C ) /τ Θ ( X, C ) c (cid:48) ← arg min d ∈ C ∪{ c } τ Θ ( X, C \ { d } ∪ { c } ) return C \ { c (cid:48) } ∪ { c } We get Theorem 3 by proving that ˜ O ( k/ε ) iterations of Algorithm 3 withOPT / (2 εz ) ≤ Θ ≤ OPT / ( εz ) result in an ( O (1 /ε ) , ε )-approximation In this particular case, we can even get an O (1)-approximation guarantee by labelling thefurthest (1 + O ( ε )) z points as outliers, since the set X out ∪ X ∗ out has size at most (1 + O ( ε )) z and ϕ ( X \ ( X out ∪ X ∗ out ) , C ) ≤ τ Θ ( X ∗ in , C ) = O (OPT). τ Θ ,similarly as Theorem 1, and, hence, by Lemma 1 one obtains an( O (1) , O (1))-approximation algorithm. Our refined analysis, crucially, does not use Lemma 1 as a black box. Instead, we use the idea of the lemma as abuilding stone for the rather intricate analysis of the algorithm.The intuition behind our proof of Theorem 3 comes from Theorem 2: thedifference between running k -means++ for O ( k/ε ) steps and the local searchalgorithm we described above is that the local search algorithm additionallyremoves one point after each sampling step. One can still hope that the increasein cost due to the removals is dominated by the decrease in cost due to thenewly sampled center making an unsettled cluster settled, as we have seen inthe analysis of Theorem 2. This is indeed the case. We note that a local searchbased algorithm was also considered by [GKL + k -means withOutliers To model the distributed setting, we consider the coordinator model where theinput data X = X (cid:116) X . . . (cid:116) X m is split across m machines. Each machine canfirst perform some local computation. Then, each machine sends a message to aglobal coordinator who computes the final set of cluster centers C . The maincomplexity measure is the total number of bits each machine needs to send tothe coordinator. An informal version of our main distributed result states thefollowing. Theorem 4.
There exists an ( O (1) , ε ) -approximate distributed algorithm inthe coordinator model such that each machine sends at most ˜ O ( k/ε ) many bits.Moreover, each machine only needs to perform polynomial-time computations. This improves on a recent result of [LG18]. They give an algorithm with a(1 + ε, ε )-approximation guarantee, but the required local running time isexponential. Other results similar to ours include [GLZ17, CAZ18]. Below, wesketch the high-level idea of the constructions and leave details to Appendix F. Simpler Construction
In this paragraph we explain the high-level intuitionbehind a weaker result that provides an ( O (1 /ε ) , ε )-approximation. Thisgeneralizes a classical construction of [GMM +
03] that works in the k -means setting. For simplicity, we assume here that the machines know thevalue OPT and we define Θ := OPT / ( εz ). This assumption can be lifted at thecost of a logarithmic overhead in the time - and message complexity.Each machine starts by running Algorithm 2 for ˜ O ( k/ε ) steps with input X j to obtain a set of centers Y j . The sets of centers satisfy the property8 j τ Θ ( X j ∩ X ∗ in , Y j ) = O (OPT). This follows from an argument very similarto Theorem 2 and will be important later on. Now, let X out,j denote all pointsin X j that have a squared distance of at least Θ to the closest point in Y j . Allpoints in X out,j are declared as outliers and the remaining points in X j \ X out,j are moved to the closest point in Y j which creates a new, weighted instance X (cid:48) j . Each machine sends its weighted instance together with the number ofdeclared outliers to the coordinator. The coordinator combines the weightedinstances and finds an ( O (1 /ε ) , ε )-approximate clustering using the algorithmguaranteed by Theorem 9 on the instance X (cid:48) , but with the number of outliersequal to z (cid:48) = z − | X out | + O ( εz ), where X out := (cid:83) j X out,j . Let C denote thecorresponding set of cluster centers and X (cid:48) out denote the set consisting of the z (cid:48) outliers. In Theorem 15 in Appendix F we prove that ( C, X out ∪ X (cid:48) out ) is an( O (1 /ε ) , ε )-approximation.The (simple) analysis follows from two observations. First, the number ofinliers incorrectly labelled as outliers in the first step, i.e., | (cid:83) j ( X out,j ∩ X ∗ in ) | ,is bounded by O ( εz ). This follows from (cid:80) j τ Θ ( X j ∩ X ∗ in , Y j ) = O (OPT) and τ Θ ( x, Y j ) = Θ for every x ∈ X out,j . This ensures that in the end we output only O ( εz ) more outliers than if the coordinator ran Algorithm 4 on the full datasetwith original parameter z .Second, the total movement cost of changing the instance X to instance X (cid:48) is bounded by (cid:80) j ϕ ( X j \ X out,j , Y j ) ≤ (cid:80) j τ Θ ( X j , Y j ) ≤ z Θ + (cid:80) j τ Θ ( X j ∩ X ∗ in , Y j ) = O (OPT /ε ). Hence, the total cost changes additively by O (OPT /ε )compared to the case where the coordinator runs Algorithm 4 on the full dataset. Refined Version
To improve the approximation factor from O (1 /ε ) down to O (1) in Theorem 4, we need to perform two changes. First, the coordinatorruns the polynomial-time ( O (1) , X j \ X out,j closest to each point y ∈ Y j , but itadditionally sends for each integer k ∈ O (log ∆), how many of those pointshave a distance between 2 k and 2 k +1 to y . The coordinator then constructs aninstance based on this refined information. k -means (cid:107) Adapting the popular k -means (cid:107) algorithm to the setting withoutliers [BMV +
12] can also be accomplished with a construction similar to theone explained above. The only difference is that instead of getting the weightedinstance as a union of weighted instances from each machine, it is constructedin O (log n ) sampling rounds by using the k -means || idea: in each round wesample from the same distribution as in Algorithm 2, but we sample ˜ O ( k/ε )points instead of just one point. 9 Tight Bounds for ( O (1) , O (1)) -approximationAlgorithms In this section, we discuss tight bounds on the complexity of finding an( O (1) , O (1))-approximation for the k -means with outliers objective. InAppendix D we prove the following result. Theorem 5.
There is an ( O (1) , O (1)) -approximation algorithm for k -meanswith outliers that runs in time ˜ O ( nk · min(1 , k/z )) and succeeds with positiveconstant probability. This result is based on Theorem 2, but to speed up the algorithm from ˜ O ( nk )to ˜ O ( nk /z ), we use the Metropolis-Hastings algorithm (with uniform proposaldistribution) which was used in [BLHK16b, BLHK16a] for k -means. Here, theidea is that in one sampling step of Algorithm 2, instead of computing τ Θ ( x, C )for all the points, we first subsample ˜ O ( n/z ) points uniformly at random andonly from those points we then sample roughly proportional to their current τ Θ cost by defining a certain Markov chain. After speedup, we keep the guaranteesof Theorem 2, but now the running time is ˜ O ( nk /z ).Note that instead of using Metropolis-Hastings algorithm, we could also justtake a uniform subsample and run Algorithm 2 on it. The result of [HJLW18]would give that uniform subsampling the number of points to ˜ O ( k ( n/z ) ) wouldlose only a constant factor in approximation guarantees. This leads to a˜ O ( k ( n/z ) ) running time and an algorithm with similar running time alsobased on uniform subsampling for k -median was given by [MOP04]. Lower Bound
Next, we provide a matching lower bound. To that end, werestrict ourselves to the class of algorithms that work in an arbitrary metricspace M and access the distances of the space only by asking an oracle thatupon getting queried on two points x, y ∈ M returns their distance.Virtually all algorithms with guarantees we know of are of this type, possiblyup to a constant loss in their approximation guarantee. In Appendix A we verifythat this is the case also for algorithms that we consider here.For the classical problems of k -means , k -median, and k -center, there isan Ω( nk ) lower bound (i.e., showing that so many queries to the oracle arenecessary) for metric space algorithms [MP04, Met02]. This essentially matchesthe complexity of k -means ++ [AV07] or the Local-search++ algorithm ofLattanzi and Sohler [LS19]. The lower bound holds also in the setting withoutliers, if the output of the algorithm is an assignment of each of the n pointsto an optimal cluster, together with the list of the outliers. On the other hand,Theorem 5 gives an ( O (1) , O (1)) algorithm with complexity ˜ O ( nk /z ). Thecatch is that the output of this algorithm is just a set of centers and it does notcompute for all the points their respective assignment to a closest center. Also,it does not label all the outliers. For this type of algorithms that only outputthe set of centers, we prove a matching lower bound of Ω( nk /z ) in the metricspace query model. The following theorem is proved in Appendix E.10 heorem 6. Any randomized algorithm for the k -means/ k -median/ k -centerproblem with outliers in the setting k ≥ C , z ≥ Ck log k , and n ≥ Cz foran absolute constant C that with probability at least . gives an ( O (1) , O (1)) -approximation in the general metric space query model, needs Ω( nk /z ) queries. Let us provide a brief intuition of the proof. The construction that yieldsTheorem 6 is the following: we have k/ k/ z ) points.As the algorithm can output only Θ( z ) outliers, it needs to “find” Ω( k ) smallclusters. However, a point chosen from the input set uniformly at random isfrom a small cluster with probability O ( z/n ) and we expect to need Ω( k ) queriesuntil we find out whether a point is from a small or a big cluster. This leads tothe lower bound of Ω( k · nz · k ) = Ω( nk /z ) rounds. Why it makes sense to consider bicriteria approximation
Let us alsoobserve a different lower bound against ( O (1) , ε ) z outliers.The following construction is, e.g., in [Ind99]. Consider an input metric spacewith z = n − k = 1, where any two points have a distance of 1, up toa single pair x , x whose distance is 0 (or some small ε ). Any approximationalgorithm (even a randomized one) that outputs exactly z outliers needs to “find”the pair x , x and Ω( z ) queries are needed for this. The example shows thatthere is a fundamental limit to the speed of ( O (1) , z linear in n they even need Ω( n ) time. This should be contrasted withthe ( O (1) , ε )-approximation algorithms that only need poly( k/ε ) time for z = Θ( n ) that follows from [HJLW18]. We tested the following algorithms on the datasets kdd (KDD Cup 1999)subsampled to 10 000 points with 38 dimensions and spam (Spambase) with4601 points in 58 dimensions [DG17]. We set the number of outliers z to be 10percent of the dataset. • Lloyd : Variant of Lloyd’s algorithm [Llo82, CG13] that handles outlierswith random initialization (10 iterations); • k -means++ : k -means++ seeding[AV07]; • k -means++ with penalties : k -means++ with penalties[BVX19, Li20]; • Metropolized k -means++ with penalties : k -means++ with penalties,sped up by Metropolis-Hastings algorithm with 100 steps (see Appendix D); • Distributed k -means++ with penalties : simplified variant ofAlgorithm 5 – input is partitioned in 10 subinputs, k -means++ withpenalties is run on each of them with (cid:96) = 2 k , and weighted instances are11ent to coordinator who runs k -means++ with penalties again to obtainthe final k centers); • Sped up local search : Variant of Algorithm 4 – k -means++ withpenalties followed by k additional local search steps (for the objectivewith penalties).To guess the value of Θ in all except the first two algorithms, we tried 10values from 1 to 10 , exponentially separated. The best solution was then pickedand we followed by running 10 Lloyd iterations on it with the number of outliersfor these iterations set to z (the same for the second k -means++ algorithm).The results for this setup for k ∈ { , , . . . , } are in Figs. 1 and 2.Figure 1: Experiments on kddFigure 2: Experiments on spam k -means with penalties outperforms the first two baseline algorithms on12verage by around 40% in both datasets. Surprisingly, k -means++ seedingleads to consistently worse solutions than random initialization. We believe thisindicates that the datasets indeed contain outliers that k -means++ pickspreferably due to their large distance from other points. Distributed andmetropolized variants are on par with k -means++ with penalties , except forthe metropolized variant on the spam dataset. Sped up local search consistently outperforms k -means++ with penalties by around 12% in bothdatasets. It is also significantly slower, but we implemented only its simple˜ O ( nk ) implementation instead of the best possible ˜ O ( nk ). We have shown that several simple sampling-based algorithms for k -means canbe adapted to handle outliers and retain strong theoretical guarantees, while stillbeing similarly simple to implement. As a theoretical application, we settled thecomplexity of finding an ( O (1) , O (1))-approximation for k -means with outliersto ˜Θ( nk /z ) in the query model. Acknowledgment
We thank Davin Choo, Mohsen Ghaffari, Saeed Ilchi, Andreas Krause, andJulian Portmann for engaging discussions. In particular, we thank Mohsen forhis numerous helpful remarks and Davin and Saeed for sharing their code withus.
References [ADHP09] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat.Np-hardness of euclidean sum-of-squares clustering.
Machinelearning , 75(2):245–248, 2009.[ADK09] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptivesampling for k-means clustering. In
Approximation, Randomization,and Combinatorial Optimization. Algorithms and Techniques , pages15–28. Springer, 2009.[ANFSW19] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and JustinWard. Better guarantees for k-means and euclidean k-median byprimal-dual algorithms.
SIAM Journal on Computing , (0):FOCS17–97, 2019.[AV07] David Arthur and Sergei Vassilvitskii. k-means++: The advantagesof careful seeding. In
Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms , pages 1027–1035. Societyfor Industrial and Applied Mathematics, 2007.13BBV04] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe.
Convexoptimization . Cambridge university press, 2004.[BERS19] Anup Bhattacharya, Jan Eube, Heiko R¨oglin, and Melanie Schmidt.Noisy, greedy and not so greedy k-means++. arXiv preprintarXiv:1912.00653 , 2019.[BLHK16a] Olivier Bachem, Mario Lucic, Hamed Hassani, and Andreas Krause.Fast and provably good seedings for k-means. In
Advances in neuralinformation processing systems , pages 55–63, 2016.[BLHK16b] Olivier Bachem, Mario Lucic, S Hamed Hassani, and AndreasKrause. Approximate k-means++ in sublinear time. In
ThirtiethAAAI Conference on Artificial Intelligence , 2016.[BLSS16] Johannes Bl¨omer, Christiane Lammersen, Melanie Schmidt, andChristian Sohler. Theoretical analysis of the $k$-means algorithm -A survey.
CoRR , abs/1602.08254, 2016.[BMV +
12] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, RaviKumar, and Sergei Vassilvitskii. Scalable k-means++.
Proceedingsof the VLDB Endowment , 5(7):622–633, 2012.[BR20] Aditya Bhaskara and Aravinda Kanchana Rwanpathirana. Robustalgorithms for online k -means clustering. In Algorithmic LearningTheory , pages 148–173, 2020.[BVX19] Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedysampling for approximate clustering in the presence of outliers.In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alch´e-Buc,E. Fox, and R. Garnett, editors,
Advances in Neural InformationProcessing Systems 32 , pages 11148–11157. Curran Associates, Inc.,2019.[CAZ18] Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A practicalalgorithm for distributed clustering and outlier detection. In
Advances in neural information processing systems , pages 2248–2256, 2018.[CG13] Sanjay Chawla and Aristides Gionis. k-means–: A unified approachto clustering and outlier detection. In
Proceedings of the 2013SIAM International Conference on Data Mining , pages 189–197.SIAM, 2013.[CGPR20] Davin Choo, Christoph Grunau, Julian Portmann, and V´aclavRozhoˇn. k-means++: few more steps yield constant approximation,2020. 14Che08] Ke Chen. A constant factor approximation algorithm for k-medianclustering with outliers. In
Proceedings of the nineteenth annualACM-SIAM symposium on Discrete algorithms , pages 826–835,2008.[CKMN01] Moses Charikar, Samir Khuller, David M Mount, and GiriNarasimhan. Algorithms for facility location problems with outliers.In
Proceedings of the twelfth annual ACM-SIAM symposium onDiscrete algorithms , pages 642–651. Society for Industrial andApplied Mathematics, 2001.[DG17] Dheeru Dua and Casey Graff. UCI machine learning repository,2017.[FL11] Dan Feldman and Michael Langberg. A unified framework forapproximating and clustering data. In
Proceedings of the forty-thirdannual ACM symposium on Theory of computing , pages 569–578,2011.[FRS19] Zachary Friggstad, Mohsen Rezapour, and Mohammad RSalavatipour. Local search yields a ptas for k-means in doublingmetrics.
SIAM Journal on Computing , 48(2):452–480, 2019.[GKL +
17] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, andSergei Vassilvitskii. Local search methods for k-means with outliers.
Proceedings of the VLDB Endowment , 10(7):757–768, 2017.[GLZ17] Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering.
CoRR , abs/1703.01539, 2017.[GMM +
03] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani,and Liadan O’Callaghan. Clustering data streams: Theory andpractice.
IEEE transactions on knowledge and data engineering ,15(3):515–528, 2003.[HJLW18] Lingxiao Huang, Shaofeng Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering (with outliers) in doubling metrics. In , pages 814–825. IEEE, 2018.[Ind99] Piotr Indyk. Sublinear time algorithms for metric space problems.In
Proceedings of the thirty-first annual ACM symposium on Theoryof computing , pages 428–434, 1999.[IQM +
20] Sungjin Im, Mahshid Montazer Qaem, Benjamin Moseley, XiaoruiSun, and Rudy Zhou. Fast noise removal for k -means clustering,2020. 15KLS18] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constantapproximation for k-median and k-means with outliers via iterativerounding. In Proceedings of the 50th Annual ACM SIGACTSymposium on Theory of Computing , STOC 2018, page 646–659,New York, NY, USA, 2018. Association for Computing Machinery.[KMN +
04] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine DPiatko, Ruth Silverman, and Angela Y Wu. A local searchapproximation algorithm for k-means clustering.
ComputationalGeometry , 28(2-3):89–112, 2004.[KSS04] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simplelinear time (1+/spl epsiv/)-approximation algorithm for k-meansclustering in any dimensions. In , pages 454–462. IEEE, 2004.[LG18] Shi Li and Xiangyu Guo. Distributed k-clustering for data withheavy noise.
CoRR , abs/1810.07852, 2018.[Li20] Min Li. The bi-criteria seeding algorithms for two variants ofk-means problem.
Journal of Combinatorial Optimization , 2020.[Liu96] Jun S Liu. Metropolized independent sampling with comparisonsto rejection sampling and importance sampling.
Statistics andcomputing , 6(2):113–119, 1996.[Llo82] Stuart Lloyd. Least squares quantization in pcm.
IEEE transactionson information theory , 28(2):129–137, 1982.[LS19] Silvio Lattanzi and Christian Sohler. A better k-means++algorithm via local search. In
International Conference on MachineLearning , pages 3662–3671, 2019.[Met02] Ramgopal Reddy Mettu.
Approximation algorithms for np-hardclustering problems . PhD thesis, 2002.[MNV09] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan.The planar k-means problem is np-hard. In
International Workshopon Algorithms and Computation , pages 274–285. Springer, 2009.[MOP04] Adam Meyerson, Liadan O’callaghan, and Serge Plotkin. A k-median algorithm with running time independent of data size.
Machine Learning , 56(1-3):61–87, 2004.[MP04] Ramgopal R Mettu and C Greg Plaxton. Optimal time bounds forapproximate clustering.
Machine Learning , 56(1-3):35–60, 2004.[MRR +
53] Nicholas Metropolis, Arianna W Rosenbluth, Marshall NRosenbluth, Augusta H Teller, and Edward Teller. Equation ofstate calculations by fast computing machines.
The journal ofchemical physics , 21(6):1087–1092, 1953.16ORSS13] Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, andChaitanya Swamy. The effectiveness of lloyd-type methods forthe k-means problem.
Journal of the ACM (JACM) , 59(6):1–22,2013.[Roz20] V´aclav Rozhoˇn. Simple and sharp analysis of k-means || , 2020.[Yao] Andrew Chi-Chin Yao. Probabilistic computations: Toward aunified measure of complexity. In , pages 222–227. IEEE.17 ppendices A Preparatory lemmas
More notation
We start by setting up some more notation. Note that theoptimal clustering C ∗ splits the dataset X into two parts: the set of inliers X ∗ in with ϕ ( X ∗ in , C ∗ ) = OPT and the set of outliers X ∗ out with | X ∗ out | = z . Moreover,the set X ∗ in is naturally split in k sets X ∗ , X ∗ , . . . , X ∗ k , where X ∗ j is the set ofpoints x ∈ X ∗ in for which ϕ ( x, C ) = ϕ ( x, c ∗ j ). When using the notation τ Θ ( X, C ),we often drop the subscript Θ when it is clear from context and write just τ ( X, C ).For a single point c , we drop parentheses and write ϕ ( X, c ) instead of ϕ ( X, { c } )and similarly for τ . We define ϕ − z ( X, C ) = min X out , | X out | = z ϕ ( X \ X out , C ). Weuse out Θ ( X, C ) to denote the subset of points x ∈ X such that τ Θ ( x, C ) = Θand in Θ ( X, C ) to denote the subset of points x ∈ X such that τ Θ ( x, C ) < Θ.We assume that the minimum distance between any two points in X is lowerbounded by 1 and their maximum distance is upper bounded by ∆.In this section we collect several useful lemmas that are either well-known orare variations of well-known results. First, we recall the well-known Chernoffbounds and then, for completeness, prove a generalization of sampling lemmasthat were used to analyze k -means ++ [AV07]. These were, in slightly differentforms, also proved in [Li20] or [BVX19]. Theorem 7 (Chernoff bounds) . Suppose X , . . . , X n are independent randomvariables taking values in { , } . Let X denote their sum. Then for any ≤ δ ≤ we have P ( X ≤ (1 − δ ) E [ X ]) ≤ e − E [ X ] δ / and P ( X ≥ (1 + δ ) E [ X ]) ≤ e − E [ X ] δ / . Moreover, for any δ ≥ we haveP ( X ≥ (1 + δ ) E [ X ]) ≤ e − E [ X ] δ/ . Next, we state basic facts about metric spaces and R d used to prove thesampling lemmas below. Fact 1 (cf. [AV07]) . Let A be a subset of R d . Then inf µ ∈ R d ϕ ( A, µ ) = ϕ ( A, µ ( A )) where µ ( A ) = (cid:80) x ∈ A x/ | A | is the mean of A . Fact 2 (cf. [AV07]) . Let a, b, c ∈ M be three points in an arbitrary metric space.Then ϕ ( a, c ) ≤ ϕ ( a, b ) + ϕ ( b, c ))18 nd, more generally, for C ⊆ M and ε > we have ϕ ( a, C ) − ϕ ( b, C ) ≤ ε · ϕ ( b, C ) + (cid:18) ε (cid:19) ϕ ( a, b ) , where the former inequality is a special case for ε = 1 and C = { c } .Proof. Let d denote the distance function of M . For c ∈ C such that d ( b, C ) = d ( b, c ) we have ϕ ( a, C ) = d ( a, C ) = d ( a, c ) ≤ ( d ( a, b ) + d ( b, c )) = ( d ( a, b ) + d ( b, C )) = d ( a, b ) + d ( b, C ) + 2 d ( a, b ) d ( b, C ) ≤ d ( a, b ) + d ( b, C ) + 1 ε d ( a, b ) + εd ( b, C )= ϕ ( a, b ) + ϕ ( b, C ) + 1 ε ϕ ( a, b ) + εϕ ( b, C )which is the the needed inequality, up to rearrangement. We used that2 d ( a, b ) d ( b, C ) ≤ ε d ( a, b ) + εd ( b, C )is equivalent to ( √ εd ( b, C ) − d ( a, b ) / √ ε ) ≥ . Fact 3.
Let M be an arbitrary metric space with distance function d . Then forany constant T > , the function d (cid:48) : d (cid:48) ( a, b ) = min( d ( a, b ) , T ) is also a distancefunction.Proof. For any a, b, c ∈ M , we need to prove d (cid:48) ( a, c ) ≤ d (cid:48) ( a, b ) + d (cid:48) ( b, c ). If d ( a, b ) ≥ T or d ( b, c ) ≥ T , respectively, we have d (cid:48) ( a, b ) = T or d (cid:48) ( b, c ) = T ,respectively, and, hence, d (cid:48) ( a, b ) + d (cid:48) ( b, c ) ≥ T ≥ d (cid:48) ( a, c ), as needed. Otherwise,we have d (cid:48) ( a, c ) ≤ d ( a, c ) ≤ d ( a, b ) + d ( b, c ) = d (cid:48) ( a, b ) + d (cid:48) ( b, c ), as needed.Fact 3 for example implies that in Fact 2 we can replace ϕ by τ Θ , as τ Θ ( a, b ) = min( d ( a, b ) , √ Θ) . It also implies Theorem 1, as the analysis of[AV07] holds for an arbitrary metric space. However, this loses a factor of2 in approximation guarantee of Theorem 1 as we explain next. Proofs ofTheorem 1 in [Li20, BVX19] instead generalize Lemmas 3.1 and 3.2 in [AV07]for τ costs to get Theorem 1; this can recover the same constants as in [AV07].For completeness, we also reprove these lemmas here as Lemmas 2 and 4 sincewe refer to them later in the proofs. Also, we prove a version of Lemma 2 formetric spaces as Lemma 3. This is the reason why arguments for arbitrary metricspaces lose additional factor of 2 in approximation guarantees. The relevance ofLemma 3 comes from the fact that it implies our algorithms work in arbitrarymetric spaces, which is important for applications in Section 6.19 emma 2 (cf. Lemma 3.1 in [AV07]) . For any cluster A ⊆ R d we have | A | (cid:88) c ∈ A τ Θ ( A, c ) ≤ µ ∈ R d τ Θ ( A, µ ) . Proof.
Recall that for the cost function ϕ we have | A | (cid:80) c ∈ A ϕ ( A, c ) = 2 ϕ ( A, µ ( A )) by Lemma 3.1 in [AV07]. Let us fix anarbitrary µ ∈ R d and denote A in = in Θ ( A, µ ) and A out = out Θ ( A, µ ). Note that τ Θ ( A, µ ( A in )) ≤ τ Θ ( A, µ ) due to Fact 1. We have1 | A | (cid:88) c ∈ A τ Θ ( A, c ) = 1 | A | ( (cid:88) c ∈ A in τ Θ ( A in , c ) + (cid:88) c ∈ A in τ Θ ( A out , c ) + (cid:88) c ∈ A out τ Θ ( A, c )) ≤ | A | ( (cid:88) c ∈ A in ϕ ( A in , c ) + | A in || A out | Θ + | A out || A | Θ)= 1 | A | (2 | A in | ϕ ( A in , µ ( A in )) + | A in || A out | Θ + | A out || A | Θ) Lemma 3.1 in [AV07] ≤ ϕ ( A in , µ ( A in )) + 2 | A out | Θ ≤ τ Θ ( A, µ ( A in )) ≤ τ Θ ( A, µ ) . Lemma 3.
For any metric space M and subset of its points A we have | A | (cid:88) c ∈ A ϕ ( A, c ) ≤ µ ∈ M ϕ ( A, µ ) . Proof.
For any µ ∈ M , we can write1 | A | (cid:88) c ∈ A ϕ ( A, c ) = 1 | A | (cid:88) c ∈ A (cid:88) d ∈ A ϕ ( d, c ) ≤ | A | (cid:88) c ∈ A (cid:88) d ∈ A ϕ ( d, µ ) + ϕ ( µ, c ) Fact 2= 4 ϕ ( A, µ ) , as needed.The second sampling lemma states that sampling a point from a cluster A proportional to its τ -cost with respect to the current clustering C results in an8-approximation of the cost of A . Lemma 4 (cf. Lemma 3.2 in [AV07], Lemma 4 in [Li20] and Lemma 6 in[BVX19]) . For any cluster A ⊆ R d and an arbitrary point set C , we have (cid:88) c ∈ A τ Θ ( c, C ) τ Θ ( A, C ) τ Θ ( A, C ∪ { c } ) ≤ µ ∈ R d τ Θ ( A, µ ) . roof. Fix an arbitrary µ ∈ R d and c, d ∈ A . We have τ Θ ( c, C ) ≤ τ Θ ( c, d ) + τ Θ ( d, C ))by Fact 2. For a given c ∈ A we can then average over all d ∈ A to get τ Θ ( c, C ) ≤ | A | (cid:88) d ∈ A ( τ Θ ( c, d ) + τ Θ ( d, C )) = 2 | A | ( τ Θ ( A, c ) + τ Θ ( A, C )) . (1)This implies (cid:88) c ∈ A τ Θ ( c, C ) τ Θ ( A, C ) τ Θ ( A, C ∪ { c } ) ≤ (cid:88) c ∈ A | A | ( τ Θ ( A, c ) + τ Θ ( A, C )) τ Θ ( A, C ) (cid:88) d ∈ A min(Θ , ϕ ( d, C ) , ϕ ( d, c )) Eq. (1) ≤ (cid:88) c ∈ A | A | τ Θ ( A, c ) τ Θ ( A, C ) (cid:88) d ∈ A min(Θ , ϕ ( d, C )) + (cid:88) c ∈ A | A | τ Θ ( A, C ) τ Θ ( A, C ) (cid:88) d ∈ A min(Θ , ϕ ( d, c ))= (cid:88) c ∈ A | A | τ Θ ( A, c ) + 2 | A | (cid:88) c ∈ A τ Θ ( A, c )= 4 | A | (cid:88) c ∈ A τ Θ ( A, c ) ≤ τ Θ ( A, µ ) . Lemma 2Both Lemma 2 and Lemma 4 were proven in the same way as correspondinglemmas in [AV07], and Theorem 1 follows from these lemmas in the sameway as Lemma 3.3 in [AV07] follows from Lemmas 3.1 and 3.2 in [AV07] (cf.[BVX19, Li20]). Finally, we will use a simple corollary of Lemma 4.
Corollary 1.
For any cluster A ⊆ R d and an arbitrary point set C , sampling apoint c ∈ A proportional to τ Θ ( c, C ) results in τ Θ ( A, C ∪ { c } ) ≤
10 inf µ ∈ R d τ Θ ( A, µ ) , with probability at least .Proof. Follows from Lemma 4 by Markov inequality.
B Basic Oversampling Result
In this section we prove a formal and more general version of Theorem 2. Thefull generality of the theorem is needed at a later point in time. To obtainTheorem 2, plug in some arbitrary Θ ∈ [OPT / (2 εz ) , OPT / ( εz )] and δ = 0 . heorem 8 (cf. [ADK09]) . Let δ, ε ∈ (0 , be arbitrary and suppose we runAlgorithm 2 for (cid:96) = O ( k/ε · log 1 /δ ) steps to get output C . Then, with probabilityat least − δ we have τ Θ ( X ∗ in , C ) = O ( OPT + εz Θ) . Proof.
We refer to a cluster X ∗ i as unsettled if τ ( X ∗ i , C ) ≥ τ ( X ∗ i , µ ( X ∗ i )). Fixone iteration of Algorithm 2 and let C be its current set of centers. Suppose that τ ( X ∗ in , C ) ≥ εz Θ), since otherwise we are already done. We sample anew point c from X ∗ in with probability τ ( X ∗ in , C ) τ ( X ∗ in , C ) + τ ( X ∗ out , C ) ≥ εz Θ20 εz Θ + z Θ ≥ ε/ , where we used τ ( X ∗ in , C ) ≥ εz Θ, τ ( X ∗ out , C ) ≤ | X ∗ out | Θ = z Θ, and ε ≤ c is sampled from X ∗ in , the probability that it is sampledfrom an unsettled cluster is at least τ ( X ∗ in , C ) − τ ( X ∗ in , C ) ≥ − ≥ , where we used τ ( X ∗ in , C ) ≥ c is sampled from an unsettledcluster according to the τ -distribution, Corollary 1 tells us that the clusterbecomes settled with probability at least . Hence, we make a new clustersettled with probability at least ( ε/ · · = ε/ O ( k/ε · log 1 /δ ) iterations. We call each iteration good , ifeither τ ( X ∗ in , C ) ≤ εz Θ), or we make a new cluster settled. As eachiteration is good with probability at least ε/
20, the expected number of gooditerations is Ω( k log 1 /δ ). By Theorem 7, at least k iterations are good withprobability at least 1 − e − Ω(log 1 /δ ) ≥ − δ . This implies that in the end either τ ( X ∗ in , C ) ≤ εz Θ) and we are done, or all clusters are settled. Butthis implies τ ( X ∗ in , C ) ≤ (cid:80) i τ ( X ∗ i , µ ( X ∗ i )) ≤ (cid:80) i ϕ ( X ∗ i , µ ( X ∗ i )) = 10OPT,so we are again done. C Proof of Theorem 9
In this section we explain in more detail how to adapt the local-search algorithmof [LS19] in order to obtain an ( O (1 /ε ) , ε )-approximation algorithm for k -means with outliers.As explained in the main part of the paper, the main idea is to run thelocal-search algorithm with penalties. That is, each point is sampled as anew cluster center proportional to its τ Θ -cost, where Θ = Θ( OPT εz ). In total,we run O ( k log log k + k log(1 /ε ) /ε ) local-search steps. Hence, the algorithmcan be implemented in time ˜ O ( nk/ε ) (cf. [CGPR20], Section 3.3). After thefirst O ( k log log k ) search steps, we show that the current cost is with constantprobability at most O (OPT /ε ). The analysis is quite similar to the analysis of[LS19]. Afterwards, within the next O ( k log(1 /ε ) /ε ) steps, we show that the22umber of points with a squared distance of at least 10Θ (The extra factor of 10is needed in the analysis for technical reasons) drops below (1 + ε ) z with constantprobability. Hence, we can declare all those points as outliers and the ϕ -costof each remaining point is then only at most 10 times larger than its τ Θ cost.Therefore, ϕ ( X, C ) can still be bounded by O (OPT /ε ), as needed. The formalalgorithm description is given below as Algorithm 4. Recall that Θ depends onOPT and as OPT is unknown to the algorithm, it needs to be ”guessed”. Alsonote that due to technicalities in the analysis, the algorithm does not simplyoutput the solution one obtains after running all the local-search steps, but isalso considers all ”intermediate” solutions as final solutions. The reason is that,unlike the total cost, the number of points with a squared distance of at least10Θ is not necessarily decreasing monotonically. Algorithm 4
Local-search++ with outliersInput: X , k , z , ε , ∆Assumptions: k + z < | X | , ε ∈ (0 , C ← ∅ X out ← ∅ for i = 0 to (cid:100) log( n ∆ ) (cid:101) do OP T guess ← i , β ← /ε , Θ ← βOP T guess z C i, ← k -means ++ ( X, k,
Θ) (Algorithm 2) for j ← , , . . . , O ( k log log k + k · log(1 /ε ) ε ) do C i,j ← Local-search++ step(
X, C i,j − , Θ) (Algorithm 3) X i,jout ← out ( X, C ) if | X i,jout | ≤ (1 + ε ) z and ϕ ( X \ X i,jout , C i,j ) < ϕ ( X \ X out , C ) then C ← C i,j X out ← X i,jout end if end for end for return ( C, X out ) Theorem 9 (Formal version of Theorem 3) . Let ε ∈ (0 , be arbitrary and X ⊂ R d be a set of n points. Furthermore, let k ∈ N and z ∈ N be such that k + z < n (otherwise, the cost is ). Then, with positive constant probability,Algorithm 4 returns a set X out containing at most (1 + ε ) z points and a set of k cluster centers C such that ϕ ( X \ X out , C ) = O ( OPT /ε ) for OPT = min C (cid:48) ,X (cid:48) out : | C (cid:48) | = k, | X (cid:48) out | = z ϕ ( X \ X (cid:48) out , C (cid:48) ) . Before going into the actual proof of Theorem 9, we first need to introduceadditional notation. As before, X denotes the set of all input points and X ∗ out denotes the set of all input points that are declared as outliers by the optimal23olution under consideration. Moreover, let X ∗ far ⊆ X \ X ∗ out denote the setconsisting of those points in X \ X ∗ out that have a squared distance of at least Θto the closest optimal cluster center. Now, we define X ∗ in = X \ ( X ∗ out ∪ X ∗ far ).In other words, we split X = X ∗ in (cid:116) X ∗ far (cid:116) X ∗ out (in Section 2 we used X = X ∗ in (cid:116) X ∗ out ).For a fixed candidate clustering C and for i ∈ [ k ], X i ⊆ X ∗ in denotes the setconsisting of those points in X ∗ in for which the closest center c ∈ C is the center c i . Similarly, let X ∗ i denote the set consisting of those points in X ∗ in that areclustered to the center c ∗ i . Moreover, X far,i denotes the set of points in X ∗ far that are closest to c i among all candidate centers in C .We say that the optimal cluster X ∗ i is ignored by C if for every x ∈ X ∗ i we have τ Θ ( x, C ) = Θ. We define I to be the set of indices correspondingto ignored clusters. Next, we define a function f : [ k ] \ I (cid:55)→ [ k ] with f ( i ) =arg min j ∈ k d ( c ∗ i , c j ) for every i ∈ [ k ] \ I . Intuitively, the function f maps eachoptimal cluster that is not ignored to the closest candidate center (cf. Fig. 3).Now, for each candidate center, either zero, one or more than one optimal clustermaps to it. If none of the optimal clusters map to it, then we call the candidatecenter lonely and we denote the indices of all the lonely candidate centers by L . If exactly one optimal cluster maps to a given candidate center, then wecall the candidate center matched and H denotes the set of indices of all thematched candidate centers. Without loss of generality, we assume that f ( i ) = i for every i ∈ H . Finally, if more than one optimal cluster maps to a candidatecenter, then we call it popular . We refer to an optimal cluster X ∗ i with clusterindex i ∈ [ k ] \ I as a cheap cluster if X ∗ i maps to some popular candidate centerwith corresponding index j and, moreover, X ∗ i has the smallest cost with respectto C among all optimal clusters that are mapped to the popular candidatecenter with index j . We let T denote the set of indices corresponding to cheapclusters. Let b : [ k ] \ T (cid:55)→ H ∪ L be an arbitrary bijection with b ( i ) = i for every i ∈ H (cf. Fig. 3). That is, each optimal cluster center that is not cheap getsuniquely assigned to a candidate center which is not popular while preservingthe matching between matched cluster centers. 
Note that such a bijection existssince the number of cheap optimal clusters is equal to the number of popularcandidate centers.For i ∈ H ∪ L , we defineoutliers( c i ) = { x ∈ in ( X ∗ out , C ) : i = arg min j ∈ [ k ] ϕ ( x, c i ) } . That is, outliers( c i ) contains the real outliers that are closest to the candidatecenter c i and which have a squared distance of at most 10Θ to c i . The factor of10 comes into play because of the following lemma. Lemma 5.
Let C denote the current set of candidate centers and assume thereexists some x ∈ X ∗ i with ϕ ( x, C ) ≥ . Then, X ∗ i is ignored by C .Proof. Let x (cid:48) ∈ X ∗ i be arbitrary and let c (cid:48) denote the candidate center in C thatis closest to x (cid:48) . We have 24 Lpopular cheap fb T IcheapT X ∗ X ∗ X ∗ k X ∗ X ∗ X ∗ k c c popular c k c c c k matched lonely Figure 3: The figure captures several definitions. First, function f maps eachoptimal cluster X ∗ i to its closest candidate center c j ∈ C . Candidate centerswith exactly one optimal cluster matched to them are denoted by H and forconvenience we assume that f maps X ∗ i to c i there. Then there are popularcenters with more than one match and for each such center we identify thecheapest matched cluster and put it in the set T . Finally, there are lonely centerswith no matches. Ignored optimal clusters are not matched to any center by f . Second, we construct a bijection b that is used later for a double countingargument. In this bijection, we set b ( i ) = f ( i ) = i for clusters i matched tocenters in H . Then, we omit cheap clusters T and popular candidate centers,and extend b to an arbitrary bijection between the remaining clusters and lonelycandidate centers. || x (cid:48) − c (cid:48) || ≥ || x − c (cid:48) || − || x (cid:48) − x ||≥ || x − c (cid:48) || − || x (cid:48) − c ∗ i || − || x − c ∗ i ||≥ √ − √ Θ − √ Θ x, x (cid:48) / ∈ X ∗ far > √ Θ 25nd therefore ϕ ( x (cid:48) , C ) > Θ. Hence, X ∗ i is ignored.Now we define the notion of reassignment cost, similarly as it was done in[LS19]. For each lonely candidate center c (cid:96) , reassign( X, C, c (cid:96) ) is an upper boundfor the increase in τ cost due to removing c (cid:96) from the current set of candidatecenters C . Similarly, for each matched candidate center c h , reassign( X, C, c h ) isan upper bound for the increase in cost due to removing c h from the current setof candidate centers C , ignoring the cost increase for all points in the optimalcluster X ∗ h . Definition 1.
Let h ∈ H be arbitrary and let X ∗ h be the optimal cluster that iscaptured by the center c h . The reassignment cost of c h is defined asreassign ( X, C, c h ) = τ ( X ∗ in \ X ∗ h , C \ { c h } ) − τ ( X ∗ in \ X ∗ h , C ) + Θ | X far,h | + (Θ | outliers ( c h ) | − τ ( outliers ( c h ) , c h )) . For (cid:96) ∈ L , the reassignment cost of c (cid:96) is defined asreassign ( X, C, c (cid:96) ) = τ ( X ∗ in , C \ { c (cid:96) } ) − τ ( X ∗ in , C ) + Θ | X far,(cid:96) | + (Θ | outliers ( c (cid:96) ) | − τ ( outliers ( c (cid:96) ) , c (cid:96) )) . Intuitively, the first two terms in the definition correspond to the increaseof the cost for all the points in X ∗ in \ X ∗ h and X ∗ in , respectively. The third termcorresponds to the increase of the cost for all points in X ∗ far . This increase isbounded by Θ, their maximal cost. Finally, the last two terms in the definitionbelow upper bound the increase of the cost of outliers. Here we again crudelyupper bound the cost of each outlier after removing the candidate center by Θ. Fact 4.
For a matched cluster h we havereassign ( X, C, c h ) ≥ τ ( X \ X ∗ h , C \ { c h } ) − τ ( X \ X ∗ h , C ) . Similarly, for a lonely cluster (cid:96) we havereassign ( X, C, c (cid:96) ) ≥ τ ( X, C \ { c (cid:96) } ) − τ ( X, C ) . Proof.
See above discussion.
Lemma 6 (cf. Lemma 4 in [LS19]) . For i ∈ H ∪ L we havereassign ( X, C, c i ) ≤ τ ( in Θ ( X i , C ) , C ) + 24 ϕ ( in Θ ( X i , C ) , C ∗ ) + Θ | X far,i | + Θ | outliers ( c i ) | − τ ( outliers ( c i ) , c i ) . roof. We only present the case i ∈ H , as the case i ∈ L is similar and eveneasier. We observe that reassign( X, C, c i ) = τ ( X i \ X ∗ i , C \{ c i } ) − τ ( X i \ X ∗ i , C )+Θ | X far,i | + Θ | outliers( c i ) | − τ (outliers( c i ) , c i ), since vertices in clusters otherthan X i will still be assigned to their current center. Hence, it suffices to showthat τ ( X i \ X ∗ i , C \{ c i } ) − τ ( X i \ X ∗ i , C ) ≤ τ (in Θ ( X i , C ) , C )+24 τ (in Θ ( X i , C ) , C ∗ ) . As τ (out Θ ( X i \ X ∗ i , C ) , C \ { c i } ) − τ (out Θ ( X i \ X ∗ i , C ) , C ) = 0 , this is equivalent to showing that τ (in Θ ( X i \ X ∗ i , C ) , C \ { c i } ) − τ (in Θ ( X i \ X ∗ i , C ) , C ) ≤ τ (in Θ ( X i , C ) , C ) + 24 τ (in Θ ( X i , C ) , C ∗ ) . The cost contribution of each point in in Θ ( X i \ X ∗ i , C ) with respect to C isequal to the squared distance to the closest center in C and hence the analysis ofLattanzi and Sohler more or less directly applies. For the sake of completeness,we still present their analysis, which makes use of the following lemma.To upper bound the cost increase one incurs by removing c i , we assign eachpoint p ∈ in Θ ( X i \ X ∗ i , C ) ∩ X ∗ j , j (cid:54) = i , to the candidate center that captures thecluster X ∗ j . By computing an upper bound for the clustering cost with respect tothat assignment, we directly get an upper bound for τ (in Θ ( X i \ X ∗ i , C ) , C \ { c i } ).We upper bound the cost increase in two steps. First, we move each point p ∈ in Θ ( X i \ X ∗ i , C ) ∩ X ∗ j , j (cid:54) = i , to the cluster center c ∗ j of X ∗ j and we call theresulting multiset Q i . Let q p denote the point in Q i corresponding to p . Amongthe centers in C , the closest center to q p is now the center that captured theoptimal center X ∗ j . Thus, it is not equal to c i , as c i captures exactly one optimalcenter, namely X ∗ i . Thus, we obtain τ ( q p , C \ { c i } ) − τ ( p, C ) = τ ( q p , C ) − τ ( p, C ) q p not assigned to c i (2) ≤ τ ( p, C ) + 11 τ ( p, q p ) (Facts 2 and 3 with ε = 1 / τ ( p, C ) + 11 τ ( p, C ∗ ) . This implies τ ( q p , C \ { c i } ) = τ ( q p , C ) ≤ τ ( p, C ) + 11 τ ( p, C ∗ ) . (3)Next, we upper bound the cost increase by moving each point q p ∈ Q i backto its original position p . Formally, we bound27 ( p, C \ { c i } ) − τ ( q p , C \ { c i } ) (4) ≤ τ ( q p , C \ { c i } ) + 11 · τ ( p, q p ) Facts 2 and 3 ≤ (cid:18) τ ( p, C ) + 11 · τ ( p, C ∗ ) (cid:19) + 11 · τ ( p, C ∗ ) Eq. (3)= 11100 τ ( p, C ) + 13 · τ ( p, C ∗ )Combining the two inequalities Eqs. (2) and (4) yields τ ( p, C \ { c i } ) − τ ( p, C ) (5)= ( τ ( p, C \ { c i } ) − τ ( q p , C \ { c i } )) + ( τ ( q p , C \ { c i } ) − τ ( p, C )) ≤ τ ( p, C ) + 13 · τ ( p, C ∗ ) + 110 τ ( p, C ) + 11 · τ ( p, C ∗ ) Eqs. (2) and (4) ≤ τ ( p, C ) + 24 · τ ( p, C ∗ ) . Summing up the inequality Eq. (5) for all points p ∈ in Θ ( X i \ X ∗ i , C ) thenyields τ (in Θ ( X i \ X ∗ i , C ) , C \ { c i } ) − τ (in Θ ( X i \ X ∗ i , C ) , C ) ≤ τ (in Θ ( X i \ X ∗ i , C ) , C ) + 24 τ (in Θ ( X i \ X ∗ i , C ) , C ∗ ) ≤ τ (in Θ ( X i \ X ∗ i , C ) , C ) + 24 ϕ (in Θ ( X i , C ) , C ∗ )and thereforereassign( X, C, c i ) ≤ τ (in Θ ( X i , C ) , C ) + 24 ϕ (in Θ ( X i , C ) , C ∗ )+ Θ | X far,i | + Θ | outliers( c i ) | − τ (outliers( c i ) , c i ) . as needed. 
C.1 Decreasing Cost to O(1/ε)

In this subsection, we prove that while the cost of our solution is Ω(OPT/ε), one local search step decreases the cost by a factor of 1 − Θ(1/k) with constant probability (Lemma 8). This means that after ˜O(k) steps, we have τ_Θ(X, C) = O(OPT/ε) with constant probability (Lemma 9). This, by itself, is the easy part, and our proof is essentially the analysis of Lattanzi and Sohler [LS19]. The hard part is the generalization to Lemma 11 in Section C.2, which, roughly speaking, shows that as long as we have (1 + ε)z points with a big cost, we still make substantial progress in each step.

For the next definition, recall that b is the bijection between optimal clusters and candidate cluster centers that we defined earlier.

Definition 2.
A cluster index i ∈ [k] \ T is called awesome if

τ(X*_i, C) − reassign(X, C, c_b(i)) − 9·τ(X*_i, c*_i) > τ(X, C)/(100k).
Intuitively, awesome clusters are those for which sampling a point c from them leads to a large improvement in cost, provided that the sampled center is close to the mean c*_i. In particular, if τ(X*_i, c) ≤ 9·τ(X*_i, c*_i), swapping c_b(i) with c improves τ(X, C) by at least τ(X, C)/(100k). The next lemma shows that we sample from an awesome cluster with constant probability.
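As an illustration of how such a swap could look in code, here is a minimal sketch of one local-search step: sample a candidate proportional to its truncated cost, then keep the best improving swap. This is only our illustration of the swap mechanism discussed above, reusing the d2 and tau helpers from the sketch after Lemma 6; the actual local-search algorithm (Algorithm 4) is specified elsewhere in the paper.

    import random

    def total_cost(X, C, theta):
        return sum(tau(x, C, theta) for x in X)

    def local_search_step(X, C, theta, rng=random):
        # sample a new candidate center proportional to tau_Theta(x, C)
        weights = [tau(x, C, theta) for x in X]
        if sum(weights) == 0:
            return C  # every point already has cost zero
        c_new = rng.choices(X, weights=weights, k=1)[0]
        # try swapping c_new with every current center; keep the cheapest result
        best_C, best_cost = C, total_cost(X, C, theta)
        for i in range(len(C)):
            candidate = C[:i] + C[i + 1:] + [c_new]
            cost = total_cost(X, candidate, theta)
            if cost < best_cost:
                best_C, best_cost = candidate, cost
        return best_C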
Lemma 7. Let C denote the current set of candidate centers. Let ε ∈ (0, 1] be arbitrary and β := 100/ε. If τ(X, C) ≥ β·OPT and Θ ≤ 2β·OPT/z, then

∑_{i ∈ [k]\T, i awesome} τ(X*_i, C) ≥ τ(X, C)/10.

Proof.
We bound the cost of clusters that are not awesome:

∑_{i ∈ [k]\T, i not awesome} τ(X*_i, C)
≤ ∑_{i ∈ [k]\T} reassign(X, C, c_b(i)) + 9·OPT + τ(X, C)/100   (Definition 2)
≤ ∑_{i ∈ [k]\T} ( (21/100)·τ(in_Θ(X_b(i), C), C) + 24·ϕ(in_Θ(X_b(i), C), C*) + Θ|X_far,b(i)| + Θ|outliers(c_b(i))| − τ(outliers(c_b(i)), c_b(i)) ) + 9·OPT + τ(X, C)/100   (Lemma 6)
= ∑_{i ∈ H∪L} ( (21/100)·τ(in_Θ(X_i, C), C) + 24·ϕ(in_Θ(X_i, C), C*) + Θ|X_far,i| + Θ|outliers(c_i)| − τ(outliers(c_i), c_i) ) + 9·OPT + τ(X, C)/100   (b([k]\T) = H ∪ L)
≤ (21/100)·τ(X, C) + 24·OPT + OPT + zΘ + 9·OPT + τ(X, C)/100   (∑_i Θ|X_far,i| ≤ OPT; X_i ⊆ X*_in)
≤ (22/100)·τ(X, C) + (2β + 34)·OPT   (zΘ ≤ 2β·OPT)
≤ (3/10)·τ(X, C).   (β ≥ 100, τ(X, C) ≥ β·OPT)

Observe that from the definition of cheap clusters it follows that

∑_{i ∈ T} τ(X*_i, C) ≤ (1/2)·τ(X, C), (6)

because every cheap cluster can be matched with a different cluster with at least as big a cost that is also matched to the same popular cluster center.

Finally, we bound

∑_{i ∈ [k]\T, i awesome} τ(X*_i, C)
= τ(X*_in, C) − ∑_{i ∈ [k]\T, i not awesome} τ(X*_i, C) − ∑_{i ∈ T} τ(X*_i, C)
≥ τ(X, C) − τ(X*_out, C) − τ(X*_far, C) − (3/10)·τ(X, C) − (1/2)·τ(X, C)   (Eq. (6))
≥ (1/5)·τ(X, C) − zΘ − OPT   (τ(X*_out, C) ≤ zΘ, τ(X*_far, C) ≤ OPT)
≥ τ(X, C)/10.   (τ(X, C) ≥ β·OPT, Θ ≤ 2β·OPT/z)

Lemma 8.
Let C denote the current set of candidate centers. Let ε ∈ (0, 1] be arbitrary and β := 100/ε. Suppose that τ(X, C) ≥ β·OPT and Θ ≤ 2β·OPT/z. Then, with probability at least 1/20, one local search step of Algorithm 4 results in a new set of candidate centers C′ with τ(X, C′) ≤ (1 − 1/(100k))·τ(X, C).

Proof. Let c′ denote the point sampled by Local-search++ with outliers. The probability that c′ ∈ X*_i for some awesome cluster index i is at least

(1/τ(X, C)) · τ(X, C)/10 = 1/10,

according to Lemma 7. Let i denote an arbitrary awesome cluster index. Lemma 4 implies that E[τ(X*_i, c′) | c′ ∈ X*_i] ≤ (9/2)·τ(X*_i, c*_i). Hence, by a simple application of Markov's inequality, we obtain

P[τ(X*_i, c′) ≤ 9·τ(X*_i, c*_i) | c′ ∈ X*_i] ≥ 1/2.

Putting things together, the probability that the sampled point c′ is contained in X*_i for some awesome cluster index i and that additionally

τ(X*_i, c′) ≤ 9·τ(X*_i, c*_i) (7)

holds is at least (1/10)·(1/2) = 1/20. Consider the case that i ∈ H. We can upper bound τ(X, C′) as follows:

τ(X, C′) ≤ τ(X, (C \ {c_b(i)}) ∪ {c′})   (swap c_b(i) and c′)
= τ(X, C) + (τ(X, (C \ {c_b(i)}) ∪ {c′}) − τ(X, C))
= τ(X, C) + τ(X \ X*_i, (C \ {c_b(i)}) ∪ {c′}) − τ(X \ X*_i, C) + τ(X*_i, (C \ {c_b(i)}) ∪ {c′}) − τ(X*_i, C)
≤ τ(X, C) + reassign(X, C, c_b(i)) + τ(X*_i, c′) − τ(X*_i, C)   (Fact 4)
≤ τ(X, C) + reassign(X, C, c_b(i)) + 9·τ(X*_i, c*_i) − τ(X*_i, C)   (Eq. (7))
≤ τ(X, C) − τ(X, C)/(100k).   (Definition 2)

The case i ∉ H is very similar. Thus, we obtain that with probability at least 1/20 we have τ(X, C′) ≤ (1 − 1/(100k))·τ(X, C), as desired.

Lemma 9.
Let C denote the current set of candidate centers. Let ε ∈ (0, 1] be arbitrary and β := 100/ε. Let α > 1 be arbitrary such that τ(X, C) ≤ α·β·OPT. Furthermore, assume that Θ ≤ 2β·OPT/z. Let C_0 := C and, for t ≥ 0, let C_{t+1} denote the set of centers obtained by running Algorithm 4 with input centers C_t. Then, with positive constant probability p > 0, we have τ(X, C_T) ≤ β·OPT with T := ⌈4000·k·log α⌉ = O(k log α).

Proof. For t ≥ 0, let X_t denote the indicator variable for the event that τ(X, C_{t+1}) > max((1 − 1/(100k))·τ(X, C_t), β·OPT). According to Lemma 8, E[X_t] ≤ 19/20. Hence, with X := ∑_{t=0}^{T−1} X_t, we have E[X] ≤ (19/20)·T. Thus, Markov's inequality implies that there exists a constant p > 0 with Pr[X ≤ (39/40)·T] ≥ p. In that case, there are at least T/40 ≥ 100·k·log α iterations t for which X_t = 0, and therefore

τ(X, C_T) ≤ max((1 − 1/(100k))^{100·k·log α}·τ(X, C_0), β·OPT) ≤ max(e^{−log α}·α·β·OPT, β·OPT) = β·OPT,

as desired.

C.2 Decreasing the Number of Outliers to (1 + ε)z

We are now ready to prove Lemma 11. It roughly states that with probability Ω(ε) we sufficiently improve our solution, provided that there are at least (1 + ε)z points with a squared distance of at least 10Θ to the current solution. Lemma 12 then follows as a simple consequence of Lemma 11. Roughly speaking, it states that starting with a set of centers C with a cost of O((1/ε)·OPT), within O(k/ε) local search steps one obtains a set of centers C′ such that at most (1 + ε)z points have a squared distance of at least 10Θ to C′, with positive constant probability.

Let g be the number of points in X*_out that have a squared distance of less than 10Θ to the closest center in C. Note that by our definition of outliers(c), we have

∑_{i ∈ [k]} |outliers(c_i)| = g ≤ z. (8)

Definition 3.
A cluster index i ∈ [k] \ T is called good if

τ(X*_i, C) − reassign(X, C, c_b(i)) − 9·τ(X*_i, c*_i) > (τ(X, C) − zΘ)/(100k).
Lemma 10.
Let C denote the current set of candidate centers. Let ε ∈ (0, 1] be arbitrary and β := 100/ε. If for some ε′ ≥ ε there are (1 + ε′)z different points with a squared distance of at least 10Θ to the closest candidate center in C, and Θ ≥ β·OPT/z, then

∑_{i ∈ [k]\T, i good} τ(X*_i, C) ≥ (τ(X, C) − zΘ)/10.

Proof.
Recall our assumption that exactly (1 + ε′)z points have a squared distance of at least 10Θ to the set C of candidate centers, i.e.,

|out_10Θ(X, C)| = (1 + ε′)z. (9)

Moreover, we defined g as the number of points in X*_out that have a squared distance of less than 10Θ to C, so

|out_10Θ(X*_out, C)| = z − g. (10)

Hence,

|out_10Θ(X*_in, C)| (11)
= |out_10Θ(X, C)| − |out_10Θ(X*_out, C)| − |out_10Θ(X*_far, C)|
≥ (1 + ε′)z − (z − g) − |X*_far|   (Eqs. (9) and (10))
≥ ε′z + g − OPT/Θ.   (OPT ≥ |X*_far|·Θ)

We will use the following bound:

(τ(X, C) − zΘ)/100 (12)
≤ (τ(X*_in, C) + τ(X*_far, C) + τ(in_10Θ(X*_out, C), C) − gΘ)/100
≤ (τ(X*_in, C) + τ(in_10Θ(X*_out, C), C) − gΘ)/100 + OPT/100.

Now we bound

∑_{i ∈ [k]\T, i not good} τ(X*_i, C)
≤ ∑_{i ∈ [k]\T} reassign(X, C, c_b(i)) + 9·OPT + (τ(X, C) − zΘ)/100   (Definition 3)
≤ ∑_{i ∈ [k]\T} ( (21/100)·τ(in_Θ(X_b(i), C), C) + 24·ϕ(in_Θ(X_b(i), C), C*) + Θ|X_far,b(i)| + Θ|outliers(c_b(i))| − τ(outliers(c_b(i)), c_b(i)) ) + 9·OPT + (τ(X, C) − zΘ)/100   (Lemma 6)
= ∑_{i ∈ H∪L} ( (21/100)·τ(in_Θ(X_i, C), C) + 24·ϕ(in_Θ(X_i, C), C*) + Θ|X_far,i| + Θ|outliers(c_i)| − τ(outliers(c_i), c_i) ) + 9·OPT + (τ(X, C) − zΘ)/100   (b([k]\T) = H ∪ L)
≤ (21/100)·τ(in_Θ(X*_in, C), C) + 24·OPT + OPT + ∑_{i ∈ [k]} (Θ|outliers(c_i)| − τ(outliers(c_i), c_i)) + 9·OPT + (τ(X, C) − zΘ)/100   (X_i ⊆ X*_in)
= (21/100)·τ(in_Θ(X*_in, C), C) + gΘ − τ(in_10Θ(X*_out, C), C) + 34·OPT + (τ(X, C) − zΘ)/100   (Eq. (8))
≤ (21/100)·τ(in_Θ(X*_in, C), C) + gΘ − τ(in_10Θ(X*_out, C), C) + 35·OPT + (τ(X*_in, C) − gΘ + τ(in_10Θ(X*_out, C), C))/100.   (Eq. (12))

Next, we make use of the inequality

∑_{i ∈ T} τ(X*_i, C) ≤ (1/2)·∑_{i ∈ [k], i not ignored} τ(X*_i, C). (13)

This follows as a cheap cluster is not ignored, and each cheap cluster can be matched with a different cluster that is not ignored, has at least the same cost, and is matched to the same popular cluster center. Crucially, we now use Lemma 5, which says that whenever a cluster X*_i is not ignored, we have X*_i = in_10Θ(X*_i, C).

Putting things together, we get

∑_{i ∈ [k]\T, i good} τ(X*_i, C)
= τ(X*_in, C) − ∑_{i ∈ [k]\T, i not good} τ(X*_i, C) − ∑_{i ∈ T} τ(X*_i, C)
≥ τ(in_10Θ(X*_in, C), C) + τ(out_10Θ(X*_in, C), C)
  − ( (21/100)·τ(in_Θ(X*_in, C), C) + gΘ − τ(in_10Θ(X*_out, C), C) + 35·OPT + (τ(X*_in, C) − gΘ + τ(in_10Θ(X*_out, C), C))/100 )
  − (1/2)·∑_{i ∈ [k], i not ignored} τ(X*_i, C)   (Eq. (13))
≥ τ(in_10Θ(X*_in, C), C) + ((ε′z + g)Θ − OPT)
  − ( (21/100)·τ(in_Θ(X*_in, C), C) + gΘ − τ(in_10Θ(X*_out, C), C) + 35·OPT + (τ(X*_in, C) − gΘ + τ(in_10Θ(X*_out, C), C))/100 )
  − (1/2)·τ(in_10Θ(X*_in, C), C)   (Lemma 5; Eq. (11))
≥ (1/10)·τ(in_10Θ(X*_in, C), C) + (1/10)·τ(in_10Θ(X*_out, C), C) + ε′zΘ − 36·OPT − (1/100)·gΘ
≥ (1/10)·τ(in_10Θ(X, C), C) + ε′zΘ − 36·OPT − (1/100)·ε′zΘ   (Eq. (11))
≥ (1/10)·(τ(in_10Θ(X, C), C) + ε′zΘ)   (ε′zΘ ≥ 100·OPT, since Θ ≥ β·OPT/z)
= (1/10)·(τ(in_10Θ(X, C), C) + τ(out_10Θ(X, C), C) − zΘ)   (Eq. (9))
= (τ(X, C) − zΘ)/10,

as needed.
Lemma 11.
Let C denote the current set of candidate centers. Let ε ∈ (0, 1] be arbitrary and β := 100/ε. Suppose that for our current candidate centers C there are at least (1 + ε)z points in X that have a squared distance of at least 10Θ to the closest center in C. Furthermore, assume that Θ ≥ β·OPT/z. Then, with probability at least ε/40, one step of Local-search++ with outliers results in a new set of candidate centers C′ with

τ(X, C′) − zΘ ≤ (1 − 1/(100k))·(τ(X, C) − zΘ).

Proof. Let c′ denote the point sampled by Local-search++ with outliers. The probability that c′ ∈ X*_i for some good cluster index i is at least

(τ(X, C) − zΘ)/(10·τ(X, C)) ≥ ((1 + ε)zΘ − zΘ)/(10·(1 + ε)zΘ) ≥ ε/20,   (τ(X, C) ≥ (1 + ε)zΘ; ε ≤ 1)

according to Lemma 10. Let i denote an arbitrary good cluster index. Lemma 4 implies that E[τ(X*_i, c′) | c′ ∈ X*_i] ≤ (9/2)·τ(X*_i, c*_i). Hence, by a simple application of Markov's inequality, we obtain

P[τ(X*_i, c′) ≤ 9·τ(X*_i, c*_i) | c′ ∈ X*_i] ≥ 1/2.

Putting things together, the probability that the sampled point c′ is contained in X*_i for some good cluster index i and that additionally

τ(X*_i, c′) ≤ 9·τ(X*_i, c*_i) (14)

holds is at least (ε/20)·(1/2) = ε/40. First, consider the case that i ∈ H. In that case, we can upper bound τ(X, C′) as follows:

τ(X, C′) ≤ τ(X, (C \ {c_b(i)}) ∪ {c′})   (swap c_b(i) and c′)
= τ(X, C) + (τ(X, (C \ {c_b(i)}) ∪ {c′}) − τ(X, C))
= τ(X, C) + τ(X \ X*_i, (C \ {c_b(i)}) ∪ {c′}) − τ(X \ X*_i, C) + τ(X*_i, (C \ {c_b(i)}) ∪ {c′}) − τ(X*_i, C)
≤ τ(X, C) + reassign(X, C, c_b(i)) + τ(X*_i, c′) − τ(X*_i, C)   (Fact 4)
≤ τ(X, C) + reassign(X, C, c_b(i)) + 9·τ(X*_i, c*_i) − τ(X*_i, C)   (Eq. (14))
≤ τ(X, C) − (τ(X, C) − zΘ)/(100k).   (Definition 3)

The case i ∉ H is very similar. Thus, we obtain

τ(X, C′) − zΘ ≤ (τ(X, C) − (τ(X, C) − zΘ)/(100k)) − zΘ = (1 − 1/(100k))·(τ(X, C) − zΘ),

as desired.

Lemma 12.
Let C denote the current set of candidate centers. Let ε ∈ (0, 1] be arbitrary and β := 100/ε. Let α > 1 be arbitrary such that τ(X, C) ≤ α·OPT. Assume that Θ ≥ β·OPT/z. Let C_0 := C and, for t ≥ 0, let C_{t+1} denote the set of centers obtained by running Local-search++ with outliers with input centers C_t. Then, with positive constant probability p > 0, there are at most (1 + ε)z points in X that have a squared distance of at least 10Θ to the closest candidate center in C_t for some t ≤ T, with T = O(log(α)·k/ε).

Proof. For t ≥ 0, we define the potential Φ(t) = τ(X, C_t) − zΘ. Now, let Y_t denote the indicator variable for the event that there are less than (1 + ε)z points with a squared distance of at least 10Θ to the closest center in C_t, or that Φ(t + 1) ≤ (1 − 1/(100k))·Φ(t). Note that the random variables Y_0, Y_1, …, Y_{T−1} are not independent. However, Lemma 11 implies that E[Y_i | Y_0, Y_1, …, Y_{i−1}] ≥ ε/40 for every i ∈ {0, …, T − 1} and every realization of the random variables Y_0, Y_1, …, Y_{i−1}. Thus, the random variable Y = ∑_{i=0}^{T−1} Y_i stochastically dominates the random variable Y′ = ∑_{i=0}^{T−1} Y′_i, where the Y′_i are independent Bernoulli variables equal to 1 with probability ε/40. Now, let T′ := ⌈100·k·log(α) + 1⌉ and T := ⌈80·T′/ε⌉. By a standard application of the Chernoff bound in Theorem 7, we get

P[Y′ ≤ T′] ≤ P[Y′ ≤ (1/2)·E[Y′]] ≤ e^{−Ω(E[Y′])} ≤ e^{−Ω(1)}.

Thus, there exists a positive constant p such that

P[Y ≥ T′] ≥ p.

Assume that Y ≥ T′. First, consider the case that there exists a t ∈ {0, …, T − 1} for which there are less than (1 + ε)z points in X with a squared distance of at least 10Θ to the closest center in C_t. In that case we are done. Thus, assume that this does not happen. In particular, this implies that

Φ(T − 1) ≥ (1 + ε)zΘ − zΘ = εzΘ > OPT.   (Θ ≥ β·OPT/z)

However, we also have

Φ(T − 1) ≤ (1 − 1/(100k))^{T′−1}·Φ(0) ≤ (1 − 1/(100k))^{100·k·log(α)}·α·OPT ≤ OPT,

a contradiction. This concludes the proof.

C.3 Putting things together
Lemma 13.
Consider some iteration i for which OPT ≤ OPT_guess ≤ 2·OPT. Then, with positive constant probability, there exists some iteration j of the inner loop such that |X^{i,j}_out| ≤ (1 + ε)z and ϕ(X \ X^{i,j}_out, C^{i,j}) = O(1/ε)·OPT.

Proof.
According to [BVX19, Li20], running Algorithm 2 results in a set of centers C^{i,0} such that

E[τ(X, C^{i,0})] = O(log k) · min_{C′: |C′|=k} τ(X, C′) ≤ O(log k)·(OPT + z·Θ) ≤ O(log k)·β·OPT.

By using Markov's inequality, we therefore have τ(X, C^{i,0}) = O(log k)·β·OPT with positive constant probability. Conditioned on that event, we use Lemma 9 to deduce that τ(X, C^{i,T}) ≤ β·OPT for some T = O(k·log(O(log k))) = O(k log log k) with positive constant probability. Then, we use Lemma 12 to deduce that with positive constant probability, there exists some T′ = O(log(1/ε)·k/ε) such that there are at most (1 + ε)z points in X with a squared distance of at least 10Θ to the closest center in C^{i,T+T′}, and furthermore

τ(X, C^{i,T+T′}) ≤ τ(X, C^{i,T}) ≤ β·OPT = O(1/ε)·OPT.

Hence, |X^{i,T+T′}_out| ≤ (1 + ε)z and

ϕ(X \ X^{i,T+T′}_out, C^{i,T+T′}) ≤ 10·τ(X \ X^{i,T+T′}_out, C^{i,T+T′}) ≤ O(1/ε)·OPT,

with T + T′ = O(k log log k + log(1/ε)·k/ε), as set in the inner loop of Algorithm 4.

Finally, we are ready to finish the analysis of Algorithm 4 by proving Theorem 9.

Proof of Theorem 9. As k + z < n, we have

1 ≤ min_{C′, X′_out: |C′|=k, |X′_out|=z} ϕ(X \ X′_out, C′) ≤ n·∆².

For i = 0, we have 2^i = 1, and for i = ⌈log(n·∆²)⌉, we have 2^i ≥ n·∆². Thus, there exists some i ∈ {0, 1, …, ⌈log(n·∆²)⌉} such that in the i-th iteration we have OPT_guess ∈ [OPT, 2·OPT]. Applying Lemma 13 to this iteration then concludes the proof.
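The doubling search over OPT_guess in this proof is mechanical; the following sketch shows the outer loop in code. The inner routine and all names are placeholders for the paper's inner loop, not its actual interface.

    import math

    def guess_opt(X, k, z, eps, inner, max_cost):
        # try OPT_guess = 2^0, 2^1, ... up to an upper bound (e.g. n * Delta^2)
        # on the optimal cost; one of the guesses lands in [OPT, 2*OPT]
        candidates = []
        for i in range(math.ceil(math.log2(max_cost)) + 1):
            candidates.append(inner(X, k, z, eps, opt_guess=2.0 ** i))
        # keep the candidate whose cost after removing its outliers is smallest;
        # each candidate is assumed to be a (cost, centers, outliers) tuple
        return min(candidates, key=lambda sol: sol[0])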
D Using the Metropolis-Hastings algorithm to speed up Algorithm 2

Here we explain how to speed up Algorithm 2 by using the Metropolis-Hastings algorithm. This was first done in [BLHK16b, BLHK16a] for the classical k-means++ algorithm, but there it only leads to a rather weak additive error guarantee; this turns out not to be the case for k-means with outliers.

We alter every step of Algorithm 2 as follows. Instead of sampling a new point proportional to its τ_Θ-weight, we sample a new point from the following Markov chain. First, we sample a point x according to a proposal distribution q. Then, in each of the following T steps, we sample a new point y from q and set x ← y with probability min(1, q(x)π(y)/(q(y)π(x))), where π is the target distribution, so in our case the τ_Θ(·, C)/τ_Θ(X, C)-distribution. In other words, we define a Markov chain with transition probability

P(x, y) = q(y) · min(1, q(x)π(y)/(q(y)π(x))).

In [BLHK16b], a uniform proposal distribution q(x) = 1/n for all x ∈ X was chosen. The advantage of the uniform distribution is that one can sample from it easily, but only rather weak guarantees are known for the resulting algorithm in the context of speeding up k-means++. In [BLHK16a], q(x) = ϕ(x, c)/ϕ(X, c) for a random point c was chosen. This distribution can be precomputed in O(n) time, and the guarantee of the resulting algorithm matches the guarantee of the original k-means++ algorithm up to an additional additive error term δ·ϕ(X, µ(X)), if T = ˜O(1/δ) is chosen.
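A minimal sketch of the Metropolis-Hastings sampler just described, with a uniform proposal: with uniform q the proposal terms cancel, and the acceptance probability reduces to min(1, π(y)/π(x)). The helpers d2 and tau are as in the sketch after Lemma 6.

    import random

    def mh_sample(X, C, theta, T, rng=random):
        # approximate one sample from pi(x) = tau(x, C, theta) / tau(X, C, theta)
        x = rng.choice(X)                      # initial state from the proposal
        for _ in range(T):
            y = rng.choice(X)                  # uniform proposal q(y) = 1/n
            px, py = tau(x, C, theta), tau(y, C, theta)
            # acceptance probability min(1, q(x) pi(y) / (q(y) pi(x)));
            # the uniform q cancels, leaving min(1, py / px)
            if px == 0 or rng.random() < min(1.0, py / px):
                x = y
        return x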
Remark 1. One can see (but we will not prove it here) that the analysis of [BLHK16a] directly generalizes to τ_Θ-costs for each non-negative Θ. Choosing Θ = OPT/z, one obtains δ·τ_Θ(X, µ(X)) ≤ δ·nΘ = δn·OPT/z. Hence, the choice of δ = O(z/n) yields that the additional incurred error is of the same order as OPT. Hence, generalizing the result of [BLHK16a] and using Lemma 1, we obtain an (O(log k), O(log k))-approximation algorithm with time complexity of ˜O(n + k·(k/δ)) = ˜O(n + nk²/z), because we need to precompute the proposal distribution q in O(n) time, and sampling one point takes ˜O(1/δ) steps, each implementable in time O(k). Changing the proposal distribution to uniform would recover the same guarantees.

While in the case of classical k-means one needs to use the more complicated proposal distribution q(x) = ϕ(x, c)/ϕ(X, c) to get some guarantees, it turns out that for k-means with outliers, a uniform proposal suffices. To get Theorem 5, we start by proving that Algorithm 2 with ℓ = O(k) can be sped up with the Metropolis-Hastings algorithm with a uniform proposal.

Theorem 10.
A version of Algorithm 2 with ℓ = O(k), Θ = Θ(OPT/z), and where each sampling step is approximated by the Metropolis-Hastings algorithm running for T = O(n/z) steps with a uniform proposal distribution will run in time ˜O(nk²/z) and produce a solution that with positive constant probability satisfies τ_Θ(X*_in, C) = O(OPT).

To prove Theorem 10, we first recall a classical result about the Metropolis-Hastings algorithm [MRR+53, Liu96]. If the proposal distribution q and the target distribution π satisfy q(x) ≥ δπ(x) for all x and some δ > 0, then after T = O(1/δ) steps, the distribution of the Metropolis-Hastings chain dominates π/2.

Lemma 14 (one Metropolis-Hastings step [Liu96]). Let π denote the target distribution we want to approximate with Metropolis-Hastings, and let p_t = απ + (1 − α)p′_t for some distribution p′_t. Let q be a proposal distribution with q(x) ≥ δπ(x) for all x. Then, after one step of Metropolis-Hastings, we are in a distribution p_{t+1} which can be written as p_{t+1} = α′π + (1 − α′)p′_{t+1} for α′ = α + δ(1 − α).

Proof. For any fixed vertex x, given that x is the current vertex after t steps, the probability of moving from x to y is

P(x, y) = q(y) · min(1, q(x)π(y)/(q(y)π(x))).

Now, if q(x)π(y) ≥ q(y)π(x), we have

P(x, y) = q(y) ≥ δπ(y)

by the definition of δ. On the other hand, if q(x)π(y) < q(y)π(x), we have

P(x, y) = q(x)π(y)/π(x) ≥ q(x)π(y)/(q(x)/δ) = δπ(y).

So, for any x and any y we have P(x, y) ≥ δπ(y), and since π is a stationary distribution, for any x we have p_{t+1}(x) ≥ απ(x) + (1 − α)δπ(x), as needed.

We now prove Theorem 10.

Proof.
Fix one sampling step of Algorithm 2 and suppose that τ(X, C) ≥ 2·OPT, as otherwise τ_Θ(X*_in, C) ≤ τ(X, C) ≤ 2·OPT and there is nothing to show. For T = O(n/z), we will show that we sample from a distribution p_T such that for π(x) = τ(x, C)/τ(X, C) we have p_T(x) ≥ π(x)/2 for every x. Hence, we may interpret sampling from p_T as sampling from π with probability 1/2, which is enough for the analysis of Algorithm 2 after increasing ℓ by a constant factor. The total running time is O(k·(n/z)·k) = ˜O(nk²/z), as we need to sample O(k) points in total, O(n/z) iterations of the Metropolis-Hastings algorithm are necessary for each sampled point, and in each step we need to compute τ_Θ(x, C), which can be done in time proportional to the number of points already sampled.

We now prove that p_T(x) ≥ π(x)/2 for every x, provided that Θ ∈ [OPT/(2z), OPT/z]. First note that π(x) = τ(x, C)/τ(X, C) ≤ Θ/τ(X, C), so for δ = 2z/n we have for any x that

δπ(x) ≤ (2z/n)·(OPT/z)/τ(X, C) = q(x)·2·OPT/τ(X, C) ≤ q(x).

Hence, we may use Lemma 14 to conclude that after O(1/δ) = O(n/z) steps, p_T(x) ≥ π(x)/2 for every x, as needed.

We are now ready to prove Theorem 5, which we restate here for convenience.

Theorem 5.
There is an (O(1), O(1))-approximation algorithm for k-means with outliers that runs in time ˜O(nk·min(1, k/z)) and succeeds with positive constant probability.

Proof. If z = O(k log k) or z = O(log n), we can run Algorithm 4 with ε = 1 to obtain an (O(1), O(1))-approximation in time ˜O(nk) (Theorem 9). Otherwise, we run the Metropolis-Hastings based variant of Algorithm 2 from Theorem 10 with ℓ = O(k), which runs in ˜O(nk²/z) time, for the O(log n) values Θ_1 = 2/z, Θ_2 = 4/z, …, Θ_{⌈log(∆²n)⌉} = ∆²n/z. Hence, we get a sequence of sets of centers Y_1, Y_2, … with |Y_i| = O(k). Theorem 10 ensures that there exists one value ˜Θ with OPT/(2z) ≤ ˜Θ ≤ OPT/z among the values Θ_1, …, Θ_{⌈log(∆²n)⌉} such that the sped-up Algorithm 2 returns, with constant probability, a set ˜Y with τ_˜Θ(X*_in, ˜Y) = O(OPT).

In Theorem 11 we give an algorithm with running time ˜O(nk²/z) that turns the pair (˜Y, ˜Θ) into a set of cluster centers ˜C, |˜C| = k, such that ϕ^{−Az}(X, ˜C) = O(OPT) for some universal constant A, with positive constant probability. We apply the algorithm from Theorem 11 to all pairs (Y_i, Θ_i) and get a sequence of O(log n) sets C_1, C_2, … with |C_i| = k such that, by Theorem 11, with positive constant probability it contains a set ˜C with ϕ^{−Az}(X, ˜C) = O(OPT).

To conclude, we describe how in time ˜O(nk/z) we can get an estimator ζ_C of ϕ^{−Az}(X, C) for any set of k centers C such that the following two properties are satisfied:

ζ_C ≤ 4·ϕ^{−Az}(X, C) with probability 1 − 1/poly(n), (15)

and

ϕ^{−10Az}(X, C) ≤ 4·ζ_C with probability 1 − 1/poly(n). (16)

Then, we can simply estimate the cost of each set of centers C_i and return the one with the smallest estimated cost. With positive constant probability, there is at least one set of centers C_i with ϕ^{−Az}(X, C_i) = O(OPT). Hence, by Eq. (15), there also exists an estimator ζ_{C_i}, with positive constant probability, such that ζ_{C_i} = O(OPT). In that case, we choose a set of centers C_A with ϕ^{−10Az}(X, C_A) ≤ 4·ζ_{C_A} = O(OPT), by Eq. (16).

The estimator ζ_C for ϕ^{−Az}(X, C) is constructed as follows. We sample N = O((n/z)·log²n) points from X uniformly and independently at random with replacement. We denote the resulting multiset as X′. Then, we compute the value ϕ^{−4Az(N/n)}(X′, C) in time O(|X′|·k) = ˜O(nk/z) and set ζ_C = (n/N)·ϕ^{−4Az(N/n)}(X′, C).

We define X_in as the n − Az points closest to the set C and X_out = X \ X_in. Let us split the points of X_in into log(∆²) sets X_1, X_2, … such that x ∈ X_i if 2^i ≤ ϕ(x, C) < 2^{i+1}. We say a set X_i is big if |X_i| ≥ z/log(∆²) and small otherwise. The intuition here is that either a set X_i is big enough so that the number of sampled points from X_i is concentrated around its expectation, or the set is small with respect to the number of outliers z we consider. More precisely, for every big set X_i we can compute that E[|X_i ∩ X′|] ≥ (z/(n·log(∆²)))·Ω((n/z)·log²n) = Ω(log n), so by Theorem 7 we have

P(E[|X_i ∩ X′|]/2 ≤ |X_i ∩ X′| ≤ 2·E[|X_i ∩ X′|]) ≥ 1 − e^{−Θ(E[|X_i ∩ X′|])} = 1 − 1/poly(n). (17)

Similarly, we can compute

P(E[|X_out ∩ X′|]/2 ≤ |X_out ∩ X′| ≤ 2·E[|X_out ∩ X′|]) ≥ 1 − e^{−Θ(E[|X_out ∩ X′|])} = 1 − 1/poly(n). (18)

Furthermore, we have |∪_{i small} X_i| ≤ log(∆²)·z/log(∆²) = z.
As (z/n)·N = Ω(log n), we can conclude that X′ contains at most 2z·(N/n) points from ∪_{i small} X_i with probability 1 − 1/poly(n). We now condition on the events from Eq. (17) and Eq. (18), and we additionally condition on the event that X′ contains at most 2z·(N/n) points from ∪_{i small} X_i.

We first prove Eq. (15). By Eq. (18), we have |X_out ∩ X′| ≤ 2·Az·(N/n), so we can label the points in X_out ∩ X′ as outliers. Furthermore, we can additionally label all points in (∪_{i small} X_i) ∩ X′ as outliers, as |(∪_{i small} X_i) ∩ X′| ≤ 2z·(N/n); together this is at most 4Az·(N/n) labeled points, which is exactly the outlier budget in the definition of ζ_C. The number of sampled points from each big X_i is at most 2·(N/n)·|X_i| by Eq. (17), and the costs of points within X_i differ only by a factor of 2. Hence,

ζ_C = (n/N)·ϕ^{−4Az(N/n)}(X′, C) ≤ (n/N)·∑_{i big} 2·(N/n)·|X_i|·2^{i+1} ≤ 4·ϕ^{−Az}(X, C).

Next, we prove Eq. (16). Let z′_i be the number of outliers one needs to choose from X_i ∩ X′ to attain the cost ϕ^{−4Az(N/n)}(X′, C). Let us define a set of outliers X^A_out such that it contains the set X_out, all points from small sets X_i, and 2·z′_i·(n/N) arbitrary points from each big set X_i. Then,

|X^A_out| ≤ Az + z + 2·(4Az·(N/n))·(n/N) ≤ 10·Az.

To bound the cost ϕ(X \ X^A_out, C), we bound the contribution of each big set X_i. We have

(|X_i| − 2z′_i·(n/N))·2^{i+1} ≤ (2·(n/N)·|X′ ∩ X_i| − 2z′_i·(n/N))·2^{i+1}   (Eq. (17))
≤ 4·(n/N)·2^i·(|X′ ∩ X_i| − z′_i).

Summing up these contributions, we get ϕ^{−10Az}(X, C) ≤ 4·(n/N)·ϕ^{−4Az(N/n)}(X′, C) = 4·ζ_C, as desired.

For the purpose of getting a set of centers that provides an (O(1), O(1))-approximation in ˜O(nk²/z) time, we next provide a simple adaptation of the "a clustering of a clustering is a clustering" result of [GMM+03].
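The estimator just constructed is easy to state in code. The sketch below follows the construction above; the concrete constants (the sample size and the fraction of sampled points dropped as outliers) are illustrative stand-ins for the O(·) expressions in the proof, and the d2 helper is as in the sketch after Lemma 6.

    import math
    import random

    def estimate_cost(X, C, z, A, rng=random):
        # zeta_C: uniformly subsample N = O((n/z) log^2 n) points, drop the
        # most expensive ~4*A*z*(N/n) of them as outliers, rescale by n/N
        n = len(X)
        N = max(1, math.ceil((n / z) * math.log(n) ** 2))
        sample = [rng.choice(X) for _ in range(N)]          # with replacement
        costs = sorted(min(d2(x, c) for c in C) for x in sample)
        keep = N - math.ceil(4 * A * z * N / n)             # sample outlier budget
        return (n / N) * sum(costs[:max(keep, 0)])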
Theorem 11. Let X be an input set, OPT the optimal solution to the k-means with z outliers objective on X, and Θ such that OPT/(2z) ≤ Θ ≤ OPT/z. Suppose we have a set Y, |Y| = O(k), such that τ_Θ(X, Y) = O(OPT). Then, in time ˜O(nk²/z) we can compute a set C_A, |C_A| = k, such that τ_Θ(X, C_A) = O(OPT) with positive constant probability.

Proof. Let C* be the clustering of an optimal solution. Recall that τ_Θ(X, C*) ≤ OPT + zΘ = O(OPT). We subsample N = Θ((nk/z)·log n) points uniformly and independently with replacement from X, thus obtaining a multiset of points X′. For each point x ∈ X′ we compute its distance to Y. This is done in time ˜O(nk²/z). For y ∈ Y, we define B′(y) as the set of points x ∈ X′ for which y is the closest center, and B(y) as the set of points x ∈ X for which y is the closest center. Moreover, we define ŵ(y) = (n/N)·|B′(y)| and w(y) = |B(y)|. Note that E[ŵ(y)] = w(y). We denote by y_x the point y ∈ Y for which x ∈ B(y). For each y ∈ Y we either have w(y) < z/k, and then we call the respective set B(y) small, or w(y) ≥ z/k, and then the respective set B(y) is called big. For big sets we have E[|B′(y)|] ≥ (N/n)·z/k = Ω(log n). In this case, an application of Chernoff bounds (Theorem 7) gives that

w(y)/2 ≤ ŵ(y) ≤ 2·w(y) (19)

with probability 1 − e^{−Ω(E[|B′(y)|])} = 1 − 1/poly(n). We condition on this event happening for all y ∈ Y. We know that ∑_{y small} w(y) = O(z), so E[∑_{y small} ŵ(y)] = O(z), and by Markov's inequality ∑_{y small} ŵ(y) = O(z) with constant probability. Again, we condition on this event.

We now run some O(1)-approximate algorithm for k-means with penalties on the weighted instance (Y, ŵ) that runs in ˜O(n′k) time for inputs of size n′ (e.g., Algorithm 4 can be interpreted as such an algorithm by the easy part of the analysis captured by Lemma 9; the hard part of the analysis, Lemma 12, then concerns the guarantees of the algorithm for k-means with outliers). As our instance has n′ = ˜O(nk/z) points, the time complexity of the algorithm will be ˜O(nk²/z). The algorithm returns a set C_A of k centers that we return. We have

τ_Θ((Y, ŵ), C*) (20)
≤ ∑_{y small} ŵ(y)·Θ + ∑_{y big} ŵ(y)·τ_Θ(y, C*)
≤ O(z)·Θ + 2·∑_{y big} w(y)·τ_Θ(y, C*)   (Eq. (19))
≤ O(OPT) + 4·∑_{y big} ∑_{x ∈ B(y)} (τ_Θ(y, x) + τ_Θ(x, C*))   (Fact 2)
≤ O(OPT) + O(τ_Θ(X, Y)) + O(τ_Θ(X, C*)) = O(OPT).

Hence,

τ_Θ(X, C_A) = ∑_{x ∈ X} τ_Θ(x, C_A)
≤ 2·∑_{x ∈ X} (τ_Θ(x, y_x) + τ_Θ(y_x, C_A))   (Fact 2)
= 2·τ_Θ(X, Y) + 2·τ_Θ((Y, w), C_A)
≤ O(OPT) + 2·∑_{y small} w(y)·τ_Θ(y, C_A) + 2·∑_{y big} w(y)·τ_Θ(y, C_A)
≤ O(OPT) + O(z)·Θ + 4·∑_{y big} ŵ(y)·τ_Θ(y, C_A)   (Eq. (19))
= O(OPT) + O(τ_Θ((Y, ŵ), C*))   (C_A is an O(1)-approximation on (Y, ŵ))
= O(OPT),   (Eq. (20))

as needed.
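To illustrate the reduction in Theorem 11, the following sketch builds the subsampled weighted instance (Y, ŵ): sample N points, snap each to its nearest center of Y, and scale the multiplicities by n/N. The names are ours, and the solver subsequently run on the result would be the k-means-with-penalties subroutine from the proof; d2 is as in the sketch after Lemma 6.

    import math
    import random
    from collections import defaultdict

    def build_weighted_instance(X, Y, z, k, rng=random):
        # subsample N = Theta((nk/z) log n) points and weight each y in Y by
        # (n/N) * |B'(y)|, an unbiased estimator of the true blob size |B(y)|
        n = len(X)
        N = max(1, math.ceil((n * k / z) * math.log(n)))
        counts = defaultdict(int)
        for _ in range(N):
            x = rng.choice(X)
            y = min(Y, key=lambda c: d2(x, c))   # nearest center in Y
            counts[y] += 1
        return [(y, (n / N) * cnt) for y, cnt in counts.items()]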
Remark 2.
We believe, but do not prove, that by refining the above proofs and using the results from Appendices C and F, we can refine Theorem 5 to achieve an (O(1/poly(ε)), ε)-approximation algorithm with running time ˜O(nk²/(z·poly(ε))).

E Lower bounds
In the following, we consider algorithms for the k-means/k-median/k-center with outliers problems in the general metric space query model. The algorithm is only required to output a set of k centers and does not need to explicitly declare points as outliers.

Theorem 12 (Formal version of Theorem 6). Let A denote a (randomized) algorithm for the k-means/k-median/k-center problem with outliers with a query complexity of o(nk²/z). Then, for large enough values of k, z, and n such that 10000·k·log k ≤ z ≤ n/10000, there exists a concrete input instance for the k-means/k-median/k-center problem with parameters k, z, and n such that A outputs an (α, β)-approximation with probability less than 0.7 for every α, β ∈ O(1).

If one does not assume that the ratio between the maximum and minimum distance is bounded, one can show that the statement even holds for an arbitrary multiplicative factor α. For simplicity, we assume in the proof that β = 2, but the proof works for an arbitrary constant β.

Proof.
We roughly follow the proof from [Met02]. Applying Yao's principle [Yao], it suffices to provide a distribution over metric spaces such that any deterministic algorithm needs Ω(nk²/z) queries to succeed with probability 0.7. Let k, z, and n be given such that 10000·k·log k ≤ z ≤ n/10000, and assume that k is sufficiently large. Let C = {c_1, …, c_k} be a set of points such that the distance between any two points in C is 1. Next, we define a probability distribution D over C. Our random instance then consists of a multiset of n points X = {x_1, …, x_n}, with each point sampled independently according to D. More precisely, we call S = {c_1, …, c_{k/2}} the set of small points and B = {c_{k/2+1}, …, c_k} the set of big points (for simplicity, we assume that k is even). The probability of sampling each small point is set to 100z/(n·(k/2)) and the probability of sampling each big point is set to (n − 100z)/(n·(k/2)). Note that

(k/2)·100z/(n·(k/2)) + (k/2)·(n − 100z)/(n·(k/2)) = 1.

Note that it can (and will) happen that we sample the same point twice, but the algorithm does not know it unless it asks the oracle M, which for a query (x, y) returns the distance d(x, y) of the two points in the constructed metric space. One could perturb this instance slightly in order to get an instance with a lower bound on the minimum distance between two points. We refer to the multiset of points that are clones of c_j as a small cluster if j ∈ [k/2] and as a big cluster otherwise. We denote by E_{i,j} the event that the i-th point is a clone of c_j.

We consider an arbitrary deterministic algorithm A that asks at most T = nk²/(100000·z) queries to the distance oracle M. We refer to a query as informative if the distance between the two queried points is 0 and uninformative otherwise. We first start by proving a helper lemma.

Lemma 15. Assume that among the first t queries of A, at most 0.1k queries involved the i-th point and all of those queries were uninformative. Then, conditioned on the first t queries and the answers to those queries, the probability that the i-th point belongs to some arbitrary fixed small cluster with index j_small ∈ [k/2] is at most 500z/(kn). Moreover, the statement still holds if we additionally condition on an arbitrary cluster assignment of all points, except the i-th one, that is consistent with the oracle answers to the first t queries.

Proof. Let Q_t denote the event corresponding to the observed interaction between A and the distance oracle restricted to the first t queries. To prove the lemma, we will not only condition on Q_t, but we additionally fix the randomness of all points except the i-th one in an arbitrary way. We denote the corresponding event by R_{−i}, and we assume that P[Q_t ∩ R_{−i}] > 0. Among the first t queries, the i-th point was queried at most 0.1k times, and all queries were uninformative. Hence, conditioned on Q_t and R_{−i}, there are at least 0.5k − 0.1k = 0.4k big clusters that the i-th point could still be contained in. Let j_big denote an arbitrary index of one of those big clusters. Using Bayes' rule, we have

P[E_{i,j_small} | Q_t ∩ R_{−i}] ≤ P[E_{i,j_small} | Q_t ∩ R_{−i}] / (0.4k · P[E_{i,j_big} | Q_t ∩ R_{−i}])
= P[Q_t ∩ R_{−i} | E_{i,j_small}]·P[E_{i,j_small}] / (0.4k · P[Q_t ∩ R_{−i} | E_{i,j_big}]·P[E_{i,j_big}])
= P[R_{−i} | E_{i,j_small}]·P[E_{i,j_small}] / (0.4k · P[R_{−i} | E_{i,j_big}]·P[E_{i,j_big}])
= P[E_{i,j_small}] / (0.4k · P[E_{i,j_big}])
= 250z/(k·(n − 100z)) ≤ 500z/(kn).

Let C_A denote the set of centers that A outputs. To simplify the notation, we assume that C_A is a set of k indices between 1 and n instead of k points. We define C_{A,small} ⊆ C_A as the set of indices corresponding to points in small clusters. The main ingredient of our lower bound argument is to show that E[|C_{A,small}|] is small, namely at most 0.1k. To that end, we partition C_{A,small} = C_1 ⊔ C_2 ⊔ C_3 into three different sets:

C_1 := {i ∈ C_{A,small} : the i-th point was involved in less than 0.1k queries, all being uninformative},
C_2 := {i ∈ C_{A,small} : the i-th point was involved in at least 0.1k queries, the first 0.1k of those being uninformative},
C_3 := {i ∈ C_{A,small} : among the first 0.1k queries the i-th point was involved in, at least one was informative}.

We prove that the expected size of all three sets is small. We start by bounding E[|C_1|]. To that end, let Q denote the event corresponding to the complete observed interaction between A and the oracle. As A is deterministic, conditioned on Q, the set C_A is completely determined. Consider some arbitrary i ∈ C_A such that the i-th point was involved in at most 0.1k queries, all being uninformative. Lemma 15 together with a union bound over all small clusters implies that

P[the i-th point belongs to a small cluster | Q] ≤ (k/2)·500z/(kn) = 250z/n. (21)

Hence, using linearity of expectation, we obtain E[|C_1|] ≤ k·250z/n ≤ 0.025k.

Next, we bound the expected size of C_2. Note that there can be at most

2·(nk²/(100000·z)) / (0.1k) = nk/(5000·z)

points that are involved in at least 0.1k queries, the first 0.1k of those being uninformative. Among those points, the expected fraction of points belonging to small clusters is at most 250z/n. This is again a consequence of Eq. (21), as at the moment the i-th point was queried exactly 0.1k times, all queries being uninformative, the probability that the point belongs to a small cluster is at most 250z/n. Hence, we can conclude that E[|C_2|] ≤ (250z/n)·(nk/(5000·z)) = 0.05k.

Thus, it remains to bound E[|C_3|]. To that end, for t ∈ [T], let X_t denote the indicator variable for the following event: the t-th query of A is informative, involves a point that was queried less than 0.1k times before with all those previous queries uninformative, and reveals that this point belongs to a small cluster. Again, we use Lemma 15 to obtain that E[X_t] ≤ 500z/(kn). As |C_3| ≤ 2·∑_{t ∈ [T]} X_t, linearity of expectation implies that

E[|C_3|] ≤ 2·(nk²/(100000·z))·(500z/(kn)) = 0.01k.

Putting things together, we obtain that E[|C_{A,small}|] ≤ 0.1k. By Markov's inequality, this implies that P(|C_{A,small}| ≥ 0.3k) ≤ 1/3. As the expected size of each small cluster is 200z/k ≥ 200·log k, a Chernoff bound followed by a union bound implies that each small cluster contains at least 100z/k points with probability at least 0.9. Hence, with probability at least 0.9 − 1/3 > 0.5, every small cluster contains at least 100z/k points and A outputs at most 0.3k centers belonging to small clusters. As (0.5k − 0.3k)·(100z/k) = 20z > 2z, this implies that the clustering cost with respect to C_A is not zero even after removing 2z outliers, unlike the clustering cost of an optimal solution. This finishes the proof.
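For concreteness, here is a small generator for the hard distribution used in this proof. The clusters are represented only by their indices, since all that matters for the oracle is whether two points are clones of the same c_j; the helper names are ours.

    import random

    def sample_hard_instance(n, k, z, rng=random):
        # each point is independently a clone of some c_j: the k/2 small
        # indices have probability 100z/(n*k/2) each, the k/2 big indices
        # share the remaining mass (requires z <= n/10000 as in the proof)
        half = k // 2
        p_small = 100 * z / (n * half)
        p_big = (n - 100 * z) / (n * half)
        probs = [p_small] * half + [p_big] * (k - half)
        return rng.choices(range(k), weights=probs, k=n)

    def oracle(instance, i, j):
        # distance query: 0 if both points are clones of the same center, else 1
        return 0 if instance[i] == instance[j] else 1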
Remark 3. Note that the same construction can be used to give an Ω(k²/ε) lower bound for k-means algorithms that output a solution C with ϕ(X, C) = O(ϕ(X, C*) + ε·ϕ(X, µ(X))). This shows that the complexity of the AFK-MC² algorithm [BLHK16a] is essentially the best possible. We omit the details.

F Distributed algorithms
In this section, we show how to adapt the classical streaming algorithm of [GMM+03] and the popular k-means|| to the setting with outliers by extension of Theorem 8. In Section F.1, we show two ways of, roughly speaking, distributing Theorem 8. Then, in Section F.2, we show how the resulting weighted set can be used to get distributed algorithms that output (1 + ε)z outliers and achieve an O(1/ε)- or even O(1)-approximation.

The most natural distributed model for the following algorithms is the coordinator model. In this model, the data X = X_1 ⊔ X_2 ⊔ ⋯ ⊔ X_m is evenly split across m machines. In a 1-round distributed algorithm such as Algorithm 5, the machines perform some computation on their data and send the result to the coordinator, who computes the final clustering. In a t-round distributed algorithm such as Algorithm 6, there are t rounds of communication between the machines and the coordinator.

F.1 Distributed algorithms with bicriteria guarantee
In this section, we start by presenting Algorithm 5, a simple variant of the algorithm of [GMM+03].

Algorithm 5
Overseeding from Guha et al. [GMM+03]

Require: data X, threshold OPT/(2εz) ≤ Θ ≤ OPT/(εz)
1: Split X arbitrarily into sets X_1, X_2, …, X_m.
2: for j ← 1, 2, …, m in parallel do
3:   Y_j ← Algorithm 2 with X_A = X_j, ℓ_A = ˜O(k/ε), and Θ_A = Θ.
4:   For each y ∈ Y_j, let w(y) be the number of points x ∈ X_j with τ_Θ(x, Y_j) < Θ such that y = argmin_{y′ ∈ Y_j} τ_Θ(x, y′), with ties handled arbitrarily.
5:   Let X_out,j be the set of points x ∈ X_j with τ_Θ(x, Y_j) = Θ.
6: end for
7: Each site j sends (Y_j, w) and |X_out,j| to the coordinator, who computes the weighted set (Y, w) = ∪_{j=1}^m (Y_j, w) and the cardinality of X_out = ∪_j X_out,j as |X_out| = ∑_j |X_out,j|.

Here, by a weighted set (Y, w) we mean a set Y = {y_1, y_2, …, y_{|Y|}} together with a weighting function w: {1, 2, …, |Y|} → ℕ. Set operations can be naturally extended to the weighted setting, and in the next subsection we observe that sampling-based k-means algorithms extend to it naturally as well.
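A compact sketch of Algorithm 5's per-site work and the message each site sends. The overseed parameter stands in for Algorithm 2, which the paper specifies separately; d2 is as in the sketch after Lemma 6, and for simplicity the merge assumes the sites produce distinct centers.

    def site_summary(X_j, theta, overseed):
        # one machine's computation in Algorithm 5: overseed locally, weight
        # each local center by its blob size, and count local outliers
        Y_j = overseed(X_j, theta)            # placeholder for Algorithm 2
        w = {y: 0 for y in Y_j}
        n_out_j = 0
        for x in X_j:
            y = min(Y_j, key=lambda c: d2(x, c))
            if d2(x, y) < theta:
                w[y] += 1                     # x joins the blob of y
            else:
                n_out_j += 1                  # x is a local outlier
        return w, n_out_j                     # message sent to the coordinator

    def coordinator_merge(messages):
        # the coordinator unions the weighted sets and adds up the |X_out,j|
        Y_w, n_out = {}, 0
        for w, n_out_j in messages:
            Y_w.update(w)
            n_out += n_out_j
        return Y_w, n_out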
Theorem 13. Let OPT/(2εz) ≤ Θ ≤ OPT/(εz). Suppose we run Algorithm 5 on m machines with ℓ = O((k/ε)·log(m/δ)), and let Y := ∪_j Y_j. Then, with probability 1 − δ, we have

∑_{j=1}^m τ_Θ(X_j ∩ X*_in, Y_j) = O(OPT).

Proof.
Let z_j = |X_j ∩ X*_out|, and recall that C* is the optimal solution for the whole instance X. By Theorem 8 we have, with probability at least 1 − m·(δ/m) = 1 − δ, that for every j

τ_Θ(X_j ∩ X*_in, Y_j) = O(inf_{C: |C|=k} τ_Θ(X_j ∩ X*_in, C) + εz_jΘ) = O(τ_Θ(X_j ∩ X*_in, C*) + εz_jΘ).

Summing over j yields

∑_{j=1}^m τ_Θ(X_j ∩ X*_in, Y_j) = ∑_{j=1}^m O(τ_Θ(X_j ∩ X*_in, C*) + εz_jΘ) = O(τ_Θ(X*_in, C*) + εzΘ) = O(OPT).

We leave the description of the algorithm run by the coordinator to Section F.2 and now show a different way of getting the weighted set (Y, w) with essentially the same guarantees, by adapting the k-means|| algorithm.

The k-means|| algorithm. We now show a simple adaptation of the k-means|| algorithm [BMV+12], which was introduced as a distributed variant of k-means++. In k-means||, the goal is to get ˜O(k) centers in few distributed rounds so as to achieve a constant approximation guarantee. This is done by changing Algorithm 2 as follows: in each round we sample roughly k points, instead of just one point, proportional to the cost, but the algorithm is run only for O(log n) rounds. Our adaptation of the k-means|| algorithm to outliers is below.

Algorithm 6 k-means|| overseeding

Require: data X, number of rounds t, sampling factor ℓ, threshold Θ
1: Uniformly sample x ∈ X and set Y = {x}.
2: for i ← 1, 2, …, t do
3:   Y′ ← ∅
4:   for x ∈ X do
5:     Add x to Y′ with probability min(1, ℓ·τ_Θ(x, Y)/τ_Θ(X, Y)).
6:   end for
7:   Y ← Y ∪ Y′
8: end for
9: For each y ∈ Y, let w(y) be the number of points x ∈ X with τ_Θ(x, Y) < Θ and such that y = argmin_{y′ ∈ Y} τ_Θ(x, y′), with ties handled arbitrarily.
10: Let X_out be the set of points x with τ_Θ(x, Y) = Θ.
11: Return (Y, w), |X_out|

We now prove an analogue of Theorem 13. Again, it can be seen as an extension of our basic result, Theorem 8.
Theorem 14.
Let Θ ∈ [OPT/(2εz), OPT/(εz)] be arbitrary. Suppose we run Algorithm 6 for t = O(log(n/ε)) rounds with ℓ = O((k/ε)·log n) to obtain output Y. Then, with probability at least 1 − 1/poly(n), we have that τ_Θ(X*_in, Y) = O(OPT). Our analysis is inspired by the simple analysis of k-means|| by Rozhoň [Roz20].
Let Y ⊆ Y ⊆ · · · ⊆ Y t be the cluster centers that are gradually builtby Algorithm 6. We have Y = { y } for a point y picked uniformly at random.Whatever point we picked, we have τ ( X, Y ) ≤ n Θ = poly( n/ε ), as we assume∆ = poly( n ). We call a cluster X ∗ i unsettled in iteration j if τ ( X ∗ i , Y j ) > τ ( X ∗ i , C ∗ ). We define τ U ( X, Y j ) as τ U ( Y, Y j ) = (cid:88) i , X ∗ i unsettled in iter. j τ ( X ∗ i , Y j ) . We will now prove that with probability 1 − / poly( n ) we have τ U ( X, Y j +1 ) ≤ τ U ( X, Y j ) + OPT . (22)To prove Eq. (22), fix an iteration j . If τ U ( X, Y j ) ≤ OPT, the claim certainlyholds, so assume the opposite. We split unsettled clusters into two groups: acluster X ∗ i is called heavy if τ ( X ∗ i , Y j ) ≥ τ U ( X, Y j ) / (2 k ) and light otherwise.The probability that we make a heavy cluster X ∗ i settled can be bounded asfollows. Sampling a random point from X ∗ i proportional to τ ( · , Y j ) makes X ∗ i settled with probability at least 1 / Z i ⊆ X ∗ i with τ ( Z i , Y j ) ≥ τ ( X ∗ i , Y j ) / Z i makes X ∗ i settled. If Z i contains a point x with τ ( x, Y j ) ≥ τ ( X,Y j ) (cid:96) , we sample x withprobability 1 and this also makes X ∗ i settled. Otherwise,P( X ∗ i does not get settled) ≤ (cid:89) x ∈ Z i (1 − (cid:96) · τ ( x, Y j ) /τ ( X, Y j )) ≤ exp( − O ( k/ε log n ) (cid:88) x ∈ Z i τ ( x, Y j ) /τ ( X, Y j )) 1 + x ≤ e x ≤ exp( − O ( k (log n ) /ε ) · τ ( X ∗ i , Y j ) /τ ( X, Y j )) ≤ exp( − O ((log n ) /ε ) · τ U ( X, Y j ) /τ ( X, Y j )) X ∗ i is heavy ≤ exp (cid:18) − O ((log n ) /ε ) · τ U ( X, Y j ) τ U ( X, Y j ) + 10OPT + z Θ (cid:19) each point is unsettled, settled, or an outlier ≤ exp (cid:18) − O ((log n ) /ε ) ·
110 OPT11OPT + OPT /ε (cid:19) τ U ( X, Y j ) ≥ OPT= 1 / poly( n ) . Hence, every heavy cluster X ∗ i gets settled with probability 1 − / poly( n ). Wenow condition on that event. 49ote that (cid:80) X ∗ i light τ ( X ∗ i , Y j ) ≤ k · τ U ( X, Y j ) / (2 k ) = τ U ( X, Y j ) /
2. Hence,we get τ U ( X, Y j ) − τ U ( X, Y j +1 ) ≥ (cid:88) X ∗ i heavy τ ( X ∗ i , Y j ) ≥ τ U ( X, Y j ) / t = O (log( n/ε )) times together with τ ( X, Y ) ≤ poly( n/ε ) to conclude that τ ( X, Y t ) ≤ (1 / t poly( n/ε ) + OPT · t (cid:88) j =1 (1 / j = O (OPT) , with probability 1 − / poly( n ). F.2 Constructing final clustering
In this subsection, we show how the output of Algorithms 5 and 6, that is, aweighted set of centers (
Y, w ), together with the number of found outliers | X out | ,can be used to get ( O (1) , ε )-approximation algorithms, via Theorems 13and 14. First, in Theorem 15, we only show how to get( O (1 /ε ) , ε )-approximation guarantee by simply running Algorithm 4 on theweighted data with the parameter of number of outliers set to roughly z − | X out | . The advantage of Theorem 15 is its simplicity (we implement avariant of it in Section 7) and speed.Next, in Theorem 16 we refine the reduction to get ( O (1) , ε )-approximationguarantee. This result is interesting from the theoretical perspective, as weexplained in Section 5. As a subroutine, the coordinator needs to run some( O (1) , ε )-approximation algorithm, i.e., linear programming based algorithms[Che08, KLS18], so the resulting distributed algorithm seems not practical. Also,we need to somewhat refine Algorithms 5 and 6 so that they pass somewhatmore information than just ( Y, w ) and | X out | . Theorem 15 (Adaptation of [GMM + . Let ε ∈ [0 , be arbitrary and Θ besuch that OPT / (2 εz ) ≤ Θ ≤ OPT / ( εz ) . Let ( Y, w ) denote the weighted point setthat Algorithm 6 and Algorithm 5 compute, respectively. Furthermore, assumethat τ Θ ( X ∗ in , Y ) ≤ α OPT in case of Algorithm 6, and (cid:80) j τ Θ ( X j ∩ X ∗ in , Y j ) ≤ α OPT , Y = (cid:83) Y j , in case of Algorithm , respectively. Now, suppose we thenrun a ( β, ε ) -approximation algorithm A for the weighted k -means with outliersformulation on the weighted instance ( Y, w ) with z A := z − | X out | + 2 αεz outliersand let C denote the resulting set of k centers. Then, ϕ − (1+5 αε ) z ( X, C ) = O (( αβ + 1 /ε ) OPT ) . Here, we define the weighted k -means with outliers as the following problem:for the weighted input ( Y, w ), where each w ( y ) is an integer, we want to choosea set of k centers C and a weight vector w in ( y ) for any y such that w in ( y ) is a50on-negative integer, for w out ( y ) = w ( y ) − w in ( y ) we have (cid:80) y ∈ Y w in ( y ) ≤ z A and the sum (cid:80) y ∈ Y w in ( y ) ϕ ( y, C ) is minimized.Note that algorithms for k -means with outliers can be seen as algorithmsfor weighted k -means with outliers by viewing weighted points ( y, w ( y )) as w ( y )points on the same location (our assumption that pairwise distances of inputpoints are at least one, is in our case only for simplicity of exposition). Moreover,sampling based algorithms such as Algorithm 3 can be implemented in timeproportional not to (cid:80) y w ( y ), but | Y | = ˜ O ( k ) by additionally weighting thesampling distribution by the weight of input points.Plugging in our Algorithm 3 as A in Theorem 15 hence yields a distributedalgorithm with approximation factor O ( αβ + 1 /ε ) = O (1 /ε + 1 /ε ) = O (1 /ε )that outputs (1 + O ( ε )) z outliers. Moreover, the computational complexity ofthe coordinator is ˜ O ( k /ε ) if we use Algorithm 6 and ˜ O ( mk /ε ) if we useAlgorithm 5. Proof.
We start by showing that | X ∗ in ∩ X out | ≤ αεz . That is, the number ofpoints that Algorithm 5 and Algorithm 6, respectively, declare as outliers in thefirst phase of the algorithm, but that are actually true inliers with respect to agiven optimal solution is small. For Algorithm 6, we have | X ∗ in ∩ X out | ≤ τ Θ ( X ∗ in , Y )Θ ≤ α OPTOPT / (2 εz ) = 2 αεz, and for Algorithm 5, we have | X ∗ in ∩ X out | ≤ (cid:88) j τ Θ ( X j ∩ X ∗ in , Y j )Θ ≤ α OPTOPT / (2 εz ) = 2 αεz, as desired. In particular, | X out | = | X out ∩ X ∗ out | + | X out ∩ X ∗ in | ≤ z + 2 αεz .Hence, z A = z − | X out | + 2 αεz ≥ A is run on a legalinstance. Let X in := X \ X out . Notice that for Algorithm 6, we have ϕ ( X ∗ in ∩ X in , Y ) = τ Θ ( X ∗ in ∩ X in ) ≤ α OPT , (23)and for Algorithm 5, we have (cid:88) j ϕ ( X ∗ in ∩ X in ∩ X j , Y j ) = (cid:88) j τ Θ ( X ∗ in ∩ X in ∩ X j , Y j ) ≤ α OPT . (24)Let us call each set of points B ( y ) ⊆ X that Algorithm 5 or Algorithm 6groups around y ∈ Y a blob and for a given x ∈ X in , we denote with y x the pointwith x ∈ B ( y x ). We define (as the respective algorithm also does) w ( y ) = | B ( y ) | .Moreover, we define w ∗ in ( y ) = | B ( y ) ∩ X ∗ in | and w ∗ out ( y ) = | B ( y ) ∩ X ∗ out | .Note that splitting the weights w into w ∗ in and w ∗ out gives us a natural upperbound on the cost of the optimal solution on the instance Y with z A outliers:51n one hand, we have (cid:88) y ∈ Y w ∗ out ( y ) = | X ∗ out ∩ X in | (25)= | X ∗ out | − | X ∗ out ∩ X out | = z − | X out | + | X out ∩ X ∗ in |≤ z − | X out | + 2 αεz = z A . Thus, if we label the weighted set (
Y, w ∗ out ) as outliers, at most z A points arelabeled as outliers. On the other hand, we will now bound ϕ (( Y, w ∗ in ) , C ∗ ), where C ∗ is the optimum clustering for the original problem defined on X .For any y ∈ Y , x ∈ X ∗ in ∩ B ( y ), and c ∗ x = arg min c ∗ ∈ C ∗ ϕ ( x, c ∗ ) we have ϕ ( y, C ∗ ) ≤ ϕ ( y, c ∗ x ) ≤ ϕ ( y, x ) + 2 ϕ ( x, c ∗ x ) , (26)where the second inequality is due to Fact 2. Hence, ϕ (( Y, w ∗ in ) , C ∗ ) = (cid:88) y ∈ Y w ∗ in ( y ) · ϕ ( y, C ∗ ) (27)= (cid:88) y ∈ Y (cid:88) x ∈ X ∗ in ∩ B ( y ) ϕ ( y, C ∗ ) ≤ (cid:88) y ∈ Y (cid:88) x ∈ X ∗ in ∩ B ( y ) ϕ ( y, x ) + 2 ϕ ( x, c ∗ x ) Eq. (26) ≤ (cid:88) x ∈ X ∗ in ∩ X in ϕ ( y x , x ) + 2 (cid:88) x ∈ X ∗ in ϕ ( x, c ∗ x ) ≤ α OPT + 2OPT Eqs. (23) and (24)= O ( α OPT) . Now we consider the output C A and w A out of A , i.e., for each y ∈ Y thealgorithm A decides which integer weight w A out of y is labeled as outlier. Define w A in ( y ) = w ( y ) − w A out ( y ). This solution of instance ( Y, w ) with (cid:80) y ∈ Y w A out ( y )outliers naturally defines a solution of the original instance X as follows. Weleave the set of centers C A the same and whenever w A out ( y ) is nonzero, we label w A out ( y ) arbitrary points in B ( y ) as outliers and call this set X A out (note that thealgorithm itself does not need to explicitly label the outliers, it suffices to provethat there is a labelling). Then we define points from X out ∪ X A out as outliers.The number of points we label as outliers is bounded by | X out | + (cid:88) y ∈ Y w A out ( y ) ≤ | X out | + (1 + ε ) z A (28)= | X out | + (1 + ε )( z − | X out | + 2 αεz ) ≤ (1 + 5 αε ) z X \ ( X out ∪ X A out ) with respect to C A . We have ϕ ( X \ ( X out ∪ X A out ) , C A )= (cid:88) y ∈ Y (cid:88) x ∈ B ( y ) \ X A out ϕ ( x, C A ) ≤ (cid:88) y ∈ Y (cid:88) x ∈ B ( y ) ϕ ( x, y x ) + 2 (cid:88) y ∈ Y (cid:88) x ∈ B ( y ) \ X Aout ϕ ( y x , C A ) Fact 2= 2 (cid:88) y ∈ Y (cid:88) x ∈ B ( y ) ∩ X ∗ in ϕ ( x, y x ) + 2 (cid:88) y ∈ Y (cid:88) x ∈ B ( y ) ∩ X ∗ out ϕ ( x, y x )+ 2 (cid:88) y ∈ Y (cid:88) x ∈ B ( y ) \ X Aout ϕ ( y x , C A ) ≤ α OPT + 2 ϕ ( X in ∩ X ∗ out , Y ) + 2 (cid:88) y ∈ Y w A in ϕ ( y, C A ) Eq. (23), Eq. (24) ≤ α OPT + 2 z Θ + 2 (cid:88) y ∈ Y w A in ϕ ( y, C A ) | X ∗ out | = z, x ∈ X in ⇒ ϕ ( x, Y ) ≤ Θ ≤ α OPT + 2(1 /ε )OPT + 2 βϕ (( Y, w ∗ in ) , C ∗ ) A is β approximation ≤ O ((1 /ε + αβ )OPT) . Eq. (27)
Better construction
Here we sketch a somewhat more elaborate constructionthat enables us to lose only a constant factor in approximation, instead of O (1 /ε ).Below, we assume existence of a ( β, ε )-approximation algorithm that works inwhat we call an almost-metric space , which is defined as a classical metric space,but without the axiom d ( x, y ) = 0 ⇔ x = y . That is, even when an algorithmchooses some point x as a cluster center, the clustering cost of x might still benon-zero. Our algorithms work in this more general setting and we believe thatthe algorithm of [BERS19] too. In any case, note that the problem of k -meanswith outliers in almost metric space can be reduced to the same problem inmetric space by splitting each vertex x in 2 k clones with x , . . . , x k and defining d (cid:48) ( x i , x j ) = d ( x, x ). This increases the number of points by a 2 k -factor andwe lose only a factor of 2 in our approximation guarantee. As in previousconstruction, we moreover assume a weighted version of the problem, when forinteger weight w ( x ) one is allowed to output an integer 0 ≤ w out ( x ) ≤ w ( x ) andthen only w ( x ) − w out ( x ) weight is used in the computation of the cost. Theorem 16.
Let ε ∈ (0 , be arbitrary. Suppose that there is a ( β, ε ) -approximation algorithm A that solves the weighted variant of k -means withoutliers in an almost-metric space in polynomial time. Then, there is a distributedpolynomial-time ( O ( β ) , O ( ε )) -approximation algorithm in the coordinatormodel with communication ˜ O ( k/ε ) per site that succeeds with positive constantprobability. roof. Assume that Θ satisfies
OPT2 εz ≤ Θ ≤ OPT εz . We later discuss how to”guess” OPT in this distributed setting. The algorithm needed is a variantAlgorithm 5, so we only describe the change that needs to be done. When thevariant of Algorithm 5 aggregates all points to the closest center in Y j , it will notjust compute for each y its weight w ( y ), i.e., for how many points in X j \ X out the point y is the closest, but it iterates over all these points x , x , . . . , x w ( y ) and rounds the distances d ( x i , y ) down to the closest power of two. Then, theadditional information sent to the leader about y is not just w ( y ), but also thelist w ( y ) , w ( y ) , w ( y ) , . . . , where w k ( y ) is the number of points x ∈ B ( y ) withdistance d ( x, y ) rounded down to 2 k . For any y ∈ Y j and k , we define B k ( y ) asthe set of points in X j \ X out for which y is the closest point in Y j and d ( x, y ) isrounded down to 2 k . As before, for a given x ∈ X in := X \ X out , y x denotes thepoint y for which x ∈ B ( y ).The instance ( Y (cid:48) , w (cid:48) ) the leader is going to construct will not be just ( Y, w ),but the leader creates a new almost-metric space M (cid:48) that includes the originalinput metric space M that is only defined for points of Y together with theoriginal metric, but moreover, for each y and 0 ≤ k ≤ log ∆, it contains a point y k with d M (cid:48) ( y, y k ) = 2 k . Now, imagine a weighted graph with the set of verticesbeing equal to the points in the almost-metric space and there is an edge betweentwo points if we explicitly stated the distance between the two points and theweight of the edge is equal to that distance. Then, the distance between eachpair of vertices is just equal to the length of the shortest path between thetwo vertices in the weighted graph. In fact, we also define d M (cid:48) ( y k , y k ) = 2 · k (which is not possible in a metric space) and only then the leader runs the( β, ε )-approximation algorithm A in M (cid:48) . We note that the defined distancessatisfy the triangle inequality.As in Theorem 15, we assume that (cid:80) j τ Θ ( X j ∩ X ∗ in , Y j ) ≤ α OPT. We run A on M (cid:48) (we will use ϕ (cid:48) and d (cid:48) when talking about (squared) distances in M (cid:48) ) withthe number of outliers being set to z A = z − | X out | + 2 αεz and get as output aset of centers C A and a set of outliers ( Y, w A out ) with (cid:80) y ∈ Y w A out ( y ) ≤ (1 + ε ) z A .As in the proof of Theorem 15, we would now like to argue that the optimalsolution in the almost-metric space has a cost of at most O ( α OPT) by consideringthe optimal set of centers C ∗ for the original problem. However, there is onesmall technical issue. Namely, the points in C ∗ might not be contained inthe almost-metric space. To remedy this situation, we do the following: Wenaturally extend the almost-metric space to also include all the points in C ∗ andprove that the defined distances still satisfy the triangle inequality. Next, weshow that in this extended metric space, the cost of the optimal solution (withoutliers defined appropriately) has a cost of O ( α OPT). By using the uniformsampling lemma (Lemma 3), this implies that there also exists a solution in theoriginal almost-metric space (without the points in C ∗ ) with a cost of at most4 · O ( α OPT) = O ( α OPT), as desired.We define w ∗ in ( y k ) = | B k ( y ) ∩ X ∗ in | and w ∗ out ( y k ) = | B k ( y ) ∩ X ∗ out | . We startby extending the almost-metric space to points in C ∗ . 
We define w*_in(y_k) = |B_k(y) ∩ X*_in| and w*_out(y_k) = |B_k(y) ∩ X*_out|, and we start by extending the almost-metric space to the points of C*. For each c* ∈ C* and y ∈ Y, we define d_{M′}(y, c*) = d(y, c*); that is, the distance between points of Y ∪ C* is simply the original distance. The distance between c* ∈ C* and y_k is defined as d_{M′}(c*, y_k) = d_{M′}(c*, y) + d_{M′}(y, y_k) = d_{M′}(c*, y) + 2^k. This extension still satisfies the triangle inequality, which follows more or less directly from the fact that the original metric satisfies the triangle inequality.

Next, we define a solution for the weighted k-means with z_A outliers objective in this extended almost-metric space whose cost gives an upper bound of O(α · OPT) on the cost of the optimal solution. We consider the optimal set of centers C* together with the set of outliers X*_out. Recall that w*_out(y_k) = |B_k(y) ∩ X*_out| and note that w*_in(y_k) = w_k(y) − w*_out(y_k). Splitting the weights w_k(y) into w*_in(y_k) and w*_out(y_k), together with the set of centers C*, gives a natural upper bound (up to a factor of 4, as mentioned above) on the cost of the optimal solution of the instance (Y′, w′) with z_A outliers. First, one needs to verify that the number of declared outliers Σ_{y ∈ Y} Σ_k w*_out(y_k) in this solution is at most z_A; this follows in the exact same way as in the proof of Theorem 15 (cf. Eq. (25)). Thus, it remains to bound ϕ′((Y′, w*_in), C*).

Let x ∈ B_k(y) be arbitrary for some k and y ∈ Y, and let c*_x := arg min_{c* ∈ C*} ϕ(x, c*). We have

    ϕ′(y_k, C*) ≤ 2 ϕ′(y_k, y) + 2 ϕ′(y, C*)                         (29)
                = 2 ϕ′(y_k, y) + 2 ϕ(y, C*)
                ≤ 2 ϕ′(y_k, y) + 4 ϕ(y, x) + 4 ϕ(x, c*_x)
                ≤ 6 ϕ(y, x) + 4 ϕ(x, c*_x),

where we repeatedly used Fact 2 and, in the last step, also that ϕ′(y_k, y) ≤ ϕ(y, x), which holds since x ∈ B_k(y) implies d(x, y) ≥ 2^k = d′(y_k, y). Hence,

    ϕ′((Y′, w*_in), C*) = Σ_{y_k ∈ Y′} w*_in(y_k) · ϕ′(y_k, C*)      (30)
                        = Σ_{y_k ∈ Y′} Σ_{x ∈ X*_in ∩ B_k(y)} ϕ′(y_k, C*)
                        ≤ Σ_{y_k ∈ Y′} Σ_{x ∈ X*_in ∩ B_k(y)} (6 ϕ(y, x) + 4 ϕ(x, c*_x))    (Eq. (29))
                        ≤ 6 Σ_{x ∈ X*_in ∩ X_in} ϕ(y_x, x) + 4 Σ_{x ∈ X*_in} ϕ(x, c*_x)
                        ≤ O(α · OPT) + 4 · OPT
                        = O(α · OPT).

Now consider the output C_A and w^A_out of A; that is, for each y_k ∈ Y′ the algorithm A decides which integer weight w^A_out(y_k) of y_k is labeled as outliers. Define w^A_in(y_k) = w_k(y) − w^A_out(y_k). This solution of the instance (Y′, w′) in M′ with Σ_{y_k ∈ Y′} w^A_out(y_k) outliers naturally defines a solution of the original instance X with |X_out| + Σ_{y_k ∈ Y′} w^A_out(y_k) outliers: for each center c ∈ C_A in M′, we define f(c) ∈ M as follows. If c = y_k for some y and k, we let f(c) = y; otherwise, c = y for some y ∈ Y and we simply set f(c) = c. We choose f(C_A) := {f(c) : c ∈ C_A} ⊆ Y ⊆ X as our set of centers. Moreover, whenever w^A_out(y_k) is nonzero, we label w^A_out(y_k) arbitrary points of B_k(y) as outliers and call the resulting set of outliers X^A_out. Finally, we declare all points of X_out ∪ X^A_out to be outliers. Since Σ_{y_k ∈ Y′} w^A_out(y_k) ≤ (1 + ε) z_A, we get

    |X_out ∪ X^A_out| ≤ |X_out| + (1 + ε) z_A
                      = |X_out| + (1 + ε)(z − |X_out| + 2αεz)
                      = (1 + O(αε)) z,

which means that in total we label only (1 + O(αε)) z points as outliers, as desired.
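The mapping-back step just described can be sketched as follows. This is our own illustration: the names map_back and buckets are hypothetical, the bucket keys (y, k) assume y is hashable (e.g., an index into Y), and in the actual distributed setting the lists of points forming B_k(y) live on the individual sites, which would materialize their own shares of X^A_out; here we collapse them into one dictionary for brevity.

    def map_back(C_A, w_A_out, buckets):
        # C_A: centers returned by A in M', encoded as in d_M_prime above;
        # w_A_out: dict (y, k) -> integer weight of y_k declared an outlier;
        # buckets: dict (y, k) -> list of the actual points of B_k(y).
        # f collapses every auxiliary point y_k back to y and keeps y as is,
        # so in both encodings the original point is the second entry.
        centers = [c[1] for c in C_A]
        X_A_out = []
        for (y, k), cnt in w_A_out.items():
            # any cnt points of B_k(y) may be declared outliers; the choice
            # is arbitrary and does not affect the analysis
            X_A_out.extend(buckets[(y, k)][:cnt])
        return centers, X_A_out

Note that the returned centers all lie in Y ⊆ X, so they form a valid solution of the original instance.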
We now bound the cost of the set X \ (X_out ∪ X^A_out) with respect to the set of centers f(C_A). To that end, note that for an arbitrary x ∈ B_k(y) and any c ∈ M′ we have ϕ′(y_k, c) = (2^k + d′(y, c))², and therefore

    ϕ(x, f(c)) ≤ (d(x, y) + d(y, f(c)))²                             (31)
               ≤ (2^{k+1} + d(y, f(c)))²           (x ∈ B_k(y))
               ≤ (2^{k+1} + 2 d(y, f(c)))²
               ≤ (2^{k+1} + 2 d′(y, c))²           (definition of f and d′)
               = 4 ϕ′(y_k, c).

Hence, we have

    ϕ(X \ (X_out ∪ X^A_out), f(C_A))                                 (32)
        = Σ_{y_k ∈ Y′} Σ_{x ∈ B_k(y) \ X^A_out} ϕ(x, f(C_A))
        ≤ Σ_{y_k ∈ Y′} Σ_{x ∈ B_k(y) \ X^A_out} 4 ϕ′(y_k, C_A)       (Eq. (31))
        = 4 Σ_{y_k ∈ Y′} w^A_in(y_k) · ϕ′(y_k, C_A)
        ≤ O(β) · ϕ′((Y′, w*_in), C*)                                 (A is a β-approximation; Lemma 3)
        = O(αβ · OPT).                                               (Eq. (30))
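The factor 4 in Eq. (31) is easy to check numerically. The following sketch is ours: it fixes the simple case c = y′ ∈ Y, so that f(c) = c and d′(y, c) = d(y, y′), and verifies ϕ(x, f(c)) ≤ 4 ϕ′(y_k, c) on random planar inputs (clamping d(x, y) to at least 1 again reflects the assumption on minimum distances).

    import numpy as np

    def phi_prime(k, d_prime_y_c):
        # squared distance in M' from the auxiliary point y_k to a point c,
        # using d'(y_k, c) = 2**k + d'(y, c), the shortest path through y
        return (2.0 ** k + d_prime_y_c) ** 2

    rng = np.random.default_rng(0)
    for _ in range(1000):
        y = rng.uniform(-10, 10, size=2)
        c = rng.uniform(-10, 10, size=2)           # a center c = y' in Y
        x = y + rng.uniform(-4, 4, size=2)         # a point assigned to y
        d_xy = max(float(np.linalg.norm(x - y)), 1.0)
        k = int(np.floor(np.log2(d_xy)))           # x falls into B_k(y)
        lhs = float(np.linalg.norm(x - c)) ** 2    # phi(x, f(c))
        rhs = 4 * phi_prime(k, float(np.linalg.norm(y - c)))
        assert lhs <= rhs + 1e-9                   # Eq. (31)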
Note that, assuming OPT/(2εz) ≤ Θ ≤ OPT/(εz), Theorem 13 implies that Σ_j τ_Θ(X_j ∩ X*_in, Y_j) = O(OPT) with positive constant probability, and therefore we may assume α = O(1). This implies that our final solution has cost O(αβ · OPT) = O(β) · OPT and, moreover, that it outputs at most (1 + O(αε)) z = (1 + O(ε)) z outliers, as desired. It remains to discuss how to remove the assumption that the algorithm knows OPT. To that end, we again run the algorithm for O(log(n∆)) guesses of OPT, namely for all values 2^e with e ∈ [log(n∆)], in parallel as before. Each machine sends all O(log n) respective weighted sets (Y′, w′) and |X_{out,j}| to the coordinator, which outputs the set of candidate centers minimizing ϕ′((Y′, w^A_in), C_A) among those guesses for which X_out satisfies |X_out| ≤ z. Eq. (32) certifies that, with positive constant probability, we obtain an O(β)-approximate solution that labels at most (1 + O(ε)) z points as outliers.
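The guessing of OPT can be organized as in the following sketch, which is ours: run_one_guess is a hypothetical callback standing for one complete run of the distributed algorithm with a fixed threshold Θ, returning the candidate centers, the discarded set X_out, and the proxy cost ϕ′((Y′, w^A_in), C_A).

    import math

    def solve_with_unknown_opt(run_one_guess, n, delta, z, eps):
        # Try all guesses 2**e with e in [log(n * delta)].  The paper runs
        # the guesses in parallel; this sketch runs them sequentially.
        best = None
        for e in range(int(math.ceil(math.log2(n * delta))) + 1):
            theta = (2.0 ** e) / (2 * eps * z)
            # for the guess with OPT <= 2**e <= 2 * OPT, theta satisfies
            # OPT / (2 * eps * z) <= theta <= OPT / (eps * z)
            centers, X_out, proxy_cost = run_one_guess(theta)
            if len(X_out) <= z and (best is None or proxy_cost < best[2]):
                best = (centers, X_out, proxy_cost)
        return best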