[PDF] A Constant Approximation Algorithm for Sequential Random-Order No-Substitution k-Median Clustering

Abstract

Full PDF

aa r X i v : . [ c s . L G ] F e b A Constant Approximation Algorithmfor Sequential No-Substitution k -Median Clusteringunder a Random Arrival Order Tom Hess Michal Moshkovitz Sivan Sabato

Department of Computer Science, Ben Gurion University, Beer-Sheva, Israel Qualcomm Institute, University of California, San Diego, USA [email protected] , [email protected] , [email protected] Abstract

We study k -median clustering under the sequential no-substitution setting. In this setting,a data stream is sequentially observed, and some of the points are selected by the algorithmas cluster centers. However, a point can be selected as a center only immediately after it isobserved, before observing the next point. In addition, a selected center cannot be substitutedlater. We give a new algorithm for this setting that obtains a constant approximation factoron the optimal risk under a random arrival order. This is the ﬁrst such algorithm that holdswithout any assumptions on the input data and selects a non-trivial number of centers. Thenumber of selected centers is quasi-linear in k . Our algorithm and analysis are based on acareful risk estimation that avoids outliers, a new concept of a linear bin division, and repeatedcalculations using an oﬄine clustering algorithm. Clustering is a fundamental unsupervised learning task used for various applications, such asanomaly detection (Leung and Leckie, 2005), recommender systems (Shepitsen et al., 2008) andcancer diagnosis (Zheng et al., 2014). In recent years, research on sequential clustering has beenactively studied, motivated by applications in which data arrives sequentially, such as online rec-ommender systems (Nasraoui et al., 2007) and online community detection (Aggarwal, 2003).In this work, we study k -median clustering in the sequential no-substitution setting, a termﬁrst introduced in Hess and Sabato (2020). In this setting, a stream of data points is sequentiallyobserved, and some of these points are selected by the algorithm as cluster centers. However, a pointcan be selected as a center only immediately after it is observed, before observing the next point.In addition, a selected center cannot be substituted later. This setting is motivated by applicationsin which center selection is mapped to a real-world irreversible action, such as providing users withpromotional gifts or recruiting participants to a clinical trial.The goal in the no-substitution k -median setting is to obtain a near-optimal k -median riskvalue, while selecting a number of centers that is as close as possible to k . For a stream in anadversarial arrival order, it has been shown (Moshkovitz, 2019) that in the worst case, all points1ust be selected as centers in order to obtain an approximation factor that does not depend onthe stream length. On the other hand, for a random arrival order, there exist algorithms withguarantees that are independent of the stream length (Hess and Sabato, 2020; Moshkovitz, 2019).However, Hess and Sabato (2020) assume a bounded metric space, while the approximation factorin Moshkovitz (2019) is exponential in k .We provide a new algorithm for sequential no-substitution k -median clustering that obtains aconstant approximation factor on the optimal risk under a random arrival order. This is the ﬁrstsuch algorithm that requires no additional assumptions on the input data set and selects a non-trivial number of centers. The number of centers selected by the algorithm is only quasi-linear in k . The guarantee is provided with a high probability over the random stream order.Our algorithm uses as a black box an oﬄine k -median approximation algorithm with a con-stant approximation factor. Several such eﬃcient algorithms exist, e.g., Charikar et al. (2002);Li and Svensson (2016). If the oﬄine algorithm is eﬃcient, then so is our algorithm.Our analysis relies on several new techniques. First, we employ a careful method for estimatingthe optimal risk from a stream preﬁx. This is challenging, since small clusters in the optimalsolution can be poorly represented in the preﬁx. To overcome this issue, we estimate the risk whilecarefully ignoring outliers. We also provide a new set construction of a linear bin division , whichallows a high-probability association of the risks of stream subsets, while selecting a number ofcenters independent of the stream length. This improves a previous construction of Meyerson et al.(2004). Main results.

Our main algorithm,

RepSelect , obtains a constant approximation factor withprobability − δ , while selecting only O ( k log ( k/δ )) centers. To deﬁne RepSelect , we ﬁrst providea simpler algorithm,

EstSelect , which uses risk estimation to select centers.

EstSelect alreadyobtains a constant approximation factor, but selects ˜ O ( k /δ ) centers. This is improved in RepSelect by repeatedly calling

EstSelect as a black box with suitable parameters. We show that this reducesthe number of selected centers without changing the approximation factor.While

RepSelect selects a smaller number of centers, we believe that

EstSelect may also be ofinterest as a stand-alone algorithm, since it is very fast: it only runs the oﬄine black-box algorithmonce, on a small fraction of the input stream. The guarantees for

EstSelect and

RepSelect areprovided in our main theorems, Theorem 4.1 and Theorem 4.2. Table 1 compares the guaranteesof

RepSelect to previous works; See Section 2 for additional details.

Paper structure.

We discuss related work in Section 2. The formal setting and notations aredeﬁned in Section 3. The algorithms and their guarantees are given in Section 4. In Section 5, westate the core theorem that is used to prove the guarantees, and provide an overview of the proof.Proofs of most lemmas are deferred to appendices. We conclude with a discussion in Section 6.

Several works have studied settings related to the no-substitution k -median clustering setting. Ta-ble 1 summarizes the upper bounds mentioned below.First, some works studied related settings under an adversarial arrival order. Liberty et al.(2016) studied online k -means clustering, in which centers are sequentially selected, and each ob-served point is allocated to one of the previously selected centers. The proposed algorithm can2efs arrival assumptions approximation numberorder factor of centersLSS16 adversarial bounded aspect ratio O (log( n )) O ( k log ( n )) BR20 adversarial bounded aspect ratio constant O ( k log ( n )) BM20 adversarial data properties O ( k ) O ( k log( k ) log( n )) Mo19 random none exponential in k O ( k log( n )) Mo19 random none exponential in k O ( k ) HS20 random bounded diameter constant + additive k Ours random none constant O ( k log ( k )) Table 1: Comparing our guarantees to previous works. n is the stream length. Abbre-viations: LSS16: Liberty et al. (2016), BR20: Bhaskara and Rwanpathirana (2020), BM20:Bhattacharjee and Moshkovitz (2020) Mo19: Moshkovitz (2019), HS20: Hess and Sabato (2020).be applied to the no-substitution setting, yielding an approximation factor of O (log( n )) , where n is the stream length. The number of selected centers in this algorithm depends on the streamlength and also on the aspect ratio of the input data. Bhaskara and Rwanpathirana (2020) stud-ied the same setting as Liberty et al. (2016), improving the approximation factor to a constant.Bhattacharjee and Moshkovitz (2020) explicitly studied the no-substitution setting, and providedan approximation factor of O ( k ) that depends on properties of the input data set. Moshkovitz(2019) showed that under an adversarial arrival order, obtaining an approximation factor that doesnot depend on the stream length requires selecting all the stream points as centers in the worstcase.The case of a random arrival order has also been studied in recent works. Moshkovitz (2019)obtained an approximation factor that is exponential in k , with a number of centers that is linearin k , but also logarithmic in the stream length n . This work further shows that if the streamlength is not known in advance and the approximation factor does not depend on n , then one mustselect Ω(log( n )) centers. Indeed, all the works mentioned above consider a stream with an unknownlength and select a number of centers which depends polylogarithmically on the stream length.Assuming a known stream length, Moshkovitz (2019) obtained an approximation factor whichis exponential in k but independent of the stream length, while selecting O ( k ) centers. Thisguarantee holds with a constant probability. Hess and Sabato (2020) proposed an algorithm thatselects exactly k centers. They obtained a constant approximation factor, with an additionaladditive term that vanishes for large streams, under the assumption that the metric space has abounded diameter. Their guarantee is with high probability. To date, no algorithm is known forthe no-substitution k -median setting that obtains a constant approximation factor under a randomarrival order without additional assumptions on the input data set, while selecting a non-trivialnumber of centers.A related sequential clustering setting is the streaming setting (e.g., Guha et al., 2000; Ailon et al.,2009; Chen, 2009; Ackermann et al., 2012; Braverman et al., 2016), in which the main limitation isthe amount of memory available to the algorithm. This setting allows substituting selected centers,but algorithms in this setting can be used in the no-substitution setting, by collecting all the centers3ver selected. However, we are not aware of any algorithm in this setting that has a competitivebound on the total number of selected centers. For an integer i , denote [ i ] := { , . . . , i } . Let ( X, ρ ) be a ﬁnite metric space, where X is a setof size n and ρ : X × X → R + is a metric. For a point x ∈ X and a set T ⊆ X , we denote ρ ( x, T ) := min y ∈ T ρ ( x, y ) . For an integer k ≥ , a k -clustering of X is a set of (at most) k pointsfrom X which represent cluster centers. Throughout this work, whenever an item is selected basedon minimizing ρ , we assume that ties are broken based on a ﬁxed arbitrary ordering. Given a set S ⊆ X , the k -median risk of T on S is R ( S, T ) := P x ∈ S ρ ( x, T ) . The k -median clustering problem aims to select a k -clustering T of X with a minimal overall risk R ( X, T ) . We denote by OPT anoptimal solution to this problem:

OPT ∈ argmin T ⊆ X, | T |≤ k R ( X, T ) .In the sequential no-substitution k -median setting that we study, X is not known a priori. Thepoints from X are presented to the algorithm sequentially, in a random order. We assume that thestream length, n , is provided as input to the algorithm; see Section 6 for a discussion on supportingan unknown stream length. The algorithm may select an observed point as a center only beforeobserving the next point. Any selected point cannot later be removed or substituted. The goal ofthe algorithm is to select a small number of centers, such that with a high probability, the overallrisk of the selected set on the entire X approximates the optimal k -median risk, R ( X, OPT) .An oﬄine k -median algorithm A is an algorithm that takes as input a ﬁnite set of points S and the parameter k , and outputs a k -clustering of S , denoted A ( S, k ) . For β ≥ , we saythat A is a β -approximation oﬄine k -median algorithm on ( X, ρ ) , if for all input sets S ⊆ X , R ( S, A ( S, k )) ≤ β · R ( S, OPT S ) , where OPT S is an optimal solution on S . Formally, OPT S ∈ argmin T ⊆ S, | T |≤ k R ( S, T ) .Throughout this work, we set lengths of sub-streams using calculations which may lead to non-integers values. For simplicity of presentation, we do not explicitly round these non-integers. Theeﬀect of such rounding on the analysis is negligible for reasonably large streams. In this section, we describe the two proposed algorithms,

EstSelect and

RepSelect , and statetheir guarantees. We start in Section 4.1 with

EstSelect , which obtains a constant approximationfactor and selects ˜ O ( k /δ ) centers. Then, in Section 4.2, we describe the full algorithm, RepSelect ,which uses

EstSelect as a black box and obtains the same approximation factor as

EstSelect ,but selects only O ( k log ( k/δ )) centers. EstSelect

The ﬁrst algorithm that we present is

EstSelect . The core challenge addressed in this algorithm isthe use of a small preﬁx of the stream to obtain a reliable estimate of the optimal risk on the entirestream. The usefulness of such an estimate for center selection was observed, e.g., in Liberty et al.(2016). Under a random arrival order, one may hope that obtaining a good estimate would bestraight-forward. However, since the metric space is unbounded, even a small number of outlierscan bias the risk estimate considerably. We overcome this challenge by showing that it suﬃces to4 lgorithm 1

EstSelect input δ ∈ (0 , , k ∈ N , n ∈ N (stream length), StreamRead (the stream access oracle); A (an oﬄine k -median algorithm);Technical parameters: α ∈ (0 , ] , γ ∈ ( α, − α ] , M ∈ N , τ > . △ Phase 1 (Calculate an initial clustering T α ) : P ← the ﬁrst αn points from StreamRead . T α := { c , . . . c k + } ← A ( P , k + ) △ Only calculation, no centers are actually selected here. △ Phase 2 (Estimate the risk of T α on large optimal clusters) : P ← next αn points from StreamRead . ψ ← α R α ( k +1) φ α ( P , T α ) △ ψ estimates the risk of T α on large optimal clusters . △ Phase 3 (Select centers) : ∀ i ∈ [ k + ] , n i ← , Near i ← FALSE . △ n i counts selected points associated with c i . Near i indicates if a point close to c i was selected. for γn iterations do Read the next point x from StreamRead . i ← argmin i ∈ [ k + ] ρ ( x, c i ) . △ Find the closet center to x in T α . if ρ ( x, c i ) > ψkτ or ¬ Near i or n i ≤ M then Select x as a center. △ This is the only line in which actual selections occur. n i ← n i + 1 . If ρ ( x, c i ) ≤ ψkτ then Near i ← TRUE . end if end for estimate the optimal risk not including outliers, and that this risk can be estimated reliably froma preﬁx of the stream. Our analysis is based on a distinction between small and large optimalclusters, which we formally deﬁne later. EstSelect is listed as Alg. 1. It uses the following notation. For two sets

S, T ⊆ X and aninteger r , denote by far r ( S, T ) ⊆ S the set of r points in S that are the furthest from T accordingto the metric ρ . If | S | < r , we deﬁne far r ( S, T ) := S and call far r ( S, T ) a trivial far set . Denoteby R r ( S, T ) := R ( S \ far r ( S, T ) , T ) the risk of T on S after discounting the r points that incur themost risk.Let k + := k + 38 log(32 k/δ ) . For a real value α > , let φ α := α log(32 k/δ ) . EstSelect receives as input the value of k , a conﬁdence parameter δ ∈ (0 , , the total streamlength n , and access to a sequential oracle StreamRead , which provides the points of the streamone by one. We further assume access to some black-box oﬄine k -median algorithm A . Theguarantees of EstSelect depend on the approximation factor obtained by A . EstSelect alsorequires several technical parameters, denoted α, γ, M and τ . These parameters will have ﬁxedvalues when EstSelect is used as a standalone algorithm, but will later vary when

EstSelect isused as a black box in

RepSelect . We explain the meaning of these parameters in the propercontext below.

EstSelect works in three phases. For i ∈ [3] , we denote by P i the set of points that are readin phase i . In each of the ﬁrst two phases, an α fraction of the points in the stream is read. In thelast phase, a γ fraction of the points in the stream is read. When EstSelect is run as a standalonealgorithm, γ is set to − α , so that P ∪ P ∪ P = X . The ﬁrst two phases are used for calculations.Centers are selected only during the third phase.5n the ﬁrst phase, a reference clustering T α is calculated. The centers in T α cannot be selectedas centers by EstSelect , since the oﬄine algorithm calculates T α only after observing all the points.In the second phase, an estimate ψ of the risk of T α is calculated. In the third phase, EstSelect observes points from the stream one by one, deciding for each one whether it should be selected asa center. For each observed point,

EstSelect ﬁrst ﬁnds the center c i ∈ T α which is closest to it.A point is then selected as a center if it satisﬁes one of three conditions: (1) its distance from c i ismore than ψ/ ( kτ ) (where τ is one of the input technical parameters); (2) c i does not yet have M associated points (where M is one of the technical parameters); or (3) no point close to c i has beenselected so far, as maintained by the Boolean variable Near i .The following theorem provides the guarantee for EstSelect , when running as a standalonealgorithm with a speciﬁc setting of the technical parameters. The proof is provided in Section 5.

Theorem 4.1.

Let k, n ∈ N . Let δ ∈ (0 , . Let ( X, ρ ) be a ﬁnite metric space of size n . Let T out be the set of centers selected by EstSelect for the input parameters δ, k, n and the technicalparameters α := δ/ (4 k ) , γ := 1 − α, M := log(8 k + /δ ) , τ := φ α . Suppose that StreamRead outputsa random permutation of X , and that A is a β -approximation oﬄine k -median algorithm. Then,with a probability at least − δ ,1. | T out | = O ( k δ log( kδ )) , and2. R ( X, T out ) ≤ Cβ · R ( X, OPT) , where C > is a universal constant. We sketch here some of the main ideas of the analysis of

EstSelect . Consider the clustersinduced by some ﬁxed optimal k -median clustering on X . Call these clusters optimal clusters .Optimal clusters that are large (that is, include a suﬃciently large fraction of X ) are easy toidentify from a small random subset of points. Therefore, approximate centers for these clusterswill be identiﬁed in the ﬁrst phase of EstSelect by the black box algorithm A , and will be includedin T α . Thus, the risk of T α on the subset of points that belong to large optimal clusters will beclose to optimal. Note that T α is a clustering with k + > k centers. The reason for this is thefollowing. The data set might include outliers which are not proportionally represented in eachphase. Searching for a k -clustering in P might thus lead to a clustering that is too biased towardssuch outliers. By increasing the size of the solution searched in the ﬁrst phase to k + , this allowsthe solution to include the outliers as centers, while still choosing also centers that are close to theoptimal centers of all large clusters. The selection conditions of the third phase further guaranteethat at least M points are selected near each center c i ∈ T α which represents a large cluster, thusmaking sure that a center which is very close to each such c i is selected. This allows bounding theobtained risk on points in large optimal clusters.The remaining challenge is to make sure that points in small optimal clusters are not too farfrom the set T out of all selected centers. Since the metric space is unbounded, even a single suchpoint can destroy the constant approximation factor. To overcome this, EstSelect selects a pointnear each center in T α , as well as all the points that are far from T α . The threshold ψ/ ( kτ ) , whichdeﬁnes a point as far, is set so that the number of points selected from large optimal clusters canbe bounded. This bound crucially relies on the accuracy of ψ as an estimate for the risk of T α onlarge clusters. The required accuracy is obtained by ignoring the points that are furthest from T α when calculating the risk on P (see line 6). For small clusters, they hold a small number of pointsby deﬁnition, thus the number of such points selected as centers is also bounded.6 lgorithm 2 RepSelect input δ ∈ (0 , , k ∈ N , n ∈ N (stream length), StreamRead (the stream access oracle); A (an oﬄine k -median algorithm). α ← δ/ (4 k ) ; For i ∈ N , α i ← α · i − . I ← ⌊ log (1 / (6 α )) ⌋ , δ ′ ← δ/ ( I + 1) . △ I is set so that α I +1 ∈ (1 / , / . Prepare I + 1 copies of EstSelect , indexed by i ∈ [ I + 1] , as follows:In all copies, use inputs k, n, A , τ ← φ α I +1 , and set the conﬁdence parameter to δ ′ .In copy i ∈ [ I + 1] , set α ← α i .In all but the last copy, set M ← and γ ← α i .In the last copy (index I + 1 ), set M ← log(8 k + /δ ′ ) , and γ ← − α I +1 . Read each point from

StreamRead and feed it to each of the copies of

EstSelect .If any of the copies selected the point, then select it as a center.

RepSelect

So far, we showed how to obtain a constant approximation factor using a number of centers whichis polynomial in k and does not depend on the stream length n , without any assumptions on X .This already improves over previous results, as can be seen in Table 1. Nonetheless, the quadraticdependence on k and the linear dependence on /δ in the number of selected centers are largerthan one would hope. We now provide the improved algorithm RepSelect , which runs

EstSelect several times as a black box, and selects signiﬁcantly fewer centers.

RepSelect is listed in Alg. 2. Similarly to

EstSelect , RepSelect receives as input the valueof k , the conﬁdence level δ ∈ (0 , , the total stream length n , and access to a sequential oracle StreamRead which provides the points of the stream one by one. We also similarly assume accessto some black-box oﬄine k -median algorithm A . RepSelect runs several copies of

EstSelect on the same input stream: each point is read from

StreamRead and then fed to each of the copies of

EstSelect . Each of these copies selects somecenters, and the set of all selected centers is the solution of

RepSelect . The diﬀerence between thecopies is in the value of the technical input parameters. First, the value of α is progressively doubled,starting with α ≡ δ/ (4 k ) and ending with α I +1 ≡ α · I , where I is set so that α I +1 ∈ (1 / , / .The value of τ in all copies is set to φ α I +1 , based on the largest value of α . In all but the last copy, γ is set to α i , where i is the copy index. This means that the ﬁrst and second phase of copy i are each of size α i , while the third phase is of size α i (the rest of the stream is ignored by copy i ). Therefore, copy i observes a α i = 2 α i +1 fraction of the stream. In other words, the ﬁrst twophases of copy i + 1 exactly overlap with the fraction of the stream observed by copy i . Figure 1illustrates the overlap of phases in consecutive copies of EstSelect .In the last copy, indexed by I + 1 , γ is set to − α I +1 , thus this copy reads the entire stream.It can be seen that each point, except for the ﬁrst nα points, participates in the third phase ofexactly one copy of EstSelect .As our analysis below shows, this overlap between the phases guarantees that the set of centersselected by the copies of

EstSelect obtains a small risk on all the small optimal clusters. The lastcopy is slightly diﬀerent: in addition to setting the length of the third phase γ to include all thepoints of the stream, it also sets the technical parameter M to a number larger than . We showthat in this way, the selected set of centers obtains a small risk on large optimal clusters as well. RepSelect obtains the same approximation factor as

EstSelect , but selects a smaller number7f centers. This is stated in the following theorem, which is proved in Section 5.

Theorem 4.2.

Let k, n ∈ N . Let δ ∈ (0 , . Let ( X, ρ ) be a ﬁnite metric space of size n . Let T out bethe set of centers selected by RepSelect for the input parameters δ, k, n . Suppose that

StreamRead outputs a random permutation of X , and that A is a β -approximation oﬄine k -median algorithm.Then, with a probability at least − δ ,1. | T out | ≤ O (cid:0) k log ( k/δ ) (cid:1) , and2. R ( X, T ) ≤ Cβ · R ( X, OPT) , where

C > is the same universal constant as in Theorem 4.1. RepSelect reduces the number of selected centers by overcoming the following limitation of

EstSelect : On the one hand, the ﬁrst phase and the second phase must be small, otherwise wemight miss a point that needs to be selected, since no points are selected in these two phases. Onthe other hand, small ﬁrst and second phases lead to a poor quality of the reference clustering T α .The phase sizes of the diﬀerent copies of EstSelect make sure that all points except for a small α fraction participate in some selection phase, while at the same time improving the quality of T α as α grows. Our analysis below shows that in this way, the number of centers selected by eachcopy remains similar, even though larger values of α lead to a larger selection phase.In the next section, we state a core theorem, use it to prove the theorems stated above for EstSelect and

RepSelect , and give an overview of its proof.

In this section, we give an overview of our proof of Theorem 4.1 and Theorem 4.2. First, inSection 5.1, we present our core theorem, Theorem 5.1, which is used to prove both theorems.Then, in Section 5.2, we introduce the concept of linear bin divisions and derive useful properties.In Section 5.3, we state and discuss the main lemmas used in the proof of Theorem 5.1, and usethem to prove the theorem. Proofs of lemmas are provided in the appendices.

In this section, we state our core theorem, which is used for proving Theorem 4.1 and Theorem 4.2.The core theorem provides a guarantee for

EstSelect under general values of the technical inputparameters. First, we deﬁne some notation. Recall that

OPT is an optimal k -median clustering for X . Denote the centers in OPT by c ∗ , . . . , c ∗ k . Denote by C ∗ , . . . C ∗ k the clusters induced by OPT ,deﬁned as C ∗ i := { x ∈ X | i = argmin j ∈ [ k ] ρ ( c ∗ j , x ) } . Recall that ties are broken based on a ﬁxedarbitrary ordering. We call { C ∗ i } the optimal clusters . α α α P P P P P P P (selection) P (selection) P (selection)2 α α α α α α Figure 1: Illustrating the phases in copies , , of EstSelect within

RepSelect .8e deﬁne α -small and α -large optimal clusters as follows: all optimal clusters that have fewerthan φ α points are considered α -small. All other optimal clusters are considered α -large. Formally,denote the indices of the α -small clusters by I α small := { i ∈ [ k ] | | C ∗ i | < φ α } and the indices of the α -large optimal clusters by I α large := [ k ] \ I α small . Denote the set of points in α -small optimal clustersby C α small := S i ∈ I α small C ∗ i and their complement by C α large := S i ∈ I α large C ∗ i . We now present the coretheorem. Theorem 5.1.

Let k, n ∈ N . Let δ ∈ (0 , . Let ( X, ρ ) be a ﬁnite metric space of size n . Let T out be the set of centers selected by EstSelect for the input parameters δ, k, n and for technicalparameters α ∈ (0 , / , γ ∈ ( α, − α ] , M ∈ N , τ > . Suppose that StreamRead outputs arandom permutation of X , and that A is a β -approximation oﬄine k -median algorithm. Then,with a probability at least − δ/ ,1. | T out | = O (cid:0) γα k log( k/δ ) + kτ + ( k + log( k/δ )) M (cid:1) ;2. For any i ∈ [ k ] , if c ∗ i ∈ P , then R ( C ∗ i , T out ) ≤ R ( C ∗ i , { c ∗ i } ) + | C ∗ i | kτ (36 β + 20) R ( X, OPT); (1)

3. If γ = 1 − α , τ = φ α and M = log(8 k + /δ ) , then R ( C α large , T out ) ≤ R ( C α large , OPT) + (468 β + 260) R ( X, OPT) . (2)Using Theorem 5.1, the proofs of Theorem 4.1 and Theorem 4.2 are relatively straight forward.The second claim of the theorem is used to bound the risk on small optimal clusters, while the thirdclaim is used to bound the risk of large optimal clusters. The proofs of both theorems are providedin Appendix A. In the next section, we deﬁne the concept of a linear bin division, which is used inthe proof of Theorem 5.1. In this section, we provide essential preliminaries for the proof of Theorem 5.1. A main tool inthis proof is partitioning sets of points from the input stream into subsets, such that each of thesubsets is probably well represented in a relevant random subset of the stream. Meyerson et al.(2004) deﬁned the concept of a bin division of X with respect to a clustering T , which partitions X into bins of equal size, with the points allocated to bins based on their distance from T . Here, wedeﬁne the concept of a linear bin division , in which the bins linearly increase in size. This gradualincrease allows keeping the size of the ﬁrst bin independent of the stream length, while still provingthat with a high probability, the overlap of each of the bins in the division with random subsetsof the stream is close to expected. The ﬁxed size of the ﬁrst bin is crucial for deriving guaranteesthat are independent of the stream length. In addition, the ratio between adjacent bin sizes is keptbounded, which allows proving a bounded approximation factor. Deﬁnition 5.2 (Linear bin division) . Let

W, T ⊆ X be ﬁnite sets. Let z be an integer. A z -linearbin division of W with respect to T is a partition B ≡ ( B (1) , . . . , B ( L )) of W (for some integer L )such that the following properties hold: . If z ≤ | W | , then ∀ i ∈ [ L ] , |B ( i ) | ≥ z · ( i + 1) / . Otherwise, the bin division is called trivial ,and deﬁned as B := B (1) := W .2. |B (1) | ≤ z .3. ∀ i ∈ [ L − , |B ( i + 1) | / |B ( i ) | ≤ / .4. ∀ i ∈ [ L − , and ∀ x ∈ B ( i ) , x ′ ∈ B ( i + 1) , it holds that ρ ( x, T ) ≥ ρ ( x ′ , T ) . A linear bin division exists for any size of W : For | W | ≤ z , the conditions trivially hold. For | W | ≥ z , the three ﬁrst properties hold for the following allocation of bin sizes: Let L be the largestinteger such that B := P i ∈ [ L ] z · ( i + 1) / ≤ | W | . Set the size of B ( i ) to z · ( i + 1) / | W | − B ) /L .Property 1 clearly holds. Property 2 holds since ( | W | − B ) /L ≤ ( z ( L + 2) / /L ≤ z/ . Property3 holds since ( i + 2 + a ) / ( i + 1 + a ) ≤ / for all i ≥ and any non-negative a . To satisfy property4, allocate the elements of W into the bins in descending order of their distance from T .We say that a set is well-represented in another set if the size of its overlap with the set is similarto expected. This extends naturally to bin divisions. Formally, this is deﬁned as follows. Deﬁnition 5.3 (Well-represented) . Let W be a ﬁnite set and let A, B ⊆ W . We say that B is well-represented in A for W if | B ∩ A | / | B | ∈ [ r/ , (3 / r ] , where r := | A | / | W | . We say that a linearbin division B of W is well represented in A for W if each bin in B is well represented in A for W . The following lemma shows that if A is selected uniformly at random from W , then any ﬁxedset B is well-represented in A with a high probability. Moreover, the same holds for any z -linearbin division of W with a suﬃciently large z . The proof of the lemma is provided in Appendix B. Lemma 5.4.

Let W be a ﬁnite set. Let B ⊆ W be a set, and let B be a z -linear bin division ofsome subset of W with respect to some T , for some integer z . Let A ⊆ W be a set of size r | W | selected uniformly at random from W . Then the following hold:1. With a probability at least − e − r | B | , B is well-represented in A for W .2. If z ≥

10 log(4 /δ ) /r , then with a probability at least − δ , B is well represented in A for W . We now state and prove a main property of linear bin divisions, which will allow us to usesub-streams of the input stream to bound the risk of a clustering on X . Lemma 5.5.

Let W ⊆ X and let z ∈ N . Let B be a z -linear bin division of W with respect to some T . Let A ⊆ W and r ∈ (0 , . If ∀ i ∈ [ L ] , |B ( i ) ∩ A | / |B ( i ) | ≤ r , then R ( A \ B (1) , T ) ≤ rR ( W, T ) . Proof.

Let L be the number of bins in B . We ﬁrst prove the following inequality, which relates therisk of the intersection of a bin with A to the risk of the preceding bin. ∀ i ∈ [ L − , R ( A ∩ B ( i + 1) , T ) ≤ rR ( B ( i ) , T ) . (3)To prove Eq. (3), ﬁx i ∈ [ L − , and denote b := max x ∈B ( i +1) ρ ( x, T ) . By the assumptions, we have | A ∩ B ( i + 1) | ≤ r |B ( i + 1) | . Hence, R ( A ∩ B ( i + 1) , T ) ≤ r |B ( i + 1) | · b. By property 3 of linear bindivisions, |B ( i + 1) | ≤ |B ( i ) | . Therefore, R ( A ∩ B ( i + 1) , T ) ≤ r |B ( i ) | · b .In addition, by property 4 of linear bin divisions and the deﬁnition of b , ∀ x ∈ B ( i ) , b ≤ ρ ( x, T ) . Therefore, R ( B ( i ) , T ) ≥ |B ( i ) | · b . 10ombining the two inequalities, we get Eq. (3). It follows that: R ( A \ B (1) , T ) = L X i =2 R ( A ∩ B ( i ) , T ) ≤ r L − X i =1 R ( B ( i ) , T ) ≤ rR ( W, T ) . This proves the statement of the lemma.To prove Theorem 5.1, we deﬁne an event, denoted E , which holds with a high probability. Weprove each of the claims in the theorem under the assumption that this event holds. To deﬁne theevent, we ﬁrst provide some necessary notation.Consider a run of EstSelect with some ﬁxed set of input parameters, and assume that

StreamRead outputs a random permutation of the points in X . Recall that T α = { c , . . . , c k + } is the clusteringcalculated in the ﬁrst phase of EstSelect . Denote the clusters induced by T α on X by C , . . . , C k + ,where C i := { x ∈ X | i = argmin j ∈ k + ρ ( c j , x ) } . We deﬁne several sets and bin divisions, which willbe used to deﬁne the required event. Let B a be a ( φ α / -linear bin division of X with respect to OPT . Deﬁne F b := far kφ α ( X \ P , T α ) , and let B b be a ( φ α / -linear bin division of X \ ( P ∪ F b ) with respect to T α . Deﬁne F c := far k +1) φ α ( X \ P , T α ) , and let B c be a ( φ α / -linear bin divisionof X \ ( P ∪ F c ) with respect to T α . Lastly, deﬁne F d := far k +1) φ α ( X \ P , T α ) . Note that each ofthese objects may be trivial if the stream is small.We now deﬁne the event E .1. For each i ∈ I α large , C ∗ i is well-represented in P and in P ∪ P for X .2. Each of F b and F c is trivial or well-represented in P for X \ P ; F d is trivial or well-representedin P for X \ P .3. B a is trivial or well-represented in P for X ; Each of B b and B c is trivial or well-representedin P for X \ P ;4. For each i ∈ [ k + ] , one of the ﬁrst log (8 k + /δ ) points observed from C i in P is closer to c i than at least half of the points in C i ∩ P .The following lemma, proved in Appendix C, shows that E holds with a high probability. Lemma 5.6.

Consider a run of

EstSelect with ﬁxed input parameters. Assume that

StreamRead outputs the points of X in a random order. Then E holds with a probability at least − δ/ . In this section, we give an overview of the proof of Theorem 5.1. We state the main lemmas, andthen use them to prove the theorem. The lemmas are proved in Appendix D.In the ﬁrst phase of

EstSelect , the clustering T α is calculated. The following lemma boundsthe risk obtained by T α on points in α -large optimal clusters. Note that for a larger α , the guaranteeis stronger, since C α large is larger. This motivates the use of diverse α values in RepSelect . Lemma 5.7. If E holds, then R ( C α large , T α ) ≤ (18 β + 10) R ( X, OPT) . In the second phase of

EstSelect , ψ ≡ α R α ( k +1) φ α ( P , T α ) is calculated. ψ estimates the riskof T α on points in α -large optimal clusters, by calculating the risk of T α after removing outliers,11here outliers are the points that are furthest from T α in P . To see why this allows estimating therisk on large optimal clusters, recall that α -small optimal clusters are smaller than φ α , thus theirtotal size is less than kφ α . It follows that R kφ α ( X, T α ) ≤ R ( X \ C α small , T α ) ≡ R ( C α large , T α ) . (4) ψ is calculated by removing α ( k + 1) φ α outliers instead of kφ α , to account the smaller size of P and its randomness. The following lemma shows that ψ is indeed upper bounded by the risk of T α on large optimal clusters. Lemma 5.8. If E holds, then ψ ≤ R ( C α large , T α ) . In the third phase, all points that are further than ψ/ ( kτ ) from T α are selected as centers (inaddition to some other points). The following lemma upper bounds the number of selected centersthat satisfy this condition. Denote N := |{ x ∈ P | ρ ( x, T α ) > ψ/ ( kτ ) }| . Lemma 5.9. If E holds, then N ≤ k + 1) φ α γ − α + 9 kτ . Lastly, the two lemmas given below allow us to bound the risk of the set of selected centers T out on points in all optimal clusters. The ﬁrst lemma considers optimal clusters whose center isobserved in the third phase. This lemma will be used to bound the obtained risk on small optimalclusters. Lemma 5.10.

For any i ∈ [ k ] , if c ∗ i ∈ P then R ( C ∗ i , T out ) ≤ R ( C ∗ i , { c ∗ i } ) + 2 | C ∗ i | ψ/ ( kτ ) . The second lemma bounds the risk of the set of selected centers on large optimal clusters.

Lemma 5.11. If E holds, γ = 1 − α , τ = φ α and M = log(8 k + /δ ) , then R ( C α large , T out ) ≤ R ( C α large , OPT) + 26 R ( C α large , T α ) . Based on the lemmas above, the proof of Theorem 5.1 is straight-forward.

Proof of Theorem 5.1.

By Lemma 5.6, E holds with probability − δ/ . Assume that E holds. First,we upper bound | T out | . Note that EstSelect selects a point only if one of the three conditions in line12 hold. By Lemma 5.9, the number of points selected by

EstSelect due to the ﬁrst condition is N ≤ k + 1) φ α γ − α + 9 kτ . Since φ α = O (log( k/δ ) /α ) and α ≤ / , we have N = O (( γα k log( k/δ ) + kτ ) .The rest of the points selected by EstSelect are those satisfying one of the other conditions of line12. This allows at most M additional points for each i ∈ [ k + ] . Thus, the total number of pointsselected by EstSelect is O ( γα k log( k/δ ) + kτ + k + · M ) . Since by deﬁnition, k + = O ( k + log( k/δ )) ,the ﬁrst claim of Theorem 5.1 immediately follows.The second claim of Theorem 5.1 follows immediately, by combining Lemma 5.7, Lemma 5.8,and Lemma 5.10. The third claim is also immediate, by combining Lemma 5.7 and Lemma 5.11.All the lemmas stated above are proved in Appendix D, thus completing the proof of Theo-rem 5.1. 12 Discussion

We provided the ﬁrst algorithms for sequential no-substitution k -median clustering for the case ofa random arrival order, which obtain a constant approximation factor on the optimal risk withoutany assumptions on the input data. Moreover, the number of centers selected by RepSelect isquasi-linear in k . It is interesting to consider the possibility of running EstSelect or RepSelect with a bounded memory. Both of these algorithms require a memory size of only ˜ O ( k log(1 /δ )) ontop of requirements of the black-box algorithm. Thus, any bounded-memory streaming algorithm A can be used to derive an analogous bounded-memory no-substitution algorithm. In addition, onemay want to apply these algorithms without the assumption that the stream length n is known. Asimple doubling trick can be used for this purpose. In this case, the constant approximation factorwould be preserved, but the number of selected centers would multiply by Θ(log( n )) . As discussedin Section 2, this logarithmic dependence on n is unavoidable if the stream length is unknown. References

M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler.StreamKM++: A clustering algorithm for data streams.

Journal of Experimental Algorithmics(JEA) , 17:2–4, 2012.C. C. Aggarwal. A framework for diagnosing changes in evolving data streams. In

Proceedings ofthe 2003 ACM SIGMOD international conference on Management of data , pages 575–586. ACM,2003.N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In

Advances in neuralinformation processing systems , pages 10–18, 2009.A. Bhaskara and A. K. Rwanpathirana. Robust algorithms for online k -means clustering. In Algorithmic Learning Theory , pages 148–173, 2020.R. Bhattacharjee and M. Moshkovitz. No-substitution k-means clustering with adversarial order. arXiv preprint arXiv:2012.14512 , 2020.V. Braverman, D. Feldman, and H. Lang. New frameworks for oﬄine and streaming coreset con-structions. arXiv preprint arXiv:1612.00889 , 2016.M. Charikar, S. Guha, É. Tardos, and D. B. Shmoys. A constant-factor approximation algorithmfor the k-median problem.

Journal of Computer and System Sciences , 65(1):129–149, 2002.K. Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and theirapplications.

SIAM Journal on Computing , 39(3):923–947, 2009.D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence.

BRICS ReportSeries , 3(25), 1996.S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In

Foundations ofcomputer science, 2000. proceedings. 41st annual symposium on , pages 359–366. IEEE, 2000.T. Hess and S. Sabato. Sequential no-substitution k-median-clustering. In

International Conferenceon Artiﬁcial Intelligence and Statistics , pages 962–972. PMLR, 2020.13. Joag-Dev and F. Proschan. Negative association of random variables with applications.

TheAnnals of Statistics , pages 286–295, 1983.K. Leung and C. Leckie. Unsupervised anomaly detection in network intrusion detection usingclusters. In

Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38 , pages 333–342. Australian Computer Society, Inc., 2005.S. Li and O. Svensson. Approximating k-median via pseudo-approximation.

SIAM Journal onComputing , 45(2):530–547, 2016.E. Liberty, R. Sriharsha, and M. Sviridenko. An algorithm for online k-means clustering. In ,pages 81–89. SIAM, 2016.A. Meyerson, L. O’callaghan, and S. Plotkin. A k-median algorithm with running time independentof data size.

Machine Learning , 56(1-3):61–87, 2004.M. Moshkovitz. Unexpected eﬀects of online k-means clustering. arXiv preprint arXiv:1908.06818 ,2019.O. Nasraoui, J. Cerwinske, C. Rojas, and F. Gonzalez. Performance of recommendation systems indynamic streaming environments. In

Proceedings of the 2007 SIAM International Conference onData Mining , pages 569–574. SIAM, 2007.A. Shepitsen, J. Gemmell, B. Mobasher, and R. Burke. Personalized recommendation in socialtagging systems using hierarchical clustering. In

Proceedings of the 2008 ACM conference onRecommender systems , pages 259–266. ACM, 2008.B. Zheng, S. W. Yoon, and S. S. Lam. Breast cancer diagnosis based on feature extraction using ahybrid of k-means and support vector machine algorithms.

Expert Systems with Applications , 41(4):1476–1482, 2014. 14

Proofs of Theorem 4.1 and Theorem 4.2

We now use Theorem 5.1 to prove the two theorems presented in Section 4. First, deﬁne thefollowing event, which we denote by G : ∀ i ∈ [ k ] , c ∗ i ∈ X is not observed in the ﬁrst n · δ/ (2 k ) reads from StreamRead .Observe that G holds with a probability at least − δ/ : For each c ∗ i , the probability that it isobserved in the preﬁx is δ/ (2 k ) , and a union bound over k centers gives a probability of at most δ/ that the event does not hold. We use this observation in the proofs below. First, we prove theguarantees of EstSelect . Proof of Theorem 4.1.

Theorem 5.1 and the event G hold simultaneously with a probability − δ .We henceforth assume that they both hold.To prove the ﬁrst part of the theorem, we bound | T out | by substituting the technical parametersof EstSelect in the ﬁrst part of Theorem 5.1. Set the parameters α := δ/ (4 k ) , γ := 1 − α = O (1) , M := log(8 k + /δ ) = O (log( k/δ )) and τ := φ α = O (log( k/δ ) /α ) = O (( k/δ ) log( k/δ )) . It then easilyfollows that | T out | = O (( k /δ ) log( k/δ )) , as claimed.We now prove the second part of the theorem, which bounds the risk approximation factor of T out on X . Since α = δ/ (4 k ) , we have | P ∪ P | = n · α = n · δ/ (2 k ) . In addition, since γ = 1 − α ,we have P = X \ ( P ∪ P ) . Hence, by the event G , ∀ i ∈ [ k ] , c ∗ i ∈ P . Thus, by the second part ofTheorem 5.1, Eq. (1) holds. We apply Eq. (1) to α -small optimal clusters, as follows: R ( C α small , T out ) = X i ∈ I α small R ( C ∗ i , T out ) ≤ X i ∈ I α small (cid:18) R ( C ∗ i , { c ∗ i } ) + | C ∗ i | kτ (36 β + 20) R ( X, OPT) (cid:19) ≤ R ( C α small , OPT) + (36 β + 20) R ( X, OPT) . (5)The last inequality follows since ∀ i ∈ I α small , | C ∗ i | ≤ φ α = τ , and | I α small | ≤ k .Now, we have γ := 1 − α , τ = φ α and M := log(8 k + /δ ) . Therefore, by the third part ofTheorem 5.1, Eq. (2) holds. Combining Eq. (2) with Eq. (5), and noting that X = C α small ∪ C α large ,the second part of Theorem 4.1 immediately follows.Next, we prove the guarantees of RepSelect . Proof of Theorem 4.2.

First, we show that we can assume that Theorem 5.1 holds simultaneouslyfor all copies of

EstSelect run in

RepSelect . The number of copies of

EstSelect is I + 1 . SinceAlg. 2 calls EstSelect with the conﬁdence parameter δ ′ , each of the executions of EstSelect satisﬁes the claims of Theorem 5.1 with a probability at least − δ ′ / . Since δ ′ ≡ δ/ ( I + 1) , by aunion bound, Theorem 5.1 holds for all executions simultaneously with a probability at least − δ/ .With a probability at least − δ , this holds simultaneously with the event G deﬁned above. Wehenceforth assume that all these events hold. Note that since α = δ/ (4 k ) , we have I + 1 ≡ ⌊ log (1 / (6 α )) ⌋ + 1 = O (log( k/δ )) . Therefore, δ ′ = Ω( δ/ log( k/δ )) and log( k/δ ′ ) = O (log( k/δ )) .To prove the ﬁrst part of Theorem 4.2, which upper bounds T out , we apply the ﬁrst claim ofTheorem 5.1 (with δ ′ ), which states that | T out | = O ( γα k log( k/δ ′ ) + kτ + ( k + log( k/δ ′ )) M ) , to eachof the copies of EstSelect . In all the copies of

EstSelect , we have τ := φ α I +1 = O (log( k/δ ′ )) = O (log( k/δ )) .

15n addition, in all but the last copy of

EstSelect , we have M := 1 and γ/α := 2 . Thus, the numberof centers selected in each of these copies of EstSelect is O ( k log( k/δ )) . There are I = O (log( k/δ )) such copies, hence the total number of centers selected in all but the last copy is O ( k log ( k/δ )) .In the last copy, EstSelect is called with M = O (log( k + /δ ′ )) = O (log( k/δ )) and γ = 1 − α I +1 .Since α I +1 > / , we have γ/α I +1 = O (1) . Therefore, again by the ﬁrst claim of Theorem 5.1,the number of centers selected by the last copy is O ( k log( k/δ ) + log ( k/δ )) . The overall numberof centers selected in RepSelect is thus O ( k log ( k/δ )) .We now prove the second part of Theorem 4.2, which bounds the approximation factor of T out on X . Since G holds, none of the centers c ∗ i , for i ∈ [ k ] , appears in the ﬁrst α n points read from StreamRead . Now, each point observed after this preﬁx appears in the selection phase of somecopy of

EstSelect in RepSelect . It follows that for each i ∈ [ k ] , Eq. (1) in the second part ofTheorem 5.1 holds for one of the copies.We apply Eq. (1) noting that in all copies, we have τ := φ α I +1 . In addition, all copies get thefull stream X as input. Moreover, the set of centers T out selected by RepSelect is a superset ofeach of the sets selected in each of the calls to

EstSelect , thus its risk cannot be larger. It followsthat for all i ∈ [ k ] , R ( C ∗ i , T out ) ≤ R ( C ∗ i , { c ∗ i } ) + | C ∗ i | kφ α I +1 (36 β + 20) R ( X, OPT) . In particular, for i ∈ I α I +1 small , we have | C ∗ i | ≤ φ α I +1 . Therefore, R ( C α I +1 small , T out ) = X i ∈ C αI +1small R ( C ∗ i , T out ) ≤ R ( C α I +1 small , OPT) + (36 β + 20) R ( X, OPT) . (6)Now, consider the last copy of EstSelect . In this copy, the conditions of the third claimof Theorem 5.1 hold, and so Eq. (2) holds. Combining Eq. (2) with Eq. (6), and noting that X = C α I +1 small ∪ C α I +1 large , the second part of Theorem 4.2 follows. B Proof of Lemma 5.4

In this section, we prove Lemma 5.4, stated in Section 5.2, which provides guarantees for theproperty of being well-represented.

Proof of Lemma 5.4.

For the ﬁrst claim, we use a multiplicative concentration bound of Dubhashi and Ranjan(1996). This bound states that for negatively associated random variables Z , . . . , Z n taking valuesin { , } , where Z := P i ∈ [ n ] Z i , it holds that P [ | Z − E [ Z ] | ≥ ǫ E [ Z ]] ≤ e − ǫ E [ Z ]2+ ǫ . Let Z i be equal to if the i ’th element of B is in A . By Joag-Dev and Proschan (1983), therandom variables of a uniform sample without replacement are negatively associated, hence theinequality above holds for { Z i } . In addition, Z = | B ∩ A | and E [ Z ] = r | B | . The ﬁrst claim followsby setting ǫ := 1 / . 16o prove the second claim, note that the ﬁrst claim implies that each set B ( i ) in B is well-represented in A for W with a probability at least − − r |B ( i ) | / . By property 1 of linearbin divisions, and the assumption that z ≥

10 log(4 /δ ) /r , we have |B ( i ) | ≥ z ( i + 1) / ≥ i + 1) log(4 /δ ) /r. Therefore, exp( − r |B ( i ) | / ≤ ( δ/ ( i +1) / . By a union bound over all the bins in B , the probability that at least one bin in B is not well-represented in A for W is upper-bounded by P ∞ i =1 ( δ/ ( i +1) / ≤ δ −√ δ ≤ δ , which proves theclaim. C Proof of Lemma 5.6

In this section, we prove Lemma 5.6, which states that the event E holds with high probability. Proof of Lemma 5.6.

First, we show that part 1 of E holds with a probability at least − δ/ . Thisis proved by applying the ﬁrst part of Lemma 5.4 to each of the clusters C ∗ i for i ∈ I α large , and notingthat their size is at least φ α . Applying the lemma with the sets P , X and with the sets P ∪ P , X and using a union bound over at most k large optimal clusters, it follows that the probability thatpart 1 does not hold is at most k exp( − αφ α /

10) = 4 k exp(15 log( δ/ (32 k )) = 4 k · ( δ/ (32 k )) ≤ δ/ .For part 2, we also show that it holds with a probability at least − δ/ . we similarly apply theﬁrst part of Lemma 5.4 to each of the required sets using the ratios | P | / | X \ P | = α/ (1 − α ) ≥ α and | P | / | X \ P | = γ/ (1 − α ) ≥ α (since γ ≥ α ). Since F b , F c , F d , if not trivial, are each of sizeat least kφ α , we get that the probability that any of these sets is not well represented is at most − αφ α / ≤ δ/ (32 k )) ≤ δ/ .Next, we show that part 3 holds with a probability at least − δ/ . For B a , if it is nottrivial, then the conditions of the second part of Lemma 5.4 hold with conﬁdence parameter δ/ ,since it is a z -linear bin division with z = φ α / ≥

10 log(32 /δ ) /α , and α = r , where r is theratio used in Lemma 5.4. For B b and B c , if they are not trivial, the conditions of the secondpart of Lemma 5.4 with conﬁdence parameter δ/ hold, since they are z -linear bin divisions with z = φ α / k/δ ) /α ≥

10 log(4 · /δ ) /r .Lastly, we show that part 4 holds with a probability at least − δ/ . Fix i ∈ [ k + ] . If | C i ∩ P | ≤ log (8 k + /δ ) , then the statement of part 4 trivially holds. Assume that | C i ∩ P | > log (8 k + /δ ) .Points from C i ∩ P are observed in a random order in the stream. The fraction of points in C ∩ P that are closer to c i than at least half of the points in C ∩ P is at least half. Therefore, everydraw from C i ∩ P has a probability of at least half of satisfying this requirement, conditioned onall previous points not satisfying this requirement. The probability that none of the points out ofthe ﬁrst log (8 k + /δ ) satisﬁes this requirement is thus at most −⌈ log (8 k + /δ ) ⌉ ≤ δ/ (8 k + ) . A unionbound over i ∈ [ k + ] gives the desired bound.By a union bound over all the parts of the event, it hold with a probability at least − δ/ . D Proving the main lemmas of Theorem 5.1

In this section, we prove all the lemmas stated in Section 5.3.17 .1 Bounding the risk of T α on large clusters In this section, we prove Lemma 5.7, which bounds R ( C α large , T α ) .For a given x ∈ X , denote the center closest to it in T α by c α ( x ) . Recall that B a is a ( φ α / -linear bin division of X with respect to OPT . To prove Lemma 5.7, we ﬁrst bound the risk of T α on large clusters, using an expression that depends on the optimal risk and on the points in theﬁrst phase of EstSelect . Lemma D.1. If E holds and n ≥ φ α / , then R ( C α large , T α ) ≤ R ( C α large , OPT) + 4 α ( R ( P \ B a (1) , OPT) + R ( P , T α )) . Proof.

For any x ∈ X and any i ∈ [ k ] , we have ρ ( x, c α ( x )) ≤ ρ ( x, c α ( c ∗ i )) ≤ ρ ( x, c ∗ i ) + ρ ( c ∗ i , c α ( c ∗ i )) . Therefore, R ( C α large , T α ) = X i ∈ I α large X x ∈ C ∗ i ρ ( x, c α ( x )) ≤ X i ∈ I α large X x ∈ C ∗ i ρ ( x, c ∗ i ) + X i ∈ I α large X x ∈ C ∗ i ρ ( c ∗ i , c α ( c ∗ i )) ≤ R ( C α large , OPT) + X i ∈ I α large | C ∗ i | · ρ ( c ∗ i , c α ( c ∗ i )) . (7)We now bound ρ ( c ∗ i , c α ( c ∗ i )) . Deﬁne A i := C ∗ i ∩ P \ B a (1) (we show below that A i = ∅ ). Fromthe deﬁnition of c α , we have ∀ x ∈ A i , ρ ( c ∗ i , c α ( c ∗ i )) ≤ ρ ( c ∗ i , c α ( x )) ≤ ρ ( c ∗ i , x ) + ρ ( x, c α ( x )) . Hence, ρ ( c ∗ i , c α ( c ∗ i )) ≤ | A i | X x ∈ A i ( ρ ( c ∗ i , x ) + ρ ( x, c α ( x ))) = 1 | A i | ( R ( A i , { c ∗ i } ) + R ( A i , T α )) . Denote l i := | C ∗ i | and l ′ i := | A i | . Then, by Eq. (7), R ( C α large , T α ) ≤ R ( C α large , OPT) + X i ∈ I α large l i l ′ i ( R ( A i , { c ∗ i } ) + R ( A i , T α )) . (8)We now upper bound l i /l ′ i . Since n ≥ φ α / , we have that B a (1) is non-trivial. Since E holds, B a (1) and { C ∗ i } i ∈ I α large are well-represented in P for X . In addition, | P | = α | X | . Hence, l ′ i ≡ | C ∗ i ∩ P \ B a (1) | ≥ | C ∗ i ∩ P | − |B a (1) ∩ P | ≥ α | C ∗ i | / − (3 / α |B a (1) | . (9)By the deﬁnition of α -large optimal clusters, for all i ∈ I α large , φ α ≤ | C i ∗ | ≡ l i . Since B a is a ( φ α / -linear bin division of X , by property 2 of linear bin divisions, we have |B a (1) | ≤ · φ α = φ α / ≤ l i / .Hence, by Eq. (9), l ′ i ≥ αl i / − (3 / α · ( l i / ≥ αl i / . Therefore, l i /l ′ i ≤ /α . Note that this alsoimplies that A i is non-empty. Thus, from Eq. (8), we have18 ( C α large , T α ) ≤ R ( C α large , OPT) + 4 α (cid:16) R ( P ∩ C α large \ B a (1) , OPT)+ R ( P ∩ C α large \ B a (1) , T α ) (cid:17) . Since P ∩ C α large \ B a (1) ⊆ P ∩ B a (1) , this leads to the statement of the lemma.In the next lemma, we bound R ( P , T α ) , which appears in the RHS of the inequality given inLemma D.1. In the proof of this lemma, the importance of selecting k + centers for T α instead ofonly k becomes apparent, as it allows deriving an upper bound on the risk of T α that disregardsthe outliers in B (1) . Lemma D.2. If E holds and n ≥ φ α / , then R ( P , T α ) ≤ βR ( P \ B a (1) , OPT) .Proof.

Denote the optimal k + -median solution for P with centers from the entire X by OPT k + P := argmin T ⊆ X, | T |≤ k + R ( P , T ) . Denote the optimal k + -median solution for P with centers from P by ] OPT k + P := argmin T ⊆ P , | T |≤ k + R ( P , T ) . It is well known (see, e.g., Guha et al., 2000) that for any set P in any metric space ( X, ρ ) , therisk ratio between the solution that uses only centers from P and the solution that uses the entiremetric space is bounded by : R ( P , ] OPT k + P ) ≤ R ( P , OPT k + P ) . Now, the black-box algorithm A which outputs T α in the ﬁrst phase of EstSelect is a β -approximation oﬄine algorithm. Therefore,we have R ( P , T α ) ≤ βR ( P , ] OPT k + P ) ≤ βR ( P , OPT k + P ) . (10)We bound R ( P , OPT k + P ) by the risk of the clustering deﬁned by OPT , with additional centersthat are the furthest from

OPT in P . Recall that B a is a ( φ α / -linear bin division of X withrespect to OPT . Deﬁne A := B a (1) ∩ P . By property 4 of linear bin divisions, the points in A arethe furthest from OPT in P . We now bound the size of A . Since E holds and n ≥ φ α / , B a (1) isnon-trivial and well represented in P for X . Therefore, | A | ≤ α |B a (1) | . In addition, by property2 of linear bin divisions, |B a (1) | ≤ · φ α ≤ φ α / . Therefore, | A | ≤ αφ α / ≤

38 log(32 k/δ ) . It follows that | OPT ∪ A | ≤ k + 38 log(32 k/δ ) ≡ k + . Thus, by the deﬁnition of OPT k + P , R ( P , OPT k + P ) ≤ R ( P , OPT ∪ A ) = R ( P \ A, OPT ∪ A )= R ( P \ B a (1) , OPT ∪ A ) ≤ R ( P \ B a (1) , OPT) . Combining the inequality above with Eq. (10), we get the statement of the lemma.Lemma D.1 and Lemma D.2 can be combined to get R ( C α large , T α ) ≤ R ( C α large , OPT) + 4 α (2 β + 1) R ( P \ B a (1) , OPT) . (11)The proof of Lemma 5.7 is now almost immediate.19 roof of Lemma 5.7. We distinguish between two cases based on the stream length n . First, con-sider the case n < φ α / . In this case, all optimal clusters are smaller than φ α / , thus they areall α -small optimal clusters. Hence, C α large = ∅ , and so R ( C α large , T α ) = 0 , which trivially satisﬁesthe required upper bound.Now, suppose that n ≥ φ α / . Thus, B a is non-trivial. Since E holds, it follows that B a iswell-represented in P for X . Hence, ∀ i ∈ [ L ] , | P ∩ B ( i ) | ≤ α |B ( i ) | . We now apply Lemma 5.5with W := X, T := OPT , A := P , B := B a , and r := α , we have R ( P \ B a (1) , OPT) ≤ αR ( X, OPT) . Combining this with Eq. (11), we get R ( C α large , T α ) ≤ R ( X, OPT) + 9(2 β + 1) R ( X, OPT) ≤ (18 β + 10) R ( X, OPT) , As claimed.

D.2 Upper-bounding the risk estimate ψ In Lemma 5.7, we showed that the risk of T α on large optimal clusters is bounded by a constantfactor over the optimal risk. In the second phase, EstSelect calculates an estimate of this risk,deﬁned as ψ := α R α ( k +1) φ α ( P , T α ) . In this section, we prove Lemma 5.8, which shows that theremoval of α ( k + 1) φ α outliers from the risk estimate indeed guarantees that ψ is upper boundedby the risk of T α on large clusters. Proof of Lemma 5.8.

We distinguish between two cases, based on the size of the stream. If n < k + 1) φ α , then | P | < α ( k + 1) φ α . In this case, R α ( k +1) φ α ( P , T α ) = 0 , thus ψ = 0 and theupper bound trivially holds.Now, consider the case n ≥ k + 1) φ α . In this case, | X \ P | ≥ (1 − α ) n ≥ ( k + 1) φ α . Recallthat F b ≡ far kφ α ( X \ P , T α ) , and B b is a ( φ α / -linear bin division of ( X \ P ) \ F b with respectto T α . F b is non-trivial, since | X \ P | ≥ kφ α . Similarly, B b is non-trivial, since | ( X \ P ) \ F b | ≥ ( k + 1) φ α − kφ α ≥ φ α / .Eq. (4), given in Section 5.3, states that R kφ α ( X, T α ) ≤ R ( C α large , T α ) . Hence, to prove thelemma, it suﬃces to prove that ψ ≤ R kφ α ( X, T α ) . In the rest of the proof we derive this inequality.We ﬁrst show that (F b ∪ B b (1)) ∩ P ⊆ far α ( k +1) φ α ( P , T α ) . This would imply that the pointsin F b ∪ B b (1) do not contribute to the risk R α ( k +1) φ α ( P , T α ) , which is used in the calculation of ψ . Since E holds, F b and B b (1) (which are both non-trivial) are well-represented in P for X \ P .We also have | P | / | X \ P | = α/ (1 − α ) ≤ α , where the last inequality follows since α ≤ / .Therefore, | F b ∩ P | ≤ · α | F b | = 2 αkφ α , and similarly, |B b (1) ∩ P | ≤ α |B b (1) | ≤ αφ α , wherethe last inequality follows from property 2 of linear bin divisions, since (5 / · ( φ α / ≤ φ α . Itfollows that | (F b ∪ B b (1)) ∩ P | ≤ | F b ∩ P | + |B b (1) ∩ P | ≤ α ( k + 1) φ α . By the deﬁnitions of F b and B b , and by property 4 of linear bin division, we have that the pointsin F b ∪ B b (1) are the furthest points from T α in X \ P . Thus, (F b ∪ B b (1)) ∩ P are the furthestpoints from T α in P . Therefore, (F b ∪ B b (1)) ∩ P ⊆ far α ( k +1) φ α ( P , T α ) , as we wished to show.This implies that R α ( k +1) φ α ( P , T α ) ≤ R ( P \ (F b ∪ B b (1)) , T α ) . (12)20ow, since B b is well-represented in P for X \ P , denoting the number of bins in B b by L , wehave that for all i ∈ [ L ] , |B b ( i ) ∩ P | / |B b ( i ) | ≤ α . Since B b is a partition of ( X \ P ) \ F b , we havethat P ∩ B b ( i ) = ( P \ F b ) ∩ B b ( i ) .Therefore, we can apply Lemma 5.5 with W := ( X \ P ) \ F b , B := B b , T := T α , A := P \ F b ,and r := 2 α . It follows that R (( P \ F b ) \ B b (1) , T α ) ≤ αR (( X \ P ) \ F b , T α ) = 3 αR kφ α ( X \ P , T α ) . (13)The last equality follows from the deﬁnition of F b . By combining Eq. (12) and Eq. (13), we getthat R α ( k +1) φ α ( P , T α ) ≤ αR kφ α ( X \ P , T α ) . Plugging in the deﬁnition of ψ , this implies that ψ ≤ R kφ α ( X \ P , T α ) ≤ R kφ α ( X, T α ) , as required. D.3 Bounding the number of selected far points

In this section, we prove Lemma 5.9, which bounds the number of points that are selected as centersand their distance from T α is more than ψ/ ( kτ ) . We ﬁrst prove two auxilliary lemmas. The ﬁrstlemma provides a relationship between the risks of consecutive bins. Lemma D.3.

Let L be the number of bins in B c . If E holds, then ∀ i ∈ [ L − , R ( B c ( i ) ∩ P , T α ) ≥ α R ( B c ( i + 1) , T α ) . Proof.

Fix i ∈ [ L − , and deﬁne b := max x ∈B c ( i +1) ρ ( x, T α ) . By property 4 of linear bin divisions, ρ ( x, T α ) ≥ b for all x ∈ B c ( i ) . Therefore, R ( B c ( i ) ∩ P , T α ) ≥ |B c ( i ) ∩ P | · b, and R ( B c ( i + 1) , T α ) ≤ |B c ( i + 1) | · b. It follows that R ( B c ( i ) ∩ P , T α ) R ( B c ( i + 1) , T α ) ≥ |B c ( i ) ∩ P ||B c ( i + 1) | . Thus, to prove the claim, it suﬃces to lower-bound the RHS by α/ . Since E holds, B c is well-represented in P for X \ P . Since | P | / | X \ P | = α/ (1 − α ) ≥ α , it follows that |B c ( i ) ∩ P | ≥ ( α/ |B c ( i ) | .By property 3 of linear bin divisions, |B c ( i ) | ≥ |B c ( i +1) | . Therefore, |B c ( i ) ∩ P | ≥ ( α/ |B c ( i +1) | ,which proves the claim.The next step is to prove a lower bound on the risk estimate ψ calculated in EstSelect . Recallthat ψ := α R α ( k +1) φ α ( P , T α ) . Lemma D.4. If E holds, then ψ ≥ R k +1) φ α ( X \ P , T α ) .Proof. First, note that if | X \ P | ≤ k + 1) φ α , then R k +1) φ α ( X \ P , T α ) = 0 , and so thelemma trivially holds. We henceforth assume that | X \ P | > k + 1) φ α . Recall that F c ≡ far k +1) φ α ( X \ P , T α ) , and B c is a ( φ α / -linear bin division of ( X \ P ) \ F c with respect to T α .Under the assumption, | X \ P | > k + 1) φ α , thus F c is non-trivial. Similarly, B c is non-trivial,since | ( X \ P ) \ F c | ≥ k + 1) φ α − k + 1) φ α > φ α / .To prove the lemma, we show the following inequality: R α ( k +1) φ α ( P , T α ) ≥ α R k +1) φ α ( X \ P , T α ) . (14)21his would immediately imply that ψ ≡ α R α ( k +1) φ α ( P , T α ) ≥ R k +1) φ α ( X \ P , T α ) , as re-quired.Denote by L the number of bins in B c . Deﬁne e X := L [ i =2 B c ( i ) ≡ ( X \ P ) \ (F c ∪ B c (1)) . See Figure 2 for an illustration of the sets used in this proof. To prove Eq. (14), we show thefollowing string of inequalities: R α ( k +1) φ α ( P , T α ) ≥ R ( P \ F c , T α ) ≥ α R ( e X, T α ) ≥ α R k +1) φ α ( X \ P , T α ) . (15) F c B c (1) B c (2) B c (3) B c ( L ) (cid:1) XX \ P Figure 2: The sets used in the proof of Lemma D.4. The horizontal axis represents the distance ofthe points in the set from T α , in decreasing order.First, we prove the ﬁrst inequality in Eq. (15) by removing far points from P . Recall that F c := far k +1) φ α ( X \ P , T α ) are the furthest points from T α in X \ P . Since E holds, F c iswell-represented in P for X \ P . Since | P | / | X \ P | = α/ (1 − α ) ≥ α , this implies that | P ∩ F c | ≥ α | F c | = 2( k + 1) φ α α. Hence, the k +1) φ α α furthest points from T α in P are in F c . Formally, far k +1) φ α α ( P , T α ) ⊆ F c .Therefore, R k +1) φ α α ( P , T α ) ≥ R ( P \ F c , T α ) , which proves the ﬁrst inequality of Eq. (15).Next, we prove the last inequality of Eq. (15). By property 2 of linear bin divisions, we have |B c (1) | ≤ · φ α ≤ φ α . Therefore, | F c ∪ B c (1) | ≤ | F c | + |B c (1) | ≤ k + 1) φ α + φ α ≤ k + 1) φ α . In addition, by the deﬁnition of F c and property 4 of linear bin divisions, the points in F c ∪ B c (1) are the furthest from T α out of the points in X \ P . Hence, F c ∪ B c (1) ⊆ far k +1) φ α ( X \ P , T α ) . Itfollows that e X ⊇ ( X \ P ) \ far k +1) φ α ( X \ P , T α ) . Therefore, R ( e X, T α ) ≥ R k +1) φ α ( X \ P , T α ) ,which proves the last inequality in Eq. (15).To complete the proof, we have left to prove the second inequality of Eq. (15), which states that R ( P \ F c , T α ) ≥ α R ( e X, T α ) . We have R ( P \ F c , T α ) = X i ∈ [ L ] R ( B c ( i ) ∩ ( P \ F c ) , T α ) = X i ∈ [ L ] R ( B c ( i ) ∩ P , T α ) . (16)22he last equality follows since B is a partition of ( X \ P ) \ F c , thus B ( i ) ∩ F c = ∅ . By Lemma D.3,since E holds, ∀ i ∈ [ L − , R ( B c ( i ) ∩ P , T α ) ≥ α R ( B c ( i + 1) , T α ) . It follows that R ( P \ F c , T α ) ≥ X i ∈ [ L − R ( B c ( i ) ∩ P , T α ) ≥ α L X i =2 R ( B c ( i ) , T α ) = α R ( e X, T α ) . The last equality follows from the deﬁnition of e X .This completes the proof of the second inequality of Eq. (15), thus proving the lemma.Based on the lemma above, Lemma 5.9 can now be proved. Proof of Lemma 5.9.

Denote the set of points in P that exceed the distance threshold used in EstSelect by T := { x ∈ P | ρ ( x, T α ) > ψ/ ( kτ ) } . Then N = |T | . Recall that F d := far k +1) φ α ( X \ P , T α ) . We have N ≤ |T \ F d | + | P ∩ F d | . Toupper bound |T \ F d | , note that from the deﬁnition of F d and T , R k +1) φ α ( X \ P , T α ) = R (( X \ P ) \ F d , T α ) ≥ R ( T \ F d , T α ) ≥ |T \ F d | · ψ/ ( kτ ) . On the other hand, by Lemma D.4, ψ ≥ R k +1) φ α ( X \ P , T α ) . It follows that |T \ F d | ≤ kτ .To upper bound | P ∩ F d | , consider ﬁrst the case where F d is trivial. In this case, | X \ P | < k + 1) φ α , thus | P ∩ F d | ≤ | P | = | X \ P | · ( | P | / | X \ P | ) ≤ k + 1) φ α · γ − α . If F d is non-trivial, then since E holds, F d is well-represented in P for X \ P . Since we have | P | / | X \ P | = γ − α , it follows that | P ∩ F d | ≤ γ − α | F d | = 32 γ − α · k + 1) φ α < k + 1) φ α γ − α . Therefore, in both cases, | P ∩ F d | ≤ k + 1) φ α γ − α . Combining this with the upper bound on |T \ F d | , the statement of the lemma follows. D.4 Risk approximation for small and large optimal clusters

In this section, we prove Lemma 5.10 and Lemma 5.11. The former lemma upper bounds the riskof the selected centers on optimal clusters whose center appears in the third stage, and is used forbounding the risk of points in small optimal clusters. The latter lemma upper bounds the risk of theselected centers on large optimal clusters, under an appropriate setting of algorithm parameters.First, we provide an auxiliary lemma, which will be used to prove both of these lemmas. Itshows that all points appearing in the last phase are close to some center in the set of selectedcenters.

Lemma D.5.

For all x ∈ P , ρ ( x, T out ) ≤ ψ/ ( kτ ) . roof. If x is selected by Alg. 1, then x ∈ T out , so ρ ( x, T out ) = 0 . Thus, in this case, the lemmatrivially holds. Now, suppose that x is not selected as a center. Recall that T α = { c , . . . , c k + } ,and denote i := argmin i ∈ [ k + ] ρ ( x, c i ) . Due to the selection conditions on line 12 of Alg. 1, wehave that, since x was not selected, ρ ( x, c i ) ≤ ψ/ ( kτ ) and Near i = TRUE when x is observed.From the latter, it follows that some y ∈ C i was selected such that ρ ( c i , y ) ≤ ψ/ ( kτ ) . Therefore, ρ ( x, T out ) ≤ ρ ( x, y ) ≤ ρ ( x, c i ) + ρ ( c i , y ) ≤ ψ/ ( kτ ) .The proof of Lemma 5.10 is now almost immediate. Proof of Lemma 5.10.

Fix i ∈ [ k ] . Observe that by the triangle inequality, R ( C ∗ i , T out ) ≤ R ( C ∗ i , { c ∗ i } ) + | C ∗ i | ρ ( c ∗ i , T out ) . By the assumption of the lemma, c ∗ i ∈ P . Applying Lemma D.5 to x := c ∗ i , we get that ρ ( c ∗ i , T out ) ≤ ψ/ ( kτ ) . Combined with the inequality above, this proves the lemma.We now turn to prove Lemma 5.11, which upper bounds the risk of the selected centers on largeoptimal clusters. First, we relate the risk of points from α -large optimal clusters in the third phaseto the risk of the clustering T α calculated in the ﬁrst phase. Lemma D.6. If E holds, τ = φ α and M = log(8 k + /δ ) , then R ( C α large ∩ P , T out ) ≤ R ( C α large , T α ) . Proof.

Recall that C , . . . , C k + are the clusters induced by the centers in T α . We distinguish between light clusters and heavy clusters . Light clusters are those that include many points from smalloptimal clusters in the third phase. Formally, the set of light clusters is deﬁned as Lt := { i ∈ [ k + ] | | C i ∩ P ∩ C α small | / | C i ∩ P | ≥ / } . Heavy clusters are non-light clusters, denoted Hv := [ k + ] \ Lt . We have R ( C α large ∩ P , T out ) ≤ X i ∈ Lt R ( C i ∩ P , T out ) + X i ∈ Hv R ( C i ∩ P ∩ C α large , T out ) . (17)We bound each of these terms separately.For light clusters, we show that these clusters are relatively small, and bound the risk contributedby each point in the cluster. The total number of points in light clusters is X i ∈ Lt | C i ∩ P | ≤ X i ∈ Lt | C i ∩ P ∩ C α small | ≤ | C α small | ≤ kφ α . (18)Moreover, for each x ∈ C i ∩ P for i ∈ Lt , we have by Lemma D.5 that ρ ( x, T out ) ≤ ψ/ ( kτ ) . Bythe assumptions of the lemma, τ := φ α . Hence, combining with Eq. (18), we get X i ∈ Lt R ( C i ∩ P , T out ) ≤ ψkφ α X i ∈ Lt | C i ∩ P | ≤ ψ ≤ R ( C α large , T α ) . (19)The last inequality follows from Lemma 5.8.For heavy clusters, let i ∈ Hv . Denote r i := ρ ( c i , T out ) . By the triangle inequality, R ( C i ∩ P ∩ C α large , T out ) ≤ R ( C i ∩ P ∩ C α large , { c i } ) + | C i ∩ P ∩ C α large | · r i . R i := { x ∈ C i ∩ P | ρ ( x, c i ) ≥ r i } . To upper bound r i , note that R ( C i ∩ P ∩ C α large , { c i } ) ≥ | R i ∩ C α large | · r i . Therefore, r i ≤ R ( C i ∩ P ∩ C α large , { c i } ) | R i ∩ C α large | . It follows that R ( C i ∩ P ∩ C α large , T out ) ≤ R ( C i ∩ P ∩ C α large , { c i } ) · | C i ∩ P ∩ C α large || R i ∩ C α large | ! . (20)We now lower-bound | R i ∩ C α large | . Alg. 1 selects the ﬁrst M = log(8 k + /δ ) points from cluster C i observed in the third phase. From part 4 of the event E , we have that at least one of those points,call it x ∗ , is closer to c i than at least half of the points in C i ∩ P . Formally, ρ ( c i , x ∗ ) ≤ ρ ( c i , x ) forat least half of the points x ∈ C i ∩ P . Since x ∗ ∈ T out , we have r i ≤ ρ ( c i , x ∗ ) , and so r i ≤ ρ ( c i , x ) for at least half of the points x ∈ C i ∩ P . It follows that | R i | ≥ | C i ∩ P | / .On the other hand, since i ∈ Hv , we have | R i ∩ C α small | ≤ | C i ∩ P ∩ C α small | ≤ | C i ∩ P | / . Therefore, | R i ∩ C α large | = | R i | − | R i ∩ C α small | ≥ | C i ∩ P | / ≥ | C i ∩ P ∩ C α large | / . Combined withEq. (20), we get ∀ i ∈ Hv , R ( C i ∩ P ∩ C α large , T out ) ≤ R ( C i ∩ P ∩ C α large , c i ) . It follows that P i ∈ Hv R ( C i ∩ P ∩ C α large , T out ) ≤ R ( C α large , T α ) . Combining this with Eq. (17) andEq. (19), we get the statement of the lemma.Based on the lemma above, the proof of Lemma 5.11 can be now provided. Proof of Lemma 5.11.

To bound R ( C α large , T out ) , we bound R ( C α large ∩ P , T out ) and R ( C α large ∩ ( P ∪ P ) , T out ) separately. The former is bounded using Lemma D.6, which gives R ( C α large ∩ P , T out ) ≤ R ( C α large , T α ) . To bound R ( C α large ∩ ( P ∪ P ) , T out ) , we ﬁrst show that at least half of the points in each α -largeoptimal cluster C ∗ i are in P . Fix i ∈ I α large . Since E holds, C ∗ i is well-represented in P ∪ P for X .Since | P ∪ P | / | X | = 2 α and α ≤ / , it follows that | C ∗ i ∩ ( P ∪ P ) | / | C ∗ i | ≤ · α ≤ α ≤ / .Since γ = 1 − α , we have P ∪ P ∪ P = X , hence | C ∗ i ∩ P | / | C ∗ i | ≥ / . Now, deﬁne an injection µ : P ∪ P → P such that x ∈ C ∗ i if and only if µ ( x ) ∈ C ∗ i . Such an injection exists since most ofthe points in each α -large optimal cluster are in P . For each i ∈ I α large and each x ∈ C ∗ i ∩ ( P ∪ P ) ,we have ρ ( x, T out ) ≤ ρ ( x, c ∗ i ) + ρ ( c ∗ i , µ ( x )) + ρ ( µ ( x ) , T out ) . (21)Now, since µ is an injection, we have that for any T ⊆ X , R ( { µ ( x ) | x ∈ C ∗ i ∩ ( P ∪ P ) } , T ) ≤ R ( C ∗ i ∩ P , T ) . Summing Eq. (21) over x ∈ C ∗ i ∩ ( P ∪ P ) , it follows that R ( C ∗ i ∩ ( P ∪ P ) , T out ) ≤ R ( C ∗ i ∩ ( P ∪ P ) , { c ∗ i } ) + R ( C ∗ i ∩ P , { c ∗ i } ) + R ( C ∗ i ∩ P , T out )= R ( C ∗ i , OPT) + R ( C ∗ i ∩ P , T out ) . i ∈ I α large , we get R ( C α large ∩ ( P ∪ P ) , T out ) ≤ R ( C α large , OPT) + R ( C α large ∩ P , T out ) . It follows that R ( C α large , T out ) ≤ R ( C α large ∩ ( P ∪ P ) , T out ) + R ( C α large ∩ P , T out ) ≤ R ( C α large , OPT) + 2 R ( C α large ∩ P , T out ) . Combined with the upper bound on R ( C α large ∩ P , T out ))