Consistent k -Median: Simpler, Better and Robust
Xiangyu Guo∗, Janardhan Kulkarni†, Shi Li‡, Jiayi Xian§

Abstract
In this paper we introduce and study the online consistent k-clustering with outliers problem, generalizing the non-outlier version of the problem studied in Lattanzi-Vassilvitskii [18]. We show that a simple local-search based online algorithm can give a bicriteria constant approximation for the problem with O(k² log²(nD)) swaps of medians (recourse) in total, where D is the diameter of the metric. When restricted to the problem without outliers, our algorithm is simpler, deterministic and gives better approximation ratio and recourse, compared to that of Lattanzi-Vassilvitskii [18].

Introduction

Clustering is one of the most fundamental primitives in unsupervised machine learning, and k-median clustering is one of the most widely used primitives in practice. The input to the problem consists of a set C of n points, a set F of potential median locations, and a metric d : (C ∪ F) × (C ∪ F) → ℝ≥0. The goal is to choose a subset S ⊆ F of cardinality at most k so as to minimize ∑_{j∈C} d(j, S), where d(j, S) := min_{i∈S} d(j, i) is the distance from j to its nearest chosen median. The problem is known to be NP-hard, and several constant factor approximation algorithms are known for it [4, 16, 1, 19, 3].

In many real world applications, the data points arrive over time in an online fashion. For example, images, videos, and documents get added over time, and clustering algorithms in such applications need to assign a label (or a median) to each newly added point in an online fashion. A natural framework to study these online clustering problems is competitive analysis, where the goal is to assign each arriving data point irrevocably to an existing cluster or start a new cluster containing the point. Unfortunately, the competitive analysis framework is too strong, and it is provably impossible to maintain a good quality clustering of data points if one insists on irrevocable decisions [20].
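For concreteness, the k-median objective defined above can be evaluated directly from its definition; a minimal sketch (the toy point sets and distance function stand in for C, S and d, and are ours, not from the paper):

```python
def kmedian_cost(clients, medians, dist):
    """k-median cost of a solution: each client pays the distance
    to its nearest chosen median."""
    return sum(min(dist(j, i) for i in medians) for j in clients)

# Toy example on the line metric: clients at 0..4, medians at 0 and 4.
cost = kmedian_cost(range(5), [0, 4], lambda a, b: abs(a - b))
print(cost)  # 4  (distances 0 + 1 + 2 + 1 + 0)
```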
Recently, Lattanzi and Vassilvitskii [18] observed that in many applications the decisions need not be irrevocable; however, the online algorithm should not do too many re-clustering operations. Motivated by such settings, they initiated the study of the consistent k-clustering problem. The goal in consistent k-clustering is twofold:

• Quality: Guarantee at all times that we have a clustering of the points that is a good approximation to the optimum one.

• Consistency:
The chosen medians should be stable and not change too frequently over the sequence of data point insertions.

Lattanzi and Vassilvitskii [18] measured the number of changes to the set of chosen medians using the notion of recourse, a concept also studied in online algorithms [13, 14, 2]. The total recourse of an online algorithm is defined as the number of changes it makes to the solution. Specifically, for the k-median problem, if S_t denotes the set of chosen medians at time t and S_{t+1} that at time t + 1, then

∗ Department of Computer Science and Engineering, University at Buffalo, [email protected]
† The Algorithms Group, Microsoft Research, Redmond, [email protected]
‡ Department of Computer Science and Engineering, University at Buffalo, [email protected]
§ Department of Computer Science and Engineering, University at Buffalo, [email protected]
Preprint. Under review.

the recourse at time step t + 1 is |S_{t+1} \ S_t|. The total recourse of an online algorithm is the sum of the recourse over all time steps. An online algorithm with small recourse ensures that the chosen medians do not change too frequently and hence is consistent. In particular, it forbids an algorithm from simply recomputing the solution from scratch at each time step. This is a very desirable property of a clustering algorithm in applications, as we do not want to change the labels assigned to data points (which correspond to cluster centers) as the data set keeps growing. Broadly speaking, recourse is also a measure of the stability of an online algorithm. Lattanzi and Vassilvitskii [18] showed that one can maintain an O(1) approximation to the k-median problem with O(k² log⁴ n) total recourse. More recently, Cohen-Addad et al. [9] studied facility location (and clustering problems) from the perspective of both dynamic and consistent clustering frameworks. See the related work section for more details.

A drawback of using k-median clustering on real-world data sets is that it is not robust to noisy data, i.e., a few outliers can completely change the cost as well as the structure of solutions. Recognizing this shortcoming, Charikar et al. [5] introduced a robust version of the k-median problem called k-median with outliers. The problem is similar to the k-median problem except for one crucial difference: an algorithm for k-median with outliers does not need to cluster all the points but can choose to ignore a small fraction of the input points. The number of points an algorithm can ignore is given as a part of the input, and is typically set to be a small fraction of the overall input.

Formally, in the k-median with outliers (k-Med-O) problem, we are given F, C, d and k as in the k-median problem. Additionally, we are given an integer z ≤ n = |C|.
The goal is to choose a set S ⊆ F of k medians so as to minimize min_{O⊆C:|O|=z} ∑_{j∈C\O} d(j, S). The points in O are called outliers and are not counted in the cost of the solution S; thus the parameter z specifies the number of outliers. Notice that when S is given, the set O that minimizes ∑_{j∈C\O} d(j, S) can be computed easily: it contains the z points j ∈ C with the largest d(j, S) value. Therefore, for convenience, we shall simply use a set S ⊆ F of size k to denote a solution to a k-Med-O instance. The k-Med-O problem not only is a more robust objective but also helps in removing outliers, a very important issue in real world datasets [23, 7]. In fact, such a joint view of clustering and outlier elimination has been observed to be more effective, and has attracted significant attention both in theory and in practice [8, 7, 15, 24, 17].

In this paper, we study the k-Med-O problem in the online consistent k-clustering framework of Lattanzi and Vassilvitskii. The goal is to maintain a good quality (approximate) solution to the problem at all times while minimizing the total recourse of the online algorithm. (The total recourse is still defined as ∑_t |S_t \ S_{t−1}|.) Though O(1)-approximation algorithms for k-Med-O are known in the offline setting [8, 17], it seems hard to extend these algorithms to the online setting. Instead, we resort to bicriteria approximate solutions for the k-Med-O problem:

Definition 1.
We say a solution S ⊆ F of k medians is a (β, α)-bicriteria approximation to the k-median with outliers instance (F, C, d, k, z), for some α, β ≥ 1, if there exists a set O ⊆ C of size at most βz such that ∑_{j∈C\O} d(j, S) ≤ α · opt, where opt is the cost of the optimum solution for the instance with z outliers.

So, a (β, α)-approximate solution removes at most βz outliers and has cost at most α times the cost of the optimum solution with z outliers.

Online Model for k-Median with Outliers

We now describe the online model for the k-Med-O problem. Recall that a k-Med-O instance is given by F, C, d, k and z. As in [18], we assume k is given at the beginning of the algorithm, and C and d are given online. We use n to denote the total number of clients that will arrive. (One can also define the recourse as |S_{t+1} \ S_t| + |S_t \ S_{t+1}|, but if we assume |S_t| = |S_{t+1}| = k, this is exactly 2 · |S_{t+1} \ S_t|.)

Depending on how F is given, we have two slightly different online settings:

• In the static F setting, we assume F is independent of C and is given at the beginning of the online algorithm. In each time step, one point in C arrives and its distances to F are revealed. (It is easy to see that in the k-Med-O problem, only distances between F and C are relevant.)

• In the F = C setting, we assume we always have F = C. Whenever a point arrives, its distances to previously arrived points are revealed, and the point is then added to both C and F.

The F = C setting is more natural for clustering applications and is the one used in [18]. On the other hand, the static F setting arises in applications where we want to build k facilities to serve a set C of clients that arrive one by one. In these applications, the set F of potential locations to build facilities is independent of C and often does not change over time.
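Since, as noted above, the optimal outlier set for a fixed median set S is simply the z clients farthest from S, the k-Med-O objective of a candidate solution is easy to evaluate; a small illustrative sketch (the function and toy instance are ours, not from the paper):

```python
def kmedo_cost(clients, medians, dist, z):
    """k-median-with-outliers cost of a median set: drop the z clients
    farthest from the medians and sum the remaining distances."""
    d = sorted(min(dist(j, i) for i in medians) for j in clients)
    return sum(d[:len(d) - z])  # keep all but the z largest distances

# Line metric: clients 0..5, one median at 0, one outlier allowed;
# the farthest client (at 5) is discarded.
print(kmedo_cost(range(6), [0], lambda a, b: abs(a - b), 1))  # 10
```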
The analysis of our algorithm works directly for the static F setting, but needs a small twist in the F = C setting.

It remains to describe how z is given. For simplicity, we assume z is fixed and given at the beginning of the algorithm; we call this the static z setting. In a typical application, z may increase as more and more points arrive; we call this the incremental z setting. We can reduce the incremental z setting to the static z setting in the following way. We maintain an integer z′ ∈ [z, (1 + ε)z) and use z′ as the given number of outliers. This will incur a factor of (1 + ε) in the first factor of the bicriteria approximation. During our algorithm, whenever z becomes more than z′, we update z′ to ⌊(1 + ε)z⌋. We define an epoch to be a maximal period of time steps with the same z′ value; so within an epoch, the z′ value does not change. The number of epochs is at most O(log_{1+ε} n) = O(log n / ε). Thus, if we have an online (β, α)-approximation algorithm for k-Med-O with total recourse R in the static z setting, we can obtain a ((1 + ε)β, α)-approximation algorithm with total recourse O(R log n / ε) in the incremental z setting. Thus throughout the paper, we only focus on the static z setting, that is, z is fixed and given at the beginning of the algorithm.

Our Results
The main contribution of the paper is the following. Recall that n is the total number of points that will arrive during the whole algorithm. We assume all distances are integers and define D to be the diameter of the metric d.

Theorem 2.
There is a deterministic (O(1), O(1))-bicriteria approximation algorithm for the online k-median with outliers problem with a total recourse of O(k² log n log(nD)).

When restricted to the case without outliers (i.e., z = 0), our algorithm gives the following.

Theorem 3.
There is a deterministic O(1)-approximation algorithm for the consistent k-median problem with O(k² log n log(nD)) total recourse.

The recourse achieved by our algorithm is an O(log² n) factor better than the result of Lattanzi and Vassilvitskii [18]. They also showed a lower bound of Ω(k log n) on the total recourse, hence our result also takes a step towards achieving the optimal recourse for this basic problem.

Lemma 6, which appears later, gives a formal statement of the guarantees obtained by our algorithm. In Lemma 6 we prove a more general result, where one can trade off the running time and the approximation factor achieved by our algorithm by fine-tuning certain parameters. In particular, by appropriate tuning of parameters we can achieve a (3 + ε)-approximation in time n^{O(1/ε)}, matching the approximation factor achieved by the local search algorithm in the offline setting, and also improving the unspecified O(1) factor achieved by [18]. Finally, our algorithm is deterministic, while that of [18] is randomized and only succeeds with high probability.

Our Techniques
Unlike many of the previous results on the online k-median problem and the related facility location problem, which are based on Meyerson's sampling procedure [22], our approach is based on local search. When restricted to the k-median without outliers problem, at every time step, it repeatedly applies ρ-efficient swap operations until no such operations exist: these are the swaps that can greatly decrease the cost of the solution (see Definition 4). Via standard analysis, one can show that this gives an O(1)-approximation for the problem. To analyze the total recourse of the algorithm, we establish a crucial lemma showing that the total cost increment due to the arrival of clients is small. Compared to Meyerson's sampling technique, local search has two advantages: (i) the approximation ratio can be made to be 3 + ε, which matches the best offline approximation ratio for k-median based on local search; (ii) local-search based algorithms are deterministic in general. Very recently, similar techniques were used in [12] to derive online algorithms for the related facility location problem. We extend their ideas to the k-median problem, and more importantly, the k-median with outliers problem. (In [18], it is assumed that D = poly(n) and thus O(log(nD)) = O(log n).)

Extending the local search to the outlier setting faces a barrier: the hard constraint that the number of outliers is at most z. To circumvent the barrier, we handle the constraint in a soft manner: we introduce a penalty cost p, and instead of requiring the number of outliers to be at most z, we pay a cost of p for every outlier in the solution. By setting p appropriately, we can ensure that the algorithm does not produce too many outliers, while at the same time maintaining the O(1) approximation ratio. Indeed, in the offline setting, our algorithm gives the first (O(1), O(1))-bicriteria approximation for k-Med-O based on local search.
Prior to our work, in the offline setting, Gupta et al. [15] developed a bicriteria approximation for the problem, but it needs to violate the outlier constraint by a factor of O(k log(nD)). On the other hand, though O(1)-approximation algorithms for k-Med-O were developed in [8] and [17], unlike our local search based algorithm, they are hard to extend to the online setting.

Other Clustering Objectives
We remark that our algorithm and analysis can be easily extended to the k-means objective, and more generally, to the sum of q-th powers of distances for any constant q ≥ 1. However, for cleanness of presentation, we choose to focus only on the k-median objective.

Related work
As we mentioned earlier, Cohen-Addad et al. [9] studied facility location and clustering problems from the perspective of both dynamic and consistent clustering frameworks. In the dynamic setting, data points are both added and deleted, and the emphasis is on maintaining good quality solutions while minimizing the time it takes to update the solutions. For the facility location problem, they gave an O(1) approximation algorithm with almost optimal O(n) total recourse and O(n log n) per step update time. They also extended their algorithm for facility location to the k-median and k-means problems (without outliers), achieving a constant factor approximate solution with Õ(n + k) per step update time. Unfortunately, they do not state the total recourse of their algorithms; to our understanding, the total recourse of their algorithms can be as large as Ω(n). However, they also consider a harder setting where data points are being both inserted and deleted. We believe that finding a consistent k-clustering algorithm for the case when data points are both inserted and deleted, where the emphasis is more on the stability of cluster centers than on the update time, is an important open problem.

For more details regarding clustering problems in the context of dynamic and online algorithms, we refer the readers to [6, 22, 10, 11, 12] and references therein.

All the omitted proofs are given in the supplementary material.

Local Search for k-Median with Outliers

In this section, we describe an offline local search algorithm for k-Med-O that achieves an (O(1), O(1))-bicriteria approximation ratio. To allow trade-offs among the approximation ratio, the number of outliers and the running time, we introduce two parameters: an integer ℓ ≥ 1 and a real number γ > 0. The algorithm gives a ((2 + 2/ℓ)(1 + γ), (3 + 2/ℓ)(1 + 1/γ))-bicriteria approximation in n^{O(ℓ)} time.
In particular, we can set ℓ = γ = Θ(1/ε) to get an approximation ratio of 3 + ε with O(z/ε) outliers in n^{O(1/ε)} time, matching the best approximation ratio for k-median based on local search. To obtain an (O(1), O(1))-bicriteria approximation, it suffices and is convenient to set ℓ = γ = 1. This offline algorithm will serve as the baseline for our online algorithm for k-Med-O.

The main idea behind the algorithm is to convert the problem into the k-median with penalty problem. Compared to k-Med-O, in this problem we are not given the number z of outliers; instead, we are given a penalty cost p ≥ 0 for not connecting a point. Our goal is to choose k medians and connect some points to the k medians so as to minimize the sum of the connection cost and the penalty cost. So, we shall use the parameter p to control the number of outliers in a soft way. Indeed, the k-median with penalty problem is equivalent to the original k-median problem up to a modification of the metric. For every two points u, v ∈ F ∪ C, we define d_p(u, v) := min{d(u, v), p}. Then it is easy to see that the k-median with penalty problem becomes the k-median problem on the metric d_p. For a set S ⊆ F of k medians, we define cost_p(S) := ∑_{j∈C} d_p(j, S) to be the cost of the solution S to the k-median instance with metric d_p, or equivalently, the k-median instance on metric d with per-outlier penalty cost p.

Swap Operations for k-Median with Outliers

Given a set S ⊆ F of k medians and an integer ℓ ≥ 1, an ℓ-swap on S is a pair (A∗, A) of sets of medians such that A ⊆ S, A∗ ⊆ F \ S and |A| = |A∗| ≤ ℓ. Applying the swap operation (A∗, A) on S will update S to S ∪ A∗ \ A. Notice that after the operation S still has size k. We simply say (A∗, A) is a swap on S if it is an ℓ-swap for some ℓ ≥ 1.

Definition 4 (Efficient swaps).
For any ρ, p ≥ 0, a swap (A∗, A) on a solution S ⊆ F, |S| = k, is said to be ρ-efficient w.r.t. the penalty cost p if we have cost_p(S ∪ A∗ \ A) < cost_p(S) − |A| · ρ.

In particular, a 0-efficient swap with respect to some penalty cost p ≥ 0 is a swap whose application on S strictly decreases cost_p(S). The efficiency parameter ρ will be used later in the online algorithm, in which we apply a swap only if it decreases cost_p(S) significantly, in order to guarantee that the recourse of our algorithm is small.

The following theorem can be shown by modifying the analysis for the classic (3 + 2/ℓ)-approximation local search algorithm for k-median [25]. We leave its proof to the supplementary material.

Theorem 5.
Let S and S∗ be two sets of medians with |S| = |S∗| = k. Let p, ρ ≥ 0, and let ℓ ≥ 1 be an integer. If there are no ρ-efficient ℓ-swaps on S w.r.t. the penalty cost p, then we have

cost_p(S) ≤ ∑_{j∈C} min{(3 + 2/ℓ) · d_p(j, S∗), (2 + 2/ℓ) · p} + kρ.

To understand the theorem, we first assume ρ = 0; thus S is a local optimum for the k-median instance defined by the metric d_p. If we replace min{(3 + 2/ℓ) d_p(j, S∗), (2 + 2/ℓ) p} by (3 + 2/ℓ) d_p(j, S∗), then the theorem says that a local optimum solution for k-median is a (3 + 2/ℓ)-approximation, which is exactly the locality gap theorem for k-median. Using that d_p has diameter p, we can obtain the improvement as stated in the theorem; this will be used to give a better trade-off between the two factors in the bicriteria approximation ratio. When ρ > 0, we lose an additive term of kρ on the right side of the inequality.

Theorem 5 immediately gives a ((2 + 2/ℓ)(1 + γ), (3 + 2/ℓ)(1 + 1/γ))-bicriteria approximation algorithm for the k-Med-O problem, for any γ > 0. By binary search, we may assume we know the optimum value opt for the k-Med-O instance. Let p = (3ℓ + 2)opt / (2(ℓ + 1)γz). Then we start from an arbitrary set S of k medians, and repeatedly apply 0-efficient ℓ-swaps w.r.t. the penalty cost p on S, until no such swaps can be found. The running time of the algorithm is n^{O(ℓ)}.
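The offline procedure just described (start from an arbitrary S and apply 0-efficient swaps under the truncated metric d_p until none exists) can be sketched for the single-swap case ℓ = 1; this is a toy illustration with our own helper names, not the paper's implementation:

```python
def cost_p(clients, S, dist, p):
    """Penalized k-median cost: each client pays min(d(j, S), p);
    clients paying exactly p are the (soft) outliers."""
    return sum(min(p, min(dist(j, i) for i in S)) for j in clients)

def local_search_1swap(F, S, clients, dist, p, rho=0.0):
    """Repeatedly apply rho-efficient 1-swaps w.r.t. penalty cost p
    until no such swap exists; rho = 0 means any strict improvement."""
    S = set(S)
    improved = True
    while improved:
        improved = False
        base = cost_p(clients, S, dist, p)
        for i_out in list(S):
            for i_in in set(F) - S:
                T = (S - {i_out}) | {i_in}
                if cost_p(clients, T, dist, p) < base - rho:  # efficient swap
                    S, improved = T, True
                    break
            if improved:
                break
    return S

# Toy instance on the line with k = 1: the far point becomes an outlier
# under penalty p = 5, and local search settles on the median 1.
clients = [0, 1, 2, 100]
dist = lambda a, b: abs(a - b)
S = local_search_1swap(F=clients, S={100}, clients=clients, dist=dist, p=5)
print(cost_p(clients, S, dist, 5))  # 7 (= 1 + 0 + 1 + penalty 5)
```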
Applying Theorem 5 with S∗ being the optimum solution for the k-Med-O instance, we have that the final solution S has cost_p(S) ≤ ∑_{j∈C} min{(3 + 2/ℓ) d_p(j, S∗), (2 + 2/ℓ) p} ≤ (3 + 2/ℓ)opt + (2 + 2/ℓ)zp. The second inequality holds since for inliers j in the solution S∗ we have d_p(j, S∗) ≤ d(j, S∗), and for outliers j we have d_p(j, S∗) ≤ p. We return S as the set of medians, and let j be an outlier if d_p(j, S) = p. Then the number of outliers our algorithm produces is at most (2 + 2/ℓ)z + (3 + 2/ℓ)opt/p = (2 + 2/ℓ)z + (2 + 2/ℓ)γz = (2 + 2/ℓ)(1 + γ)z. The cost of the solution is at most (3 + 2/ℓ)opt + (2 + 2/ℓ)zp = (3 + 2/ℓ)opt + (3 + 2/ℓ)opt/γ = (3 + 2/ℓ)(1 + 1/γ)opt.

Online Algorithm for k-Median with Outliers

In this section, we give our online algorithm for k-Med-O that proves Theorem 2 (and thus Theorem 3). As mentioned earlier, we indeed give a more general result that allows trade-offs between the approximation ratio, the number of outliers and the running time:

Lemma 6.
Let ℓ ≥ 1 be an integer, let ε > 0 be small enough, and let γ > 0 be a real number. There is a deterministic n^{O(ℓ)}-time algorithm for online k-median with outliers with a total recourse of O(k² log n log(nD) / ε). The algorithm achieves a bicriteria approximation of ((1/(1 − ε))(2 + 2/ℓ)(1 + γ), (1/(1 − ε))(3 + 2/ℓ)(1 + 2/γ)) in the static F setting, and a bicriteria approximation of the same form, with slightly larger constants, in the F = C setting. (When the distances are not polynomially bounded, the running time of the algorithm may be large; but using an appropriate ρ, we can reduce the running time to polynomial by losing a factor of (1 + ε) in the approximation ratio.)
By setting ℓ = γ = 1 and ε to be a small enough constant, Lemma 6 implies Theorem 2. On the other hand, one can set ℓ = γ = 1/ε to achieve an approximation ratio of 3 + O(ε) with O(z/ε) outliers and running time n^{O(1/ε)}. The goal of this section is to prove Lemma 6. To explain our main ideas more clearly, we assume F is static: the set F of potential medians is fixed and given at the beginning of the online algorithm. In Section D of the supplementary material, we show how the algorithm can be extended to the setting where F = C.

To avoid the case where the optimum solution has cost 0, we add an additive term of 0.1 to all definitions of costs: the cost of a solution to a k-Med-O instance, and cost_p(S). We can think of the instance as having one extra point and one extra median that are at distance 0.1 from each other and at distance ∞ from all other points in the metric. Since all distances are integers and the approximation ratio we are aiming at is less than 10, the additive term of 0.1 does not change our approximation ratio. Theorem 5 holds with an additive term of 0.1 added to the right side of the inequality.

In essence, our algorithm repeatedly applies ρ-efficient swaps w.r.t. the penalty cost p, for some carefully maintained parameters ρ and p. The main algorithm is described in Algorithm 1. In each time step t, we add the arriving point j_t to C (Step 3). Then we repeatedly perform (ρ := ε · cost_p(S)/k)-efficient swaps until no such operation exists (Loop 4). If the solution S obtained has more than (1/(1 − ε))(2 + 2/ℓ)(1 + γ)z outliers (defined as the points j with d_p(j, S) = p, or equivalently d(j, S) ≥ p), we then double p (Step 7) and redo the while loop. At the beginning of the algorithm, we set p to be a small enough number (Step 1).

Algorithm 1
Online algorithm for k-median

1: p ← min{0.1/(γz), 0.1}
2: for t ← 1 to n do
3:   C ← C ∪ {j_t}
4:   while there exists a (ρ := ε · cost_p(S)/k)-efficient ℓ-swap on S w.r.t. the penalty cost p do
5:     perform the swap operation
6:   if d_p(j, S) = p for more than (1/(1 − ε))(2 + 2/ℓ)(1 + γ)z points j ∈ C then
7:     p ← 2p
8:     goto Step 4

We start by analyzing the approximation ratio of the algorithm. At any moment of the algorithm, we use opt to denote the cost of the optimum solution for the k-Med-O problem defined by the current point set C. Theorem 5 gives the following.

Claim 7.
At any moment immediately after the while loop (Loop 4), we have (1 − ε) cost_p(S) ≤ (3 + 2/ℓ)opt + (2 + 2/ℓ)zp.

Lemma 8.
At any moment, we have p ≤ (3ℓ + 2)opt / (γ(ℓ + 1)z).

Combining Claim 7 and Lemma 8, at the end of each time step t, we have (1 − ε) cost_p(S) ≤ (3 + 2/ℓ)opt + (2 + 2/ℓ)zp ≤ (3 + 2/ℓ)opt + (2 + 2/ℓ)z · (3ℓ + 2)opt / (γ(ℓ + 1)z) = (3 + 2/ℓ)opt + 2(3 + 2/ℓ)opt/γ = (3 + 2/ℓ)(1 + 2/γ)opt. (This assumes z ≥ 1, but the resulting inequality holds trivially when z = 0.) Defining the outliers to be the points j with d_p(j, S) = p, our online algorithm achieves a bicriteria approximation ratio of ((1/(1 − ε))(2 + 2/ℓ)(1 + γ), (1/(1 − ε))(3 + 2/ℓ)(1 + 2/γ)), since Step 6 guarantees that the solution S has at most (1/(1 − ε))(2 + 2/ℓ)(1 + γ)z outliers.

We now proceed to the analysis of the total recourse of the online algorithm. For simplicity, we define opt′ := min_{S′⊆F:|S′|=k} cost_p(S′) to be the cost of the optimum for the current k-median instance with metric d_p. Notice the difference between opt and opt′: opt is for the original k-Med-O problem and opt′ is for the k-median with penalty problem (or k-median with metric d_p). So, opt′ depends on the current C and the current p. Like opt, opt′ can only increase during the course of the algorithm, as C only grows and p only increases.

Claim 9.
At any moment, we have p ≤ O(1) · opt′.

We define a stage of the online algorithm to be a period of the algorithm between two adjacent moments when we increase p in Step 7. That is, a stage is an inclusion-wise maximal period of the algorithm in which the value of p does not change. From now on, we fix a stage and let p be the value of p in the stage; so p is fixed during the stage. Assume the stage starts at time τ and ends at time τ′. Notice that the stage contains the tail of time τ, the head of time τ′, and the entire time τ″ for every τ″ ∈ [τ + 1, τ′ − 1]. An exceptional case is when τ = τ′, in which case the stage contains some period within time τ.

For every t ∈ [τ, τ′], let opt′_t be the optimum value for the k-median instance with C = {j_1, j_2, · · · , j_t} and metric d_p. So, for any t ∈ [τ, τ′], opt′_t is the value of opt′ at any moment that is in the stage and after Step 3 at time t. For t ∈ [τ + 1, τ′], we define ∆_t to be the value of cost_p(S) after Step 3 at time t, minus that before Step 3; we view this as the increase of cost_p(S) due to the arrival of j_t. Let ∆_τ be the value of cost_p(S) at the beginning of the stage, that is, at the moment immediately after p is increased to its value for this stage.

Lemma 10.
For every T ∈ [τ, τ′], we have ∑_{t=τ}^{T} ∆_t ≤ O(k log n) · opt′_T.

We need the following technical lemma from [12].
Lemma 11.
Let b ∈ ℝ^H_{≥0} for some integer H ≥ 1, and let B_{H′} = ∑_{h=1}^{H′} b_h for every H′ = 0, 1, · · · , H. Let 0 < a_1 ≤ a_2 ≤ · · · ≤ a_H be a sequence of real numbers and α > 0 be such that B_{H′} ≤ α · a_{H′} for every H′ ∈ [H]. Then we have

∑_{h=1}^{H} b_h / a_h ≤ α (ln(a_H / a_1) + 1).

We define H = τ′ − τ + 1. For every t ∈ [τ, τ′], we define b_{t−τ+1} = ∆_t and a_{t−τ+1} = opt′_t. We define B_{T−τ+1} for every T = τ − 1, τ, · · · , τ′ to be ∑_{t=τ}^{T} b_{t−τ+1} = ∑_{t=τ}^{T} ∆_t. By Lemma 10 we have B_{H′} ≤ α · opt′_{H′+τ−1} = α · a_{H′} for some α = O(k log n) and every H′ ∈ [H]. In each time t within the stage, cost_p(S) first increases by ∆_t in Step 3 (or becomes ∆_τ at the beginning of the stage if t = τ). Then, for every median we swap inside the while loop (Loop 4), we decrease cost_p(S) by at least ε · cost_p(S)/k ≥ ε · opt′_t/k, due to the use of efficient swaps. Noticing that opt′_t is non-decreasing in t, using Lemma 11 we can bound the total recourse in the stage by

∑_{t=τ}^{τ′} ∆_t / (ε · opt′_t / k) = (k/ε) ∑_{h=1}^{H} b_h / a_h ≤ (k/ε) · α (ln(a_H / a_1) + 1) = (αk/ε) (ln(opt′_{τ′} / opt′_τ) + 1).

Now it is time to consider all stages [τ, τ′] together. The summation of ln(opt′_{τ′} / opt′_τ) over all stages is the ln of the product of opt′_{τ′} / opt′_τ over all stages. For a time t that crosses many different stages, the opt′_t values depend on the p value of the stage; however, as p increases, opt′_t can only increase. Therefore, the summation is at most the ln of the ratio between the maximum possible opt′ value and the minimum possible opt′ value. So, this is at most O(log(nD)).
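Lemma 11 can be sanity-checked numerically. The sketch below (ours, for illustration only) instantiates the premise with the tightest allowed sequence a_h = B_h/α, which is non-decreasing since b is nonnegative, and verifies the claimed inequality on random inputs:

```python
import math
import random

def check_lemma11(b, alpha):
    """Given b_1..b_H >= 0 (with b_1 > 0) and alpha > 0, set
    a_h = B_h / alpha, so the premise B_{H'} <= alpha * a_{H'} holds
    with equality, and check sum b_h/a_h <= alpha (ln(a_H/a_1) + 1)."""
    B, a, s = [], [], 0.0
    for x in b:
        s += x
        B.append(s)
        a.append(s / alpha)  # tightest non-decreasing choice
    lhs = sum(x / ah for x, ah in zip(b, a))
    rhs = alpha * (math.log(a[-1] / a[0]) + 1)
    return lhs <= rhs + 1e-9

random.seed(0)
ok = all(check_lemma11([random.uniform(0.1, 1) for _ in range(50)], 2.0)
         for _ in range(100))
print(ok)  # True
```

With this choice of a_h the claim reduces to ∑ b_h/B_h ≤ ln(B_H/B_1) + 1, which follows from 1 − x ≤ −ln x and telescoping, so the check always passes.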
There are at most log₂(O(nD)) = O(log(nD)) stages. Thus, the total recourse over the whole algorithm is at most (αk/ε) · O(log(nD)) = O(k² log n log(nD) / ε). This finishes the proof of Lemma 6.

Experiments

In this section, we corroborate our theoretical findings by performing experiments on real world datasets. Our goal is to empirically show that the local search algorithm is stable and does few reclusterings, while maintaining a good approximation factor.
Algorithm implementation:
We modified our algorithm slightly to make it faster: when a new data point comes, instead of conducting local search directly, we assign the point to its nearest center; then we check whether the current cost is at least (1 + α) times the cost resulting from the last application of local search, and if not we continue to the next data point without doing any local operations. It is easy to see that this increases our approximation ratio by at most a (1 + α) factor. Though this modification does not improve our worst-case recourse bound, it reduces the number of local operations needed when the incoming data are non-adversarial, which is often the case in practice. Throughout the experiment we set α = 0.1.
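This lazy-update rule can be sketched as follows; the class and the toy one-dimensional cost are ours, and the reclustering step is a placeholder standing in for the local search of Algorithm 1:

```python
class LazyClustering:
    """Minimal sketch of the lazy-update rule: assign each new point to its
    nearest center, and rerun the (expensive) reclustering step only when the
    cost has grown by a (1 + alpha) factor since the last rerun."""

    def __init__(self, centers, alpha=0.1):
        self.centers = list(centers)
        self.alpha = alpha
        self.points = []
        self.last_ls_cost = None   # cost recorded at the last local search
        self.recluster_calls = 0   # how often the expensive step ran

    def cost(self):
        # Toy 1-d k-median cost; nearest-center assignment is implicit.
        return sum(min(abs(j - i) for i in self.centers) for j in self.points)

    def add(self, point):
        self.points.append(point)
        c = self.cost()
        if self.last_ls_cost is None or c > (1 + self.alpha) * self.last_ls_cost:
            self.recluster_calls += 1   # placeholder for local search
            self.last_ls_cost = c

# Many nearby arrivals trigger only a few reclusterings.
lc = LazyClustering(centers=[0], alpha=0.5)
for x in [0, 0, 1, 0, 1, 2]:
    lc.add(x)
print(lc.recluster_calls < 6)  # True: most arrivals skip the expensive step
```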
Similar to [18], we test the algorithm on three UCI data sets [21]: (i) SKIN with 245,057 data points of dimension 4; (ii) COVERTYPE with 581,012 data points of dimension 54 (in the experiment we only use the first 10 features of COVERTYPE because the other features are categorical); (iii) LETTER with 20,000 data points of dimension 16. To keep the duration of the experiments short, we restrict the experiments to the first 10K data points in each data set. We set the algorithm parameters ε = 0.1 and γ = 1; these were chosen to minimize the number of discarded outliers. We set the available center locations F = C, so when a new data point comes, it is added to both F and C. Throughout the experiment, we set the number of outliers to be z = 200, and tried three different values of k ∈ {10, 50, 100}. We observe that in all runs, our algorithm removes only slightly more than z outliers, hence achieving a small constant approximation factor on the number of discarded outliers.

Results:
We first show how the recourse grows over time in Figure 1. One can observe that the recourse dependence on k is roughly O(k log n), instead of the O(k log n log(nD)) worst-case bound predicted by our theoretical result. We also observe that the growth rate of the recourse is lower for the COVERTYPE and LETTER data sets compared to SKIN. This is because of the data ordering in SKIN; if we randomly shuffle the SKIN data set and re-run the algorithm, then we get a graph similar to the other two data sets.

Figure 1: Recourse over time for the SKIN (a), COVERTYPE (b), and LETTER (c) data sets with k = 10, 50, 100. The x-axis is plotted in log-scale.
Figure 2: Estimated approximation ratio over time for the SKIN (a), COVERTYPE (b), and LETTER (c) data sets with k = 10, 50, 100.

Now we turn to the quality of the clustering maintained by our algorithm. Since the optimal solution is hard to compute, we use the clustering produced by the offline k-means−− algorithm of [7] as a coarse estimate of OPT. Specifically, for every 50 newly-arrived data points, we compute 5 offline k-means−− solutions (with different initializations) on all data points arrived so far, and take the best one as the estimate of OPT at this time point. We then linearly interpolate between these estimates to get an OPT curve for every time point. Figure 2 shows the estimated approximation ratio over time. We see that the ratio is bounded by . most of the time. One might notice that the ratio sometimes even falls below 1. There are two reasons for this: 1) we only have an estimate of the real OPT; 2) the bicriteria approximation means our algorithm may remove more than z = 200 outliers, while OPT is calculated by removing at most z outliers.

Lastly, we also ran our experiments allowing z to increase over time and observed similar behavior. Due to space constraints, we give those results in the supplementary material.

References

[1] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In Proceedings of STOC 2001.
[2] Aaron Bernstein, Jacob Holm, and Eva Rotenberg. Online bipartite matching with amortized O(log² n) replacements. J. ACM, 66(5):37:1–37:23, 2019.
[3] Jaroslaw Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median and positive correlation in budgeted optimization.
ACM Trans. Algorithms, 13(2):23:1–23:31, March 2017.
[4] M. Charikar, S. Guha, D. Shmoys, and E. Tardos. A constant-factor approximation algorithm for the k-median problem. 1999.
[5] M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2001.
[6] Moses Charikar, Chandra Chekuri, Tomás Feder, and Rajeev Motwani. Incremental clustering and dynamic information retrieval. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, STOC '97, pages 626–635, New York, NY, USA, 1997. ACM.
[7] Sanjay Chawla and Aristides Gionis. k-means−−: A unified approach to clustering and outlier detection. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013, Austin, Texas, USA, pages 189–197, 2013.
[8] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of ACM-SIAM SODA 2008.
[9] Vincent Cohen-Addad, Niklas Hjuler, Nikos Parotsidis, David Saulpic, and Chris Schwiegelshohn. Fully dynamic consistent facility location. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 3250–3260, 2019.
[10] Dimitris Fotakis. Online and incremental algorithms for facility location. SIGACT News, 42(1):97–131, March 2011.
[11] Gramoz Goranci, Monika Henzinger, and Dariusz Leniowski. A tree structure for dynamic facility location. In , pages 39:1–39:13, 2018.
[12] Xiangyu Guo, Janardhan Kulkarni, Shi Li, and Jiayi Xian. The power of recourse: Better algorithms for facility location in online and dynamic models, 2020.
[13] Anupam Gupta and Amit Kumar. Greedy algorithms for Steiner forest. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 871–878, 2015.
[14] Anupam Gupta, Amit Kumar, and Cliff Stein. Maintaining assignments online: Matching, scheduling, and flows. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 468–479, 2014.
[15] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proceedings, International Conference on Very Large Data Bases (VLDB), 10(7):757–768, March 2017.
[16] K. Jain and V. V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM, 48(2):274–296, 2001.
[17] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 646–659, New York, NY, USA, 2018. Association for Computing Machinery.
[18] Silvio Lattanzi and Sergei Vassilvitskii. Consistent k-clustering. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1975–1984. PMLR, 2017.
[19] S. Li and O. Svensson. Approximating k-median via pseudo-approximation. ACM Symp. on Theory of Computing (STOC), 2013.
[20] Edo Liberty, Ram Sriharsha, and Maxim Sviridenko. An algorithm for online k-means clustering. In , pages 81–89. SIAM, 2016.
[21] M. Lichman. UCI machine learning repository, 2013.
[22] A. Meyerson. Online facility location. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, FOCS '01, pages 426–, Washington, DC, USA, 2001. IEEE Computer Society.
[23] Lionel Ott, Linsey Pang, Fabio T. Ramos, and Sanjay Chawla. On integrated clustering and outlier detection. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1359–1367. 2014.
[24] Napat Rujeerapaiboon, Kilian Schindler, Daniel Kuhn, and Wolfram Wiesemann. Size matters: Cardinality-constrained clustering and outlier detection via conic optimization. SIAM Journal on Optimization, 29(2):1211–1239, 2019.
[25] David P. Williamson and David B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, New York, NY, USA, 1st edition, 2011.
A Missing Proofs from Section 2
In this section we prove Theorem 5.
Theorem 5.
Let S and S* be two sets of medians with |S| = |S*| = k. Let p, ρ ≥ 0, and let ℓ ≥ 1 be an integer. If there are no ρ-efficient ℓ-swaps on S w.r.t. the penalty cost p, then we have

cost_p(S) ≤ Σ_{j ∈ C} min{ (3 + 2/ℓ)·d_p(j, S*), (1 + 1/ℓ)·p } + kρ.

Proof.
By making copies of medians, we assume S and S* are disjoint. For every j ∈ C, define σ(j) and σ*(j) to be the closest median of j in S and S* respectively. Let O* = { j : d_p(j, S*) ≥ (ℓ+1)/(3ℓ+2) · p }; these are the points j with min{ (3 + 2/ℓ)·d_p(j, S*), (1 + 1/ℓ)·p } = (1 + 1/ℓ)·p. For every i* ∈ S*, define φ(i*) to be the nearest median of i* in S, according to the metric d_p, breaking ties arbitrarily. We partition S into three parts as follows:

• S₀ := { i ∈ S : φ⁻¹(i) = ∅ }.
• S₁ := { i ∈ S : 1 ≤ |φ⁻¹(i)| ≤ ℓ }.
• S₊ := { i ∈ S : |φ⁻¹(i)| > ℓ }.

Let S*₁ := φ⁻¹(S₁) (which is defined as ∪_{i ∈ S₁} φ⁻¹(i)) and S*₊ := φ⁻¹(S₊); thus (S*₁, S*₊) is a partition of S*. Moreover, |S₁| ≤ |S*₁| and |S₊| ≤ |S*₊|/(ℓ+1). This implies

|S₀| = k − |S₁| − |S₊| ≥ (|S*₁| − |S₁|) + (k − |S*₁|) − |S*₊|/(ℓ+1) = (|S*₁| − |S₁|) + |S*₊| − |S*₊|/(ℓ+1) = |S*₁| − |S₁| + (ℓ/(ℓ+1))·|S*₊|.  (1)

We define a random mapping β : S* → S₀ ∪ S₁ in the following way; see Figure i for an illustration of the procedure. We first define β over S*₁. For every i ∈ S₁, we take an arbitrary i* ∈ φ⁻¹(i) and define β(i*) = i; for all other facilities i*′ in φ⁻¹(i), we define β(i*′) to be an arbitrary median in S₀. So, |S₁| medians in S*₁ are mapped to S₁ by β, and the remaining |S*₁| − |S₁| facilities in S*₁ are mapped to S₀. By (1), we can make β restricted to S*₁ an injective function. Moreover, at least (ℓ/(ℓ+1))·|S*₊| facilities in S₀ do not have preimages so far; call these facilities free facilities.
Then, we map S*₊ to these free facilities in a random way so that each free facility is mapped to at most twice and, in expectation, each free facility has at most (1 + 1/ℓ) preimages under β.

Figure i: The definition of the function β. The vertices at the top are S, the vertices at the bottom are S*, ℓ = 3, and the dashed lines give the definition of φ. The sets S₀, S₁, S₊, S*₁, S*₊ are depicted in the figure, and a possible function β is given by the solid lines and curves.

With the random β defined, we describe a set of test swaps that will be used in our analysis. For every i ∈ S₁, we have a test swap (φ⁻¹(i), β(φ⁻¹(i))). For every i* ∈ S*₊, we have a test swap ({i*}, {β(i*)}). It is easy to see that each test swap (A*, A) has A* ⊆ S*, A ⊆ S and |A*| = |A| ≤ ℓ. Moreover, we have the following properties:

(P1) Every median i* ∈ S* is swapped in exactly once over all test swaps.
(P2) In expectation over all possible β's, every median i ∈ S is swapped out at most (1 + 1/ℓ) times in the test swaps.
(P3) For any test swap (A*, A), we have φ⁻¹(A) ⊆ A*.

Figure ii: How to reconnect points, and the lower bound on the decrease in the connection cost for each point j, using the Venn diagram for the three sets σ⁻¹(A), σ*⁻¹(A*) and O*.

(P1) and (P2) follow from the construction of β. To see (P3), consider the two types of test swaps. If the test swap is ({i*}, {β(i*)}) for some i* ∈ S*₊, then β(i*) ∈ S₀ and thus φ⁻¹(β(i*)) = ∅. If the test swap is (φ⁻¹(i), β(φ⁻¹(i))) for some i ∈ S₁, then β(φ⁻¹(i)) contains i, and all the other elements in the set are in S₀.
Thus φ⁻¹(β(φ⁻¹(i))) = φ⁻¹(i).

Focus on a fixed test swap (A*, A). After opening A* and closing A, we can reconnect a subset of the points in σ⁻¹(A) ∪ σ*⁻¹(A*). We guarantee that all points in σ⁻¹(A) are reconnected. See Figure ii for how we reconnect the points.

• For a point j ∈ σ*⁻¹(A*) \ O*, we reconnect j from σ(j) to σ*(j) ∈ A*. The decrease in the connection cost of j is d_p(j, σ(j)) − d_p(j, σ*(j)) = d_p(j, S) − d_p(j, S*).

• For a point j ∈ σ⁻¹(A) \ σ*⁻¹(A*) \ O*, we reconnect j to φ(σ*(j)). Notice that σ*(j) ∉ A*. By (P3), we have φ(σ*(j)) ∉ A. Thus the reconnection is valid. By the triangle inequality and the definition of φ, for every j ∈ σ⁻¹(A) \ σ*⁻¹(A*) \ O*, we have

d_p(j, φ(σ*(j))) ≤ d_p(j, σ*(j)) + d_p(σ*(j), φ(σ*(j))) ≤ d_p(j, σ*(j)) + d_p(σ*(j), σ(j)) ≤ d_p(j, σ*(j)) + d_p(j, σ*(j)) + d_p(j, σ(j)) = 2·d_p(j, σ*(j)) + d_p(σ(j), j).

So the decrease in the connection cost of j is d_p(j, σ(j)) − d_p(j, φ(σ*(j))) ≥ −2·d_p(j, S*).

• For a point j ∈ σ⁻¹(A) ∩ O*, we reconnect j arbitrarily, and the decrease in the connection cost of j is at least d_p(j, σ(j)) − p = d_p(j, S) − p, as p is the diameter of the metric d_p.

As the test swap operation is not ρ-efficient, we have

Σ_{j ∈ σ*⁻¹(A*) \ O*} (d_p(j, S) − d_p(j, S*)) − Σ_{j ∈ σ⁻¹(A) \ O*} 2·d_p(j, S*) + Σ_{j ∈ σ⁻¹(A) ∩ O*} (d_p(j, S) − p) ≤ |A|·ρ.  (2)

Above, we used that σ⁻¹(A) \ σ*⁻¹(A*) \ O* ⊆ σ⁻¹(A) \ O*.

We now add up (2) over all test swap operations, and consider the expectation of the left side of the summation over all random choices of β:

• The sum of the first term on the left side of (2) is always exactly Σ_{j ∈ C \ O*} (d_p(j, S) − d_p(j, S*)), due to (P1).
• Consider the expectation of the sum of the second term on the left side of (2). Since each i ∈ S is swapped out at most (1 + 1/ℓ) times in expectation by (P2), the expectation of the sum of the second term is at least −(2 + 2/ℓ)·Σ_{j ∈ C \ O*} d_p(j, S*).

• Consider the expectation of the sum of the third term on the left side of (2). Using that d_p has diameter at most p, and (P2), the expectation is at least (1 + 1/ℓ)·Σ_{j ∈ O*} (d_p(j, S) − p) ≥ Σ_{j ∈ O*} d_p(j, S) − (1 + 1/ℓ)·|O*|·p. We changed the coefficient before a non-negative term from (1 + 1/ℓ) to 1 in the inequality; this is sufficient.

Overall, the expectation of the sum of the left side of (2) over all test swap operations is at least

Σ_{j ∈ C \ O*} (d_p(j, S) − d_p(j, S*)) − (2 + 2/ℓ)·Σ_{j ∈ C \ O*} d_p(j, S*) + Σ_{j ∈ O*} d_p(j, S) − (1 + 1/ℓ)·|O*|·p
= Σ_{j ∈ C} d_p(j, S) − (3 + 2/ℓ)·Σ_{j ∈ C \ O*} d_p(j, S*) − |O*|·(1 + 1/ℓ)·p
= Σ_{j ∈ C} d_p(j, S) − Σ_{j ∈ C} min{ (3 + 2/ℓ)·d_p(j, S*), (1 + 1/ℓ)·p },

where the last equality used the definition of O*.

The summation of the right side of (2) over all test swaps is always exactly kρ. Therefore, we have

Σ_{j ∈ C} d_p(j, S) − Σ_{j ∈ C} min{ (3 + 2/ℓ)·d_p(j, S*), (1 + 1/ℓ)·p } ≤ kρ.

Rearranging the terms and replacing Σ_{j ∈ C} d_p(j, S) with cost_p(S) finishes the proof of the theorem.

B Missing Proofs from Section 3.1
Claim 7.
At any moment immediately after the while loop (Loop 4), we have (1 − ε)·cost_p(S) ≤ (3 + 2/ℓ)·opt + (1 + 1/ℓ)·zp.

Proof. After the while loop, no (ε·cost_p(S)/k)-efficient swaps can be performed. Applying Theorem 5 with S* being the optimum solution for the k-Med-O instance at the moment, we have

cost_p(S) ≤ Σ_{j ∈ C} min{ (3 + 2/ℓ)·d_p(j, S*), (1 + 1/ℓ)·p } + k · (ε·cost_p(S)/k) ≤ (3 + 2/ℓ)·opt + (1 + 1/ℓ)·zp + ε·cost_p(S).

Moving ε·cost_p(S) to the left side gives the claim.

Lemma 8.
At any moment, we have p ≤ 2(3ℓ + 2)·opt / (γ(ℓ + 1)·z).

Proof. The statement holds at the beginning, since opt = 0 and p = 0. As opt can only increase during the algorithm, it suffices to prove the inequality at any moment after we run Step 7; this is the only step in which we increase p. We assume z ≥ 1, since if z = 0 the lemma is trivial.

Focus on any moment before we run Step 7. Define p* > 0 to be the real number such that (1 + 1/ℓ)(1 + γ)·z·p* = (3 + 2/ℓ)·opt + (1 + 1/ℓ)·z·p*. Then, if p > p*, the condition in Step 6 does not hold: otherwise, we would have (1 − ε)·cost_p(S) > (1 − ε) · (1/(1 − ε))·(1 + 1/ℓ)(1 + γ)·z·p ≥ (3 + 2/ℓ)·opt + (1 + 1/ℓ)·z·p, contradicting Claim 7. Since we assumed we are going to run Step 7, we have p ≤ p*. So, after Step 7, we have p ≤ 2p* = 2 · (3 + 2/ℓ)·opt / (γ(1 + 1/ℓ)·z) = 2(3ℓ + 2)·opt / (γ(ℓ + 1)·z).

C Missing Proofs from Section 3.2
Claim 9.
At any moment, we have p ≤ O(1)·opt′.

Proof. Again it suffices to show the inequality at any moment after we run Step 7. Suppose we have just completed the while loop. Applying Theorem 5 with S* being the optimum solution for the current k-median instance with metric d_p, we have cost_p(S) ≤ (3 + 2/ℓ)·opt′ / (1 − ε). If at the moment we have p > (3 + 2/ℓ)·opt′ / (1 − ε), then the condition for Step 6 will not be satisfied, even if z = 0. So, before we run Step 7, we must have p ≤ (3 + 2/ℓ)·opt′ / (1 − ε). After the step, we have p ≤ 2(3 + 2/ℓ)·opt′ / (1 − ε) = O(1)·opt′.

Lemma 10.
For every T ∈ [τ, τ′], we have Σ_{t=τ}^{T} Δ_t ≤ O(k log n)·opt′_T.

Proof. We can show that Δ_τ ≤ O(1)·opt′_τ ≤ O(1)·opt′_T by applying Theorem 5 with S* being the optimum solution that defines opt′_τ. Thus it suffices to bound Σ_{t=τ+1}^{T} Δ_t.

Let S* be the optimum solution for the k-median instance with point set {j_1, j_2, ..., j_T} and metric d_p. We are only interested in the points j_{τ+1}, j_{τ+2}, ..., j_T in the analysis. Fix any i* ∈ S*. Let {j_{t_1}, j_{t_2}, ..., j_{t_s}} be the set of points in {j_{τ+1}, j_{τ+2}, ..., j_T} connected to i* in the solution S*, where τ < t_1 < t_2 < ··· < t_s ≤ T ≤ τ′. For notational convenience, we let j′_r = j_{t_r} and Δ′_r = Δ_{t_r} for every r ∈ [s]. We now bound Σ_{r=1}^{s} Δ′_r. We assume s ≥ 1, since otherwise the quantity is 0.

We can bound Δ′_1 by p, and by Claim 9, we have Δ′_1 ≤ p ≤ O(1)·opt′_τ ≤ O(1)·opt′_T. We then bound Δ′_r for any integer r ∈ [2, s]. Using Theorem 5, we can show that at the beginning of time t_r (or equivalently, at the end of time t_r − 1), we have cost_p(S) ≤ O(1)·opt′_{t_r − 1} ≤ O(1)·opt′_T. For this S, we have Σ_{u=1}^{r−1} ( d_p(j′_u, S) + d_p(j′_u, i*) ) ≤ O(1)·opt′_T. The inequality holds since the summation of each of the two terms is at most O(1)·opt′_T. So there is at least one point j′_u such that d_p(j′_u, S) + d_p(j′_u, i*) ≤ O(1)·opt′_T / (r − 1), implying d_p(i*, S) ≤ O(1)·opt′_T / (r − 1). Therefore, we have Δ′_r ≤ d_p(i*, j′_r) + d_p(i*, S) ≤ d_p(i*, j′_r) + O(1)·opt′_T / (r − 1). Then

Σ_{r=1}^{s} Δ′_r ≤ O(1)·opt′_T + Σ_{r=2}^{s} ( d_p(i*, j′_r) + O(1)·opt′_T / (r − 1) ) ≤ Σ_{r=1}^{s} d_p(i*, j′_r) + O(log s)·opt′_T ≤ O(log T)·opt′_T = O(log n)·opt′_T.

Considering all the k medians i* ∈ S* together, we have Σ_{t=τ+1}^{T} Δ_t ≤ O(k log n)·opt′_T.

Lemma 11.
Let b ∈ R^H_{≥0} for some integer H ≥ 1. Let B_{H′} = Σ_{h=1}^{H′} b_h for every H′ = 0, 1, ..., H. Let 0 < a_1 ≤ a_2 ≤ ··· ≤ a_H be a sequence of real numbers and α > 0 be such that B_{H′} ≤ α·a_{H′} for every H′ ∈ [H]. Then we have

Σ_{h=1}^{H} b_h / a_h ≤ α·( ln(a_H / a_1) + 1 ).

Proof. Define a_{H+1} = +∞. Then

Σ_{h=1}^{H} b_h / a_h = Σ_{h=1}^{H} (B_h − B_{h−1}) / a_h = Σ_{h=1}^{H} B_h·( 1/a_h − 1/a_{h+1} ) = Σ_{h=1}^{H} (B_h / a_h)·( 1 − a_h / a_{h+1} )
≤ α·Σ_{h=1}^{H} ( 1 − a_h / a_{h+1} ) = αH − α·Σ_{h=1}^{H−1} a_h / a_{h+1}
≤ αH − α(H − 1)·( a_1 / a_H )^{1/(H−1)} = α(H − 1)·( 1 − e^{−ln(a_H / a_1)/(H−1)} ) + α
≤ α(H − 1)·ln(a_H / a_1)/(H − 1) + α = α·( ln(a_H / a_1) + 1 ).

The inequality in the second line used the following fact: if the product of H − 1 positive numbers is a_1/a_H, then their sum is minimized when they are all equal. The inequality in the third line used that 1 − e^{−x} ≤ x for every x.

D Handling the F = C Setting
When F = C, a small issue with the analysis is that opt and opt′ may decrease as the algorithm proceeds. However, they can only decrease by at most a factor of 2 from one moment to any later moment. This holds due to the following fact: if we have a star (i, C′) and any metric d′, then min_{j* ∈ C′} Σ_{j ∈ C′} d′(j*, j) ≤ 2·Σ_{j ∈ C′} d′(i, j). That is, including additional medians in F on top of F = C can only save a factor of 2.

To address the issue, we define opt to be the optimum value of the current k-Med-O instance, and define ōpt at any moment of the algorithm to be the maximum opt we have seen until the moment. Then at any moment of the algorithm, we have opt ≤ ōpt ≤ 2·opt. Moreover, ōpt can only increase as the algorithm proceeds. Claim 7 still holds, and Lemma 8 holds with opt replaced by ōpt or 2·opt. Then eventually we get a bifactor of ( (1/(1 − ε))·(3 + 2/ℓ)·(1 + γ), (2/(1 − ε))·(3 + 2/ℓ)·(1 + 1/γ) ).

We can use the same trick to handle opt′ in the analysis of the recourse. In this case, the factor of 2 is hidden in the O(·) notation and thus the recourse bound is not affected. More precisely, we define ōpt′ to be the maximum opt′ we have seen until the moment. Then we always have opt′ ≤ ōpt′ ≤ 2·opt′, and ōpt′ can only increase. Claim 9 still holds. Then we fix a stage whose p value is p and assume the stage starts at time τ and ends at time τ′. For every t ∈ [τ, τ′], define ōpt′_t to be the value of ōpt′ at any moment that is in the stage and after Step 3 at time t. Then Lemma 10 still holds, and in the end we can bound the recourse by O(k log n log(nD) / ε). Thus we proved Lemma 6.
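The star fact above (choosing the median from C′ itself loses at most a factor of 2) can be sanity-checked numerically. The sketch below is illustrative only, with hypothetical helper names; the assertion it checks follows from the triangle inequality by picking the point of C′ closest to i.

```python
import math
import random

def best_restricted_median_cost(Cp, dist):
    # Cost of the best median when it may only be chosen from C' itself.
    return min(sum(dist(js, j) for j in Cp) for js in Cp)

def star_fact_holds(i, Cp, dist):
    # The fact used above: min_{j* in C'} sum_{j in C'} d(j*, j)
    #   <= 2 * sum_{j in C'} d(i, j).
    # Proof idea: take j* in C' closest to i; each d(j*, j) is at most
    # d(j*, i) + d(i, j), and |C'| * d(j*, i) is at most the star cost.
    star_cost = sum(dist(i, j) for j in Cp)
    return best_restricted_median_cost(Cp, dist) <= 2 * star_cost

# Spot-check on random Euclidean instances.
random.seed(0)
rand_pt = lambda: (random.random(), random.random())
assert all(star_fact_holds(rand_pt(), [rand_pt() for _ in range(8)], math.dist)
           for _ in range(100))
```

The factor 2 is tight in general (consider a star center far from a cluster of co-located points of C′), which is why the analysis tracks it explicitly.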
E Additional Experiment Results for the Incremental z Setting

Here we include experiment results for the incremental z setting, where the number of outliers z changes with time. We let z grow uniformly as follows: we still focus on the first 10K data points, and for each time t ∈ [1, 10000] we set the number of allowed outliers to z_t = t × . So as more data points arrive, we allow more outliers to be removed. All other parameters are the same as in Section 4: ε = 0., γ = 1, k ∈ {10, 50, 100}, and available center locations F = C.

Figure 3: Recourse over time in the incremental z setting, for the SKIN (a), COVERTYPE (b), and LETTER (c) data sets with k = 10, 50, 100.