Online Coresets for Clustering with Bregman Divergences
Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, Supratim Shit
Rachit Chhaya (IIT Gandhinagar) [email protected]
Jayesh Choudhari (University of Warwick) [email protected]
Anirban Dasgupta (IIT Gandhinagar) [email protected]
Supratim Shit* (IIT Gandhinagar) [email protected]

* Corresponding author
December 14, 2020

Abstract
We present algorithms that create coresets in an online setting for clustering problems according to a wide subset of Bregman divergences. Notably, our coresets have a small additive error, similar in magnitude to the lightweight coresets [1], and take update time O(d) for every incoming point, where d is the dimension of the point. Our first algorithm gives online coresets of size Õ(poly(k, d, ε, µ)) for k-clustering according to any µ-similar Bregman divergence. We further extend this algorithm to show the existence of non-parametric coresets, where the coreset size is independent of k, the number of clusters, for the same subclass of Bregman divergences. Our non-parametric coresets are larger by a factor of O(log n) (n is the number of points) and have a similar (small) additive guarantee. At the same time, our coresets also function as lightweight coresets for non-parametric versions of Bregman clustering such as DP-Means. While these coresets provide additive error guarantees, they are also significantly smaller (scaling with O(log n), as opposed to O(d^d) for points in R^d) than the (relative-error) coresets obtained in [2] for DP-Means. While our non-parametric coresets are existential, we give an algorithmic version under certain assumptions.

Keywords: Online · Streaming · Coreset · Scalable · Clustering · Bregman divergence · Non-Parametric
Clustering is perhaps one of the most frequently used operations in data processing, and a canonical definition of the clustering problem is via the k-median, in which one proposes k possible centers such that the sum of distances of every point to its closest center is minimized. There has been a plethora of work, both theoretical and practical, devoted to finding efficient and provable clustering algorithms in this k-median setting. However, most of this literature is devoted to dissimilarity measures that are algorithmically easier to handle, namely the various ℓp norms, especially the Euclidean norm. Other dissimilarity measures, e.g. Kullback-Leibler or Itakura-Saito, are often more appropriate depending on the data. A mathematically elegant family of dissimilarity measures that has found wide use is the Bregman divergences, which include, for instance, the squared Euclidean distance, the Mahalanobis distance, the Kullback-Leibler divergence, the Itakura-Saito dissimilarity and many others.

While being mathematically satisfying, the chief drawback of working with Bregman divergences is algorithmic: most of these divergences satisfy neither symmetry nor the triangle inequality. Hence, developing efficient clustering algorithms for them has been a much harder problem to tackle. Banerjee et al. [3] have done a systematic study of the k-median clustering problem under Bregman divergences, and proposed algorithms that are generalizations of the
Lloyd's iterative algorithm for the Euclidean k-means problem. However, scalability remains a major issue. Given that there are no theoretical bounds on the convergence of Lloyd's algorithm in the general Bregman setting, a decent solution is often achieved only by running enough iterations as well as searching over multiple initializations. This is clearly expensive when the number of data points and the data dimension are large.

Coresets, a data summarization technique to enable efficient optimization, have found multiple uses in many problems, especially in computational geometry and more recently in machine learning via randomized numerical linear algebraic techniques. The aim is to judiciously select (and reweigh) a set of points from the input, so that solving the optimization problem on the coreset gives a guaranteed approximation to the optimization problem on the full data. In this work we explore the use of coresets to make Bregman clustering more efficient. Our aim is to give coresets that are small: the dependence on the number of points as well as the dimension should be linear or better, and it is not a priori clear that this can be achieved for all Bregman divergences. Most coresets for k-means and various linear algebraic problems, while being sublinear in the number of points, are often super-linear in the dimension of the data. However, in big-data setups it is fairly common for the number of dimensions to be almost of the same order of magnitude as the number of points. Coresets that trade off being sublinear in the number of points while increasing the dependence on the dimension (to, say, exponential) might not be desirable in such scenarios.

A further complication is the dependence of the coreset size on the number of clusters: k, the number of clusters, can be large and, more importantly, it can be unknown, to be determined only after exploratory analysis with clustering. When the number of clusters is unknown, it is unclear how to apply a coreset construction that needs knowledge of k. Recent work by Huang et al. [4] shows that for relative error coresets for Euclidean k-means, a linear dependence of coreset size on k is both sufficient and inevitable.

In this work, we tackle these questions for Bregman divergences. We develop coresets with small additive error guarantees. Such results have been obtained in the Euclidean setting by Bachem et al. [1], and in the online subspace embedding setting by Cohen et al. [5]. We next show the existence of non-parametric coresets, where the coreset size is independent of k, the parameter representing the number of cluster centers. We utilize the sensitivity framework of [6] jointly with the barrier functions method of [7] in order to achieve this. A non-parametric coreset will be useful in problems such as DP-Means clustering [2] and extreme clustering [8]. We now formally describe the setup and list our contributions.

Given A ∈ R^{n×d} whose rows (aka points) arrive in streaming fashion, let A_i ∈ R^{i×d} represent the first i points that have arrived and let C_i be the coreset maintained for A_i. Let ϕ_i denote the mean point of A_i, i.e., ϕ_i = (1/i) Σ_{j≤i} a_j, and let ϕ be the mean of A. Let X ∈ R^{k×d} denote a candidate set of k centers in R^d and let f_X(A_i) be the total sum of distances of each point a ∈ A_i from its closest center in X, according to a chosen Bregman divergence.
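To make this cost function concrete, the following Python sketch computes f_X(A) for a generic Bregman divergence supplied as a callable; the squared Euclidean divergence used in the example is only one member of the family considered in this paper, and the function names are ours, introduced purely for illustration.

```python
import numpy as np

def squared_euclidean(a, x):
    # d_Phi(a, x) for Phi(z) = ||z||^2: the squared Euclidean distance.
    return float(np.sum((a - x) ** 2))

def clustering_cost(A, X, divergence=squared_euclidean, weights=None):
    """f_X(A) = sum_a w_a * min_{x in X} d_Phi(a, x)."""
    if weights is None:
        weights = np.ones(len(A))
    total = 0.0
    for w, a in zip(weights, A):
        total += w * min(divergence(a, x) for x in X)
    return total

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))   # 100 points in R^5
X = rng.normal(size=(3, 5))     # a candidate set of k = 3 centers
print(clustering_cost(A, X))
```

The weighted variant is the form in which a coreset is evaluated: the coreset points keep the weights they were sampled with.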
We give algorithms which return a coreset C that ensures the following for any X:

| f_X(C) − f_X(A) | ≤ ε ( f_X(A) + f_ϕ(A) )    (1)

The following are our main contributions.

• We give an algorithm named ParametricFilter (Algorithm 1) which ensures property (1) for any X ∈ R^{k×d} with constant probability. ParametricFilter returns a coreset C for A. It takes O(d) update time and uses O(d) working space. The expected size of the coreset C is O( (dk log(1/ε))/(ε²µ²) ( log n + log(f_ϕ(A)) − log(f_ϕ(a_1)) ) ) (Theorem 4.1). Here f_x(·) is a Bregman divergence that is µ-similar to some squared Mahalanobis distance.

• For the special case of k-means clustering, ParametricFilter builds an online coreset C for A which ensures property (1) for all X ∈ R^{k×d} with constant probability. The update time and working space are O(d). The expected coreset size is O( (dk log(1/ε))/ε² ( log n + log(f_ϕ(A)) − log(f_ϕ(a_1)) ) ) (Corollary 4.1).

• We show that it is impossible to get a non-parametric coreset for the clustering problem which ensures a relative error approximation to the optimal cost (Theorem 5.1). We then give an existential result for a non-parametric coreset with a small additive error approximation. For this we present a method, DeterministicFilter (Algorithm 2), which uses an oracle while taking the sampling decision. The coreset ensures (1) for any X with at most n centers in R^d; hence we call it a non-parametric coreset. The algorithm returns a coreset of size O( (log n)/(ε²µ²) ( log n + log(f_ϕ(A)) − log(f_ϕ(a_1)) ) ) (Theorem 5.2). Here f_X(·) is a Bregman divergence that is µ-similar to some squared Mahalanobis distance.
• Again, for the special case of clustering based on the squared Euclidean distance,
DeterministicFilter builds a coreset C for A which ensures (1) for any X with at most n centres in R^d. The method returns a coreset of size O( (log n)/ε² ( log n + log(f_ϕ(A)) − log(f_ϕ(a_1)) ) ) (Corollary 5.1).

• The coresets from DeterministicFilter can be considered as non-parametric coresets for DP-Means clustering (Theorem 5.3). The coreset size is O( (log n)/(µ²ε²) ( log n + log(f_ϕ(A)) − log(f_ϕ(a_1)) ) ), as opposed to O(d^d k∗ / ε²) [9], where k∗ is the number of optimal centers for DP-Means clustering.

• Under certain assumptions, we propose an algorithm named NonParametricFilter (Algorithm 3) which creates a coreset for non-parametric clustering.

Except for the existential result, the above contributions can also be made to hold in the online setting, i.e., at every point i ∈ [n] the set C_i maintained for A_i ensures the guarantee with some constant probability, by taking a union bound over all i ∈ [n]. Note that this is a stronger guarantee, and in this case the expected sample size gets multiplied by a factor of O(log n).

Here we define the notation that we use in the rest of the paper. The set of the first n natural numbers is represented by [n]. A bold lower case letter denotes a vector or a point, e.g. a, and a bold upper case letter denotes a matrix or a set of points as defined by the context, e.g. A. In general A has n points, each in R^d. a_i denotes the i-th row of the matrix A and a^j denotes its j-th column. We use the notation A_i to denote the matrix, or set, formed by the first i rows or points of A seen up to a given time in the streaming setting.
Definition 2.1 (Bregman divergence). For any strictly convex, differentiable function Φ : Z → R, the Bregman divergence with respect to Φ, for all x, y ∈ Z, is

d_Φ(y, x) = Φ(y) − Φ(x) − ∇Φ(x)^T (y − x).

We also denote f_x(y) = d_Φ(y, x). Throughout the paper, for a set of centers X in R^d and a point a ∈ R^d, we consider f_X(a) as a cost function based on the Bregman divergence. We define it as f_X(a) = min_{x∈X} f_x(a) = min_{x∈X} d_Φ(a, x), where d_Φ(·) is some Bregman divergence as defined above. If the points in A have weights {w_a}, then we define f_x(a) = w_a d_Φ(a, x).

The Bregman divergence d_Φ is said to be µ-similar if it satisfies the following property: there exists M ≻ 0 such that, if d_M(y, x) = (y − x)^T M (y − x) denotes the squared Mahalanobis distance for M, then for all x, y,

µ d_M(y, x) ≤ d_Φ(y, x) ≤ d_M(y, x).

Going forward, we also denote f^M_x(a) = d_M(a, x), and hence we have µ f^M_x(a) ≤ f_x(a) ≤ f^M_x(a) for all x and all a ∈ A. Due to this we say f_x(·) and f^M_x(·) are µ-similar. For Euclidean k-means clustering, M is just the identity matrix and µ = 1. It is known that a large set of Bregman divergences is µ-similar, including the KL-divergence, Itakura-Saito, Relative Entropy, Harmonic, etc. [9]. In Table 1 we list the most common µ-similar Bregman divergences together with the corresponding M and µ. In each case λ and ν refer to the minimum and maximum values over all coordinates of all points, i.e., the input is a subset of [λ, ν]^d.

Table 1: µ-similar Bregman divergences

Divergence           | µ                       | M
Squared Euclidean    | 1                       | I_d
Mahalanobis          | 1                       | N
Exponential Loss     | e^{−(ν−λ)}              | e^ν I_d
Kullback-Leibler     | λ/ν                     | (1/λ) I_d
Itakura-Saito        | λ²/ν²                   | (1/λ²) I_d
Harmonic (α > 0)     | λ^{α+2}/ν^{α+2}         | (α(1+α)/(2λ^{α+2})) I_d
Norm-Like (α > 2)    | λ^{α−2}/ν^{α−2}         | (α(α−1)ν^{α−2}/2) I_d
Hellinger Loss       | (1−ν²)^{3/2}            | (1−ν²)^{−3/2} I_d

For the Bregman divergence clustering problem, the set X, called the query set, will represent the set of all possible candidate centers. There are two types of clustering for a Bregman divergence, hard and soft clustering [3]. In this work, by the term clustering we refer only to the hard clustering problem.
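To make the µ-similarity condition concrete, the short sketch below builds the generalized KL divergence from its potential Φ(z) = Σ z_i log z_i and empirically estimates a pair (µ, M = cI_d) that sandwiches it, by looking at the ratio d_Φ(y, x)/‖y − x‖² over random pairs drawn from [λ, ν]^d. This is only an illustration of the definition, not the construction used in the paper, and the constants it prints are data-dependent estimates rather than the closed-form entries of Table 1.

```python
import numpy as np

def bregman(phi, grad_phi, y, x):
    # d_Phi(y, x) = Phi(y) - Phi(x) - <grad Phi(x), y - x>
    return phi(y) - phi(x) - np.dot(grad_phi(x), y - x)

# Generalized KL divergence: Phi(z) = sum z_i log z_i.
phi = lambda z: float(np.sum(z * np.log(z)))
grad_phi = lambda z: np.log(z) + 1.0

lam, nu, d = 0.1, 2.0, 4
rng = np.random.default_rng(1)
ratios = []
for _ in range(10000):
    x = rng.uniform(lam, nu, size=d)
    y = rng.uniform(lam, nu, size=d)
    ratios.append(bregman(phi, grad_phi, y, x) / np.sum((y - x) ** 2))

# With M = c * I_d for c = max ratio, d_Phi <= d_M on the sampled pairs,
# and mu ~= (min ratio) / c gives mu * d_M <= d_Phi on those pairs.
c = max(ratios)
mu = min(ratios) / c
print(f"M ~ {c:.3f} * I_d, mu ~ {mu:.3f}")
```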
Coresets:
A coreset acts as a small proxy for the original data, in the sense that it can be used in place of the original data for a given optimization problem in order to obtain a provably accurate approximate solution. Formally, for a non-negative cost function f_X(a) with query X and data point a ∈ A, a set of subsampled and appropriately reweighted points C is a coreset if, for all X, |Σ_{a∈A} f_X(a) − Σ_{ã∈C} f_X(ã)| ≤ ε Σ_{a∈A} f_X(a) for some ε > 0. While coresets are typically defined for relative errors, additive error coresets can also be defined similarly. For ε, γ > 0, C is an additive (ε, γ) coreset of A if C contains reweighted points from A and, for all X, |Σ_{a∈A} f_X(a) − Σ_{ã∈C} f_X(ã)| ≤ ε Σ_{a∈A} f_X(a) + γ. The coresets presented here satisfy such additive guarantees.

For a dataset A, a query space X that denotes the candidate solutions to an optimization problem, and a cost function f_X(·), [10] define sensitivity scores that capture the relative importance of each point for the problem and can be used to construct a probability distribution. The coreset is then created by sampling points according to this distribution. The sensitivity of a point a is defined as s_a = sup_{X∈X} f_X(a) / Σ_{a'∈A} f_X(a').
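The sampling step itself is simple once upper bounds on the sensitivities are available. The sketch below shows the generic importance-sampling construction: given scores that upper bound the true sensitivities, it draws r points with probability proportional to the scores and reweights them by the inverse sampling probability, so that the coreset cost is an unbiased estimate of the full cost. The helper names and the crude score used here are ours, for illustration only.

```python
import numpy as np

def sensitivity_sample(A, scores, r, rng):
    """Importance-sample r points of A with probability proportional to `scores`
    and return (coreset, weights) with inverse-probability weights."""
    p = np.asarray(scores, dtype=float)
    p /= p.sum()
    idx = rng.choice(len(A), size=r, replace=True, p=p)
    weights = 1.0 / (r * p[idx])   # unbiased: E[sum_i w_i f(c_i)] = sum_a f(a)
    return A[idx], weights

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 2))
# A crude score: a uniform term plus squared distance to the data mean,
# an upper-bound proxy in the spirit of the lightweight scores defined next.
dist2 = np.sum((A - A.mean(axis=0)) ** 2, axis=1)
scores = 0.5 / len(A) + 0.5 * dist2 / dist2.sum()
C, w = sensitivity_sample(A, scores, r=100, rng=rng)
print(C.shape, w[:3])
```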
Lightweight Coresets: Lightweight coresets were introduced by [1] for clustering based on µ-similar Bregman divergences. These coresets give an additive error guarantee and are built using the sensitivity framework. For a dataset A ∈ R^{n×d} and a cost function f_X(·) for some X ∈ R^{k×d}, the sensitivity of a point a is s_a = sup_X f_X(a) / (f_X(A) + f_ϕ(A)). Here ϕ is the mean point of the entire dataset A, i.e., ϕ = Σ_{i≤n} a_i / n, and f_ϕ(A) = Σ_{a∈A} f_ϕ(a).

In this work, we define C to be an (ε, γ)-additive error non-parametric coreset if its size is independent of k (the number of centres) and it ensures |f_X(C) − f_X(A)| ≤ ε f_X(A) + γ for any query X ∈ R^{k×d}, for all integers k ∈ [n].

We use the following theorems in this paper.
Theorem 2.1 (Bernstein's inequality [11]). Let the scalar random variables x_1, x_2, ..., x_n be independent and satisfy |x_i − E[x_i]| ≤ b for all i ∈ [n]. Let X = Σ_i x_i and let σ² = Σ_i σ_i² be the variance of X. Then for any t > 0,

Pr( X > E[X] + t ) ≤ exp( −t² / (2σ² + 2bt/3) ).
Theorem 2.2 ([12]). Let A be the dataset, let X be the query space of dimension D, and for X ∈ X let f_X(·) be the cost function. Let s_j be the sensitivity of the j-th row of A, and let the sum of sensitivities be S. Let (ε, δ) ∈ (0, 1). Let r be such that

r ≥ O( (S/ε²) ( D log(1/ε) + log(1/δ) ) ).

Let C be a matrix of r rows, each sampled i.i.d. from A such that each ã_i ∈ C is chosen to be a_j, with weight S/(r s_j), with probability s_j/S, for j ∈ [n]. Then C is an ε-coreset of A for the function f(·), with probability at least 1 − δ.

We use the above theorem to bound the coreset size. Note that the theorem considers a multinomial sample, where a point ã_i in the coreset C equals a_j and gets weight S/(r s_j) with probability s_j/S. In our approach, instead, we get ã_i as a_i with weight 1/min{1, r s_i} with probability min{r s_i, 1}, or it is ∅ with weight 0 with probability 1 − min{r s_i, 1}. However, the same theorem still applies.

Related Work: The term coreset was first introduced in [13] and there has been a significant amount of work on coresets since then. Interested readers can look at [14, 15] and the references therein. Using sensitivities to construct coresets was introduced in [10] and further generalized by [6]. Coresets for clustering problems such as k-means clustering have been extensively studied [1, 16, 17, 18, 19, 20, 21, 22]. In [17] the authors reduce the k-means problem to a constrained low rank approximation problem. They show that a constant factor approximation can be achieved by a coreset of size only O(ε^{-2} log k), and for a (1 ± ε) relative error approximation they give a coreset of size O(k ε^{-2}). In [19, 20] the authors discuss deterministic algorithms for creating coresets for the clustering problem which ensure a relative error approximation. The streaming version of [19] returns a coreset whose size is polynomial in k, 1/ε and log n and which ensures a (1 ± ε log n) relative error approximation. Feldman et al. [20] reduce the problem of k-means clustering to ℓ₂ frequent item approximation; the streaming version of their algorithm returns a coreset of size O(k ε^{-2} log n). In [21] the authors give an algorithm which returns a one-shot coreset for all ℓ_p Euclidean distance k-clustering problems, where p ∈ [1, p_max]. Their algorithm creates a grid over the range [1, p_max] and builds the coreset based on the sensitivities at each grid point. It returns a coreset of size Õ(16^{p_max} dk) and takes Õ(ndk) time, ensuring a (1 ± ε) relative error approximation. In a slightly different line, [23] gives a deterministic algorithm for feature selection in the k-means problem. In [1], the authors give an algorithm that takes only O(nd) time and returns a coreset of size O(dk ε^{-2} log k) at a
cost of a small additive error approximation. Their algorithm can further be extended to clustering based on Bregman divergences which are µ-similar to a squared Mahalanobis distance. In [24] the authors give algorithms to create such coresets for both hard and soft clustering based on µ-similar Bregman divergences.

There are several online algorithms for k-means clustering [25, 26, 27]. In [25], the authors give an online algorithm that maintains a set of centers such that the k-means cost on these centres is Õ(W*), where W* is the optimal k-means cost. [26] improves this result and gives a robust algorithm which can also handle outliers in the dataset.

For our analysis we use Theorem 3.2 of [12], where the authors show that a coreset built using the sensitivity framework has a sampling complexity that depends only on O(S) instead of O(S log S) as in [18], but with an additional factor of log(1/ε). Due to this, our coreset size for clustering based on a µ-similar Bregman divergence has a milder dependence on 1/µ than in [1, 24].

Here we state our first algorithm,
ParametricFilter, which creates a coreset in an online manner for clustering based on a Bregman divergence, i.e., for the i-th incoming point we take the sampling decision without looking at the (i+1)-th point. The algorithm starts with knowledge of the Bregman divergence d_Φ(·). It is important to note that for a fixed d_Φ, as A changes, both M and µ also change [9, 24] (Table 1). Fortunately, updating the Mahalanobis matrix requires maintaining only two simple statistics of the data. On arrival of the input point a_i, the algorithm first updates the Mahalanobis matrix M_i as well as µ_i, and then uses them to compute the upper bound on the sensitivity score. This score is then used to decide whether a_i should be stored in the coreset. If selected, the point a_i is stored with an appropriate weight ω_i.

Algorithm 1 ParametricFilter
Require: Streaming points a_i, i = 1, 2, ..., n; r > 0
Ensure: (Coreset C, Weights Ω)
  C = Ω = ∅; ϕ_0 = ∅; S = 0
  λ = ‖a_1‖_min; ν = ‖a_1‖_max
  while i ≤ n do
    λ = min{λ, ‖a_i‖_min}; ν = max{ν, ‖a_i‖_max}
    Update M_i; µ_i = λ/ν
    ϕ_i = ((i−1)ϕ_{i−1} + a_i)/i; S = S + f^{M_i}_{ϕ_i}(a_i)
    if i = 1 then
      p_i = 1
    else
      l_i = 2 f^{M_i}_{ϕ_i}(a_i)/(µ_i S) + 8/(µ_i (i−1)); p_i = min{1, r l_i}
    end if
    Set (c_i, ω_i) = (a_i, 1/p_i) with probability p_i, and (∅, 0) otherwise
    (C_i, Ω_i) = (C_{i−1}, Ω_{i−1}) ∪ (c_i, ω_i)
  end while
  Return (C, Ω)

Notice that as working space the algorithm only needs to maintain the current mean ϕ_i, the diagonal matrix M_i, and the values S, λ and ν. For the case when d_Φ is a Mahalanobis distance, we assume that M and µ are known a priori and the algorithm uses them for all points a_i; hence, in the case of the Mahalanobis distance, the update time and the working space are both O(d²). For all other divergences in Table 1, however, the matrix M is diagonal and hence both the update time and the working space are only O(d).
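For the squared Euclidean case (M_i = I_d, µ_i = 1; see Corollary 4.1 below), the filter reduces to a few lines of streaming code. The sketch below is our illustrative translation of Algorithm 1 for that special case; it keeps only the running mean ϕ_i and the running sum S of squared distances to the mean, and r is the oversampling parameter of the algorithm.

```python
import numpy as np

def parametric_filter_kmeans(stream, r):
    """Online additive-error coreset for k-means in the spirit of Algorithm 1
    (squared Euclidean special case: M_i = I_d, mu_i = 1)."""
    rng = np.random.default_rng(0)
    coreset, weights = [], []
    phi, S = None, 0.0                       # running mean and running sum of f_{phi_i}(a_i)
    for i, a in enumerate(stream, start=1):
        a = np.asarray(a, dtype=float)
        phi = a.copy() if phi is None else ((i - 1) * phi + a) / i
        fi = float(np.sum((a - phi) ** 2))   # f_{phi_i}(a_i)
        S += fi
        if i == 1 or S == 0.0:
            p = 1.0
        else:
            l = 2.0 * fi / S + 8.0 / (i - 1)  # sensitivity upper bound l_i
            p = min(1.0, r * l)
        if rng.random() < p:                  # keep a_i with weight 1/p_i
            coreset.append(a)
            weights.append(1.0 / p)
    return np.array(coreset), np.array(weights)

rng = np.random.default_rng(1)
A = rng.normal(size=(5000, 3)) + rng.integers(0, 5, size=(5000, 1))  # loosely clustered stream
C, w = parametric_filter_kmeans(A, r=20)
print(len(C), "points kept out of", len(A))
```

Solving k-means on (C, w) then requires a solver that accepts per-point weights; for example, scikit-learn's KMeans can take them through the sample_weight argument of fit.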
Let A_i be the dataset formed by the first i data points. Algorithm 1 (ParametricFilter) updates M_i and µ_i for A_i. By a careful analysis, we show that even when these are updated online, we achieve a one-pass online algorithm that creates an additive error coreset.

Before stating the main results, we give some intuition for why updating the Mahalanobis matrix works. Notice that, for every incoming point, ParametricFilter maintains a positive definite matrix M_i, a range [λ, ν] and the mean of A_i, ϕ_i. Here λ is the smallest absolute value in A_i, i.e., λ = ‖A_i‖_min, and ν is the highest absolute value
in A_i, i.e., ν = ‖A_i‖_max. With this λ and ν the algorithm computes M_i and µ_i as per Table 1. Hence we have µ_i f^{M_i}_X(a_j) ≤ f_X(a_j) ≤ f^{M_i}_X(a_j) for all X and all a_j ∈ A_i.

We note the following useful observation, which is immediate from the formulas for the matrix M and the scalar µ in Table 1. The lemma applies to all Bregman divergences in the table except the Mahalanobis distance, and we use it to show the algorithm's correctness.
Lemma 4.1. For all Bregman divergences in Table 1, for j ≤ i, µ_j ≥ µ_i and M_j ⪯ M_i.

Proof. At the i-th point we have λ = ‖A_i‖_min and ν = ‖A_i‖_max, i.e., the smallest and largest absolute values in A_i. Further, we have ‖A_j‖_min ≥ ‖A_i‖_min and ‖A_j‖_max ≤ ‖A_i‖_max for j ≤ i. Using the formulas for M for all Bregman divergences given in Table 1, we get that M_j ⪯ M_i and µ_j ≥ µ_i always hold for j ≤ i.

By a careful analysis in the following lemma, we show that the scores l_i defined in ParametricFilter upper bound the lightweight sensitivity scores of a_i with respect to A_{i−1}, and that the sum of the l_i's is bounded.
Lemma 4.2. For points arriving in a streaming manner, for all i ∈ [n], the l_i defined in ParametricFilter upper bounds the lightweight sensitivity score

sup_{X∈X} f_X(a_i) / ( f_X(A_{i−1}) + f_{ϕ_i}(A_i) ).    (2)

Furthermore, Σ_{i≤n} l_i ≤ ( 8 log n + 4 log(f^M_ϕ(A)) − 4 log(f^M_ϕ(a_1)) )/µ.

Proof. At A_i, let (µ_i, M_i) be such that µ_i f^{M_i}_x(a_j) ≤ f_x(a_j) ≤ f^{M_i}_x(a_j) and ϕ_i = Σ_{j≤i} a_j / i. For any X ∈ R^{k×d}, each point a_j ∈ A_{i−1} has some closest point x_l ∈ X. Hence, for such a pair {a_j, x_l} we have f^{M_i}_{x_l}(ϕ_i) ≤ 2 f^{M_i}_{x_l}(a_j) + 2 f^{M_i}_{ϕ_i}(a_j). So (i−1) f^{M_i}_X(ϕ_i) ≤ Σ_{a_j∈A_{i−1}} 2 ( f^{M_i}_X(a_j) + f^{M_i}_{ϕ_i}(a_j) ) = 2 f^{M_i}_X(A_{i−1}) + 2 f^{M_i}_{ϕ_i}(A_{i−1}). We use this (approximate) triangle inequality in the following analysis, which holds for all X ∈ R^{k×d}:

f_X(a_i) / ( f_X(A_{i−1}) + f_{ϕ_i}(A_i) )
 (i) ≤ f^{M_i}_X(a_i) / ( f_X(A_{i−1}) + f_{ϕ_i}(A_i) )
 ≤ ( 2 f^{M_i}_{ϕ_i}(a_i) + 2 f^{M_i}_X(ϕ_i) ) / ( f_X(A_{i−1}) + f_{ϕ_i}(A_i) )
 ≤ ( 2 f^{M_i}_{ϕ_i}(a_i) + (4/(i−1)) f^{M_i}_{ϕ_i}(A_{i−1}) + (4/(i−1)) f^{M_i}_X(A_{i−1}) ) / ( f_X(A_{i−1}) + f_{ϕ_i}(A_i) )
 (ii) ≤ ( 2 f^{M_i}_{ϕ_i}(a_i) + (4/(i−1)) f^{M_i}_{ϕ_i}(A_{i−1}) + (4/(i−1)) f^{M_i}_X(A_{i−1}) ) / ( µ_i ( f^{M_i}_X(A_{i−1}) + f^{M_i}_{ϕ_i}(A_i) ) )
 ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i f^{M_i}_{ϕ_i}(A_i) ) + (4/(i−1)) f^{M_i}_{ϕ_i}(A_{i−1}) / ( µ_i f^{M_i}_{ϕ_i}(A_i) ) + 4/( µ_i (i−1) )
 (iii) ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i f^{M_i}_{ϕ_i}(A_i) ) + 8/( µ_i (i−1) )
 ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 8/( µ_i (i−1) ).

Inequality (i) follows from µ_i-similarity, i.e., f_X(a_i) ≤ f^{M_i}_X(a_i). The next two inequalities follow by applying the triangle inequality above to the numerator. In inequality (ii) we use the µ_i-similarity lower bound on the denominator. We reach inequality (iii) by upper bounding the second and the third terms by 4/(µ_i(i−1)). In the final inequality we use Lemma 4.1, from which we have M_j ⪯ M_i for j ≤ i. Further, by the property of the Bregman divergence, we know that
ϕ_{i−1} = arg min_x f_x(A_{i−1}), so we have f_{ϕ_i}(A_i) = f_{ϕ_i}(A_{i−1}) + f_{ϕ_i}(a_i) ≥ f_{ϕ_{i−1}}(A_{i−1}) + f_{ϕ_i}(a_i) ≥ Σ_{j≤i} f_{ϕ_j}(a_j). Hence we have f^{M_i}_{ϕ_i}(A_i) ≥ Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j).

Next, in order to upper bound Σ_{i≤n} l_i, consider the denominator term of l_i as follows:

Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) = Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j) + f^{M_i}_{ϕ_i}(a_i)
 = Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j) ( 1 + f^{M_i}_{ϕ_i}(a_i)/Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j) )
 ≥ Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j) ( 1 + f^{M_i}_{ϕ_i}(a_i)/Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) )
 = Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j) ( 1 + q_i )
 (i) ≥ exp(q_i/2) Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j),

that is, exp(q_i/2) ≤ Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) / Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j), where for inequality (i) we used that q_i = f^{M_i}_{ϕ_i}(a_i)/Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ≤ 1, and hence (1 + q_i) ≥ exp(q_i/2). Now, since Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ≥ Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j), the following product telescopes and we get

Π_{2≤i≤n} exp(q_i/2) ≤ Σ_{j≤n} f^{M_j}_{ϕ_j}(a_j) / f^M_ϕ(a_1).

So, taking logarithms on both sides, we get Σ_{2≤i≤n} q_i ≤ 2 log(f^M_ϕ(A)) − 2 log(f^M_ϕ(a_1)). Further, incorporating the 8/(µ_i(i−1)) terms, we have l_i = 2q_i/µ_i + 8/(µ_i(i−1)). Hence Σ_{i≤n} l_i ≤ µ^{−1} ( 8 log n + 4 log(f^M_ϕ(A)) − 4 log(f^M_ϕ(a_1)) ), where µ = µ_n ≤ µ_i and M = M_n ⪰ M_i for all i ≤ n.

Note that the upper bounds and the sum are independent of k, i.e., the number of clusters one expects in the data. The following lemma claims that by sampling enough points based on l_i one can ensure the additive error coreset property with high probability.
Lemma 4.3. For clustering with k centers in R^d, with r = O( dk log(1/ε)/ε² ) in ParametricFilter, the returned coreset C satisfies the guarantee (1) at i = n, for all X ∈ R^{k×d}, with constant probability.

Proof. For some fixed (query) X ∈ R^{k×d} consider the following random variable:

w_i = (1/p_i − 1) f_X(a_i) with probability p_i, and −f_X(a_i) with probability 1 − p_i.

Note that E[w_i] = 0, and with p_i = 1 we get |w_i| = 0. The algorithm uses the sampling probability p_i = min{r l_i, 1}. Now we bound the term |w_i|. In the case when p_i < 1 and a_i is sampled we have

|w_i| ≤ f_X(a_i)/p_i = f_X(a_i)/(r l_i) ≤ ( f_X(A_{i−1}) + f_{ϕ_i}(A_i) ) f_X(a_i) / ( r f_X(a_i) ) ≤ ( f_X(A) + f_ϕ(A) )/r.
Here ϕ = ϕ_n is the mean of the entire data A. Next, if the point a_i is not sampled then we know for sure that p_i < 1; hence, using Lemma 4.2, we have

1 > r l_i ≥ r f_X(a_i)/( f_X(A_{i−1}) + f_{ϕ_i}(A_i) ), and thus f_X(a_i) ≤ ( f_X(A) + f_ϕ(A) )/r.

So we have |w_i| ≤ b = ( f_X(A) + f_ϕ(A) )/r. Next we bound var(Σ_{i≤n} w_i) = Σ_{i≤n} E[w_i²]. A single term E[w_i²] for p_i < 1 is

E[w_i²] = ( p_i (1/p_i − 1)² + (1 − p_i) ) f_X(a_i)² ≤ f_X(a_i)²/p_i = f_X(a_i)²/(r l_i) ≤ f_X(a_i) ( f_X(A) + f_ϕ(A) )/r.

So we get

var( Σ_{i≤n} w_i ) = Σ_{i≤n} E[w_i²] ≤ Σ_{i≤n} f_X(a_i) ( f_X(A) + f_ϕ(A) )/r ≤ ( f_X(A) + f_ϕ(A) )²/r.

Now, applying Bernstein's inequality (Theorem 2.1) to Σ_{i≤n} w_i with t = ε ( f_X(A) + f_ϕ(A) ), we bound the probability P = Pr( |f_X(A) − f_X(C)| ≥ ε ( f_X(A) + f_ϕ(A) ) ) as follows:

P ≤ exp( −ε² ( f_X(A) + f_ϕ(A) )² / ( 2( f_X(A) + f_ϕ(A) )²/r + 2ε( f_X(A) + f_ϕ(A) )²/(3r) ) ) = exp( −rε²/(2 + 2ε/3) ).

So, to get the above event with constant probability it is enough to set r = Θ(1/ε²). Note that the above is guaranteed for a fixed X ∈ R^{k×d}.

Now we show that the coreset C can be made a strong coreset by taking a union bound over a set of queries. To ensure the guarantee in Lemma 4.3 for all X ∈ R^{k×d}, we take a union bound over an (ε/3)-net of R^{k×d} [15, 24]. Such a net has at most O(ε^{−dk}) queries. To ensure a strong coreset guarantee it is enough to set r = Θ( dk log(1/ε)/ε² ).

Utilizing Lemmas 4.2 and 4.3, where we take a union bound over the ε-net of the query space, we get the following theorem.
Theorem 4.1. For points coming in streaming fashion, ParametricFilter returns a coreset C for clustering based on a Bregman divergence such that, for all X ∈ R^{k×d}, C ensures the guarantee (1) with constant probability. Such a coreset has an expected sample size of O( (dk log(1/ε))/(µε²) ( log n + log(f^M_ϕ(A)) − log(f^M_ϕ(a_1)) ) ). ParametricFilter takes O(d) update time and uses O(d) working space.

The expected sample size of the coreset C returned by ParametricFilter is bounded by r Σ_{i≤n} l_i. Using Lemma 4.2 and Lemma 4.3 to set the values of r and Σ_{i≤n} l_i, we obtain an expected sample size of O( (dk log(1/ε))/(µε²) ( log n + log(f^M_ϕ(A)) − log(f^M_ϕ(a_1)) ) ). Further, using µ-similarity, one can rewrite the expected sample size as O( (dk log(1/ε))/(µ²ε²) ( log n + log(f_ϕ(A)) − log(f_ϕ(a_1)) ) ).

The algorithm requires a working space of O(d), which is needed to maintain the mean (centre) ϕ_i, µ_i and M_i. Further, for every incoming point, ParametricFilter only needs to compute the distance between the point and the current mean; hence the running time of the entire algorithm is O(nd), which is why it is easy to scale for large n. Note that although Theorem 4.1 gives the guarantee of ParametricFilter at the last instance, using the same analysis technique one can ensure an equivalent guarantee at every i-th instance by taking a union bound. This costs a factor of log(n) in the sample size, i.e., ensuring the guarantee (1) by C_i for A_i for all i ∈ [n].

Note that ParametricFilter returns a smaller coreset C compared to the offline coresets of [21, 24], but at the cost of an additive factor in the approximation that depends on the structure of the data. Further, unlike [1, 24], our sampling complexity has a smaller dependence on 1/µ. ParametricFilter can easily be generalized to create coresets for weighted clustering, where each point a_i has some weight w_{a_i} such that f_X(a_i) = w_{a_i} min_{x∈X} d_Φ(a_i, x). While sampling a point a_i, the algorithm ParametricFilter sets c_i = a_i and ω_i = w_{a_i}/p_i with probability p_i.

k-means Clustering

When the divergence d_Φ(·) is the squared Euclidean distance, the problem is k-means clustering. Here, for all i ∈ [n] we have M_i = I_d and µ_i = 1, so ParametricFilter does not need to maintain M_i and µ_i. In the following corollary we state the guarantee of ParametricFilter for k-means clustering.
Corollary 4.1. Let A ∈ R^{n×d} whose points arrive in a streaming manner and are fed to ParametricFilter. It returns a coreset C which ensures the guarantee in equation (1) for all X ∈ R^{k×d} with constant probability. Such a coreset has O( (dk log(1/ε))/ε² ( log n + log(f_ϕ(A)) − log(f_ϕ(a_1)) ) ) expected samples. The update time of ParametricFilter is O(d) and it uses O(d) working space.

Proof. The proof follows by combining Lemma 4.2 and Lemma 4.3. As k-means clustering has M_i = I_d and µ_i = 1 for all i ≤ n, we have

l_i = 2 f_{ϕ_i}(a_i) / Σ_{j≤i} f_{ϕ_j}(a_j) + 8/(i−1).

This can be verified by an analysis similar to the proof of Lemma 4.2. The proofs of the second part of Lemma 4.2 and of Lemma 4.3 go through unchanged. Further, note that k-means is a hard clustering problem, hence the ε-net has size ε^{−O(dk)}. Hence the expected size of C returned by ParametricFilter is O( (dk log(1/ε))/ε² ( log n + log(f_ϕ(A)) − log(f_ϕ(a_1)) ) ). At each point the update time is O(d) and the working space is O(d).

The above coreset size is a function of both k, the number of clusters, and d, the dimension of the data points. As discussed before, such a dependence restricts the use of such a coreset to settings where a realistic upper bound on k is known and supplied as input to the coreset construction. Here we explore the possibility of a non-parametric coreset. A coreset is called non-parametric if its size is independent of k and it can ensure the desired guarantee for any X with at most n centres.

First we state a simple yet important impossibility result: it is not possible to get a non-parametric coreset which ensures a relative error approximation for any clustering problem. Formally, we state it in the following theorem.
Theorem 5.1. There exists a set A of n points in R^d such that there is no C ⊂ A with |C| = o(n), with C independent of k, for which, for some ε > 0 and for all i ∈ [n], the following holds with constant probability:

| f_{X̃_i}(A) − f_{X̂_i}(A) | ≤ ε f_{X̃_i}(A),

where X̃_i and X̂_i are the optimal i centres of A and of C respectively.

Proof. Consider A such that it has k natural clusters, i.e., all points are highly concentrated in a small radius around their corresponding centres and the distances between the centres are significantly large. For a non-parametric coreset C we expect that, for all X with at most n centres, it ensures

| f_X(A) − f_X(C) | ≤ ε f_X(A).
Note, though, that in this case, if |C| < k, then for all i ≥ k the optimal centres X̂_i of the coreset will be C itself. Further, for a coreset of size less than k, it is not possible for f_{X̂_i}(A) to ensure a relative approximation to the corresponding optimal cost f_{X̃_i}(A).

Even for a small additive error guarantee (similar to the guarantee in Theorem 4.1) it is not clear how to get a non-parametric coreset, due to the union bound over the ε-net of the query space. Naively, to capture the non-parametric nature, the query space has to be a function of n, which due to the union bound would be reflected in the coreset size.

Having established the challenge of a non-parametric coreset, we next present a technique that effectively serves as an existential proof that additive error non-parametric coresets exist. This algorithm uses an oracle that returns upper and lower barrier sensitivity values when queried. We first show that if such an oracle exists, then we can guarantee the existence of an additive error non-parametric coreset whose size is independent of d and k, and depends only on log n and the structure of the data. The guarantees given by the coreset hold for every number of centres k ∈ [n]. As of now, without any assumption, implementing the oracle efficiently remains an open question. Under a certain assumption we give an algorithm that returns a non-parametric coreset; we run this method and present it in the experiment section.

To show the existence of a non-parametric coreset we combine the sensitivity framework, similar to lightweight coresets, with the barrier functions technique from [5, 7] in order to decide the sampling probability of each point. We are able to show that the coresets are non-parametric in nature due to the following points.

• The algorithm uses upper bounds on the sensitivity scores. In expectation, these upper bounds are independent of k, i.e., the number of centers.
• Sampling based on the upper bounds returns a strong coreset, i.e., the coreset ensures the guarantee (1) for all X ∈ R^{k×d}.
• As the expected sampling complexity is independent of both k and d, we can further take a union bound over k ∈ [n]. This union bound ensures that the coreset is non-parametric in nature and that the guarantee (1) holds for all X with at most n centres in R^d.

Unlike [10], in order to show a strong coreset guarantee we do not need to utilize VC-dimension based arguments.

Here we give a theoretical analysis of our existential result using
DeterministicFilter. It uses an oracle to create a non-parametric coreset for clustering based on any Bregman divergence in Table 1. The coreset is constructed via importance sampling, for which we follow the sensitivity based framework along with barrier functions, similar to [7, 5]. Let A_{i−1} and ϕ_i be as defined in the previous section. Let C_{i−1} be the coreset that the algorithm has maintained so far, and let X be a set with infinitely many elements, where every element is some X ∈ R^{k×d}, for all k ≤ n. Similar to [5, 7], we define sensitivity scores using an upper barrier function (1 + ε) f_X(A_{i−1}) and a lower barrier function (1 − ε) f_X(A_{i−1}). We informally call these the upper barrier and lower barrier sensitivity scores. At step i, the upper barrier sensitivity l_ui and the lower barrier sensitivity l_li are defined as follows:

l_ui = sup_{X∈X} f_X(a_i) / ( (1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) )    (3)

l_li = sup_{X∈X} f_X(a_i) / ( f_X(C_{i−1}) − (1 − ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) )    (4)

In Algorithm 2, for each a_i, the sampling probability p_i depends on the scores l_ui and l_li. We assume that the algorithm gets upper bounds on these scores from some oracle. The algorithm samples each point with respect to the upper bounds on the upper barrier sensitivity score (3) and the lower barrier sensitivity score (4). Note that in the above sensitivity scores the query X acts as centers for both A_{i−1} and C_{i−1}. The algorithm maintains a coreset C_i which ensures a deterministic guarantee as in equation (1), for all i ∈ [n]. We state our guarantee in the following theorem.
Theorem 5.2. Let A ∈ R^{n×d}. For every Bregman divergence d_Φ in Table 1 there exists a coreset C for clustering based on d_Φ such that the following is ensured for all X with at most n centres in R^d:

| f_X(C) − f_X(A) | ≤ ε ( f_X(A) + f_ϕ(A) )    (5)
Algorithm 2 DeterministicFilter
Require: Input points a_i, i = 1, ..., n; t > 0; ε ∈ (0, 1)
Ensure: (Coreset C, Weights Ω)
  c_u = 2/ε + 1; c_l = 2/ε − 1
  ϕ_0 = ∅; S = 0; C = Ω = ∅
  λ = ‖a_1‖_min; ν = ‖a_1‖_max
  while i ≤ n do
    λ = min{λ, ‖a_i‖_min}; ν = max{ν, ‖a_i‖_max}
    Update M_i; µ_i = λ/ν
    ϕ_i = ((i−1)ϕ_{i−1} + a_i)/i; S = S + f^{M_i}_{ϕ_i}(a_i)
    if i = 1 then
      p_i = 1
    else
      l_ui = sup_X f_X(a_i) / ( (1+ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) )
      l_li = sup_X f_X(a_i) / ( f_X(C_{i−1}) − (1−ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) )
      p_i = min{1, c_u l_ui + c_l l_li}
    end if
    Set (c_i, ω_i) = (a_i, 1/p_i) with probability p_i, and (∅, 0) otherwise
    (C_i, Ω_i) = (C_{i−1}, Ω_{i−1}) ∪ (c_i, ω_i)
  end while
  Return (C, Ω)

Such a coreset has O( (log n)/(µε²) ( log n + log(f^M_ϕ(A)) − log(f^M_ϕ(a_1)) ) ) expected samples.

We prove the above theorem with the following supporting lemmas. We first show that, for each point a_i, if l_ui and l_li upper bound the sensitivity scores (3) and (4) respectively, then the coreset returned by DeterministicFilter ensures the guarantee (1).
Lemma 5.1. Suppose the scores l_ui and l_li received from the oracle in DeterministicFilter upper bound the scores (3) and (4) respectively, for all i ∈ [n], with respect to C_{i−1}. DeterministicFilter computes the sampling probability for the i-th point as p_i = min{l̃_i, 1}, where l̃_i = c_u l_ui + c_l l_li. Then statement (1) is true for every i ∈ [n], for all X with at most n centres in R^d.

Proof. We show this by induction. The proof applies for all X with at most n centres in R^d. For i = 1 the claim is trivially true, as we have p_1 = 1, so c_1 = a_1 and hence

(1 − ε) f_X(a_1) ≤ f_X(c_1) ≤ (1 + ε) f_X(a_1).    (6)

Suppose that at step i − 1 the coreset C_{i−1} ensures

(1 − ε) f_X(A_{i−1}) − ε f_{ϕ_{i−1}}(A_{i−1}) ≤ f_X(C_{i−1}) ≤ (1 + ε) f_X(A_{i−1}) + ε f_{ϕ_{i−1}}(A_{i−1}).    (7)

We now show the claim inductively for C_i. The sampling probability is p_i = min{1, c_u l_ui + c_l l_li}. If p_i = 1 then the following is true:

(1 − ε) f_X(A_{i−1}) − ε f_{ϕ_{i−1}}(A_{i−1}) ≤ f_X(C_{i−1}) ≤ (1 + ε) f_X(A_{i−1}) + ε f_{ϕ_{i−1}}(A_{i−1})
(1 − ε) f_X(A_{i−1}) − ε f_{ϕ_{i−1}}(A_{i−1}) + f_X(a_i) ≤ f_X(C_{i−1}) + f_X(a_i) ≤ (1 + ε) f_X(A_{i−1}) + ε f_{ϕ_{i−1}}(A_{i−1}) + f_X(a_i)
(1 − ε) f_X(A_i) − ε f_{ϕ_i}(A_i) ≤ f_X(C_i) ≤ (1 + ε) f_X(A_i) + ε f_{ϕ_i}(A_i).

Note that f_{ϕ_i}(A_i) ≥ f_{ϕ_{i−1}}(A_{i−1}). Now, if p_i < 1, then for the upper barrier we use the definition of l_ui:

p_i ≥ l_ui
p_i ≥ f_X(a_i) / ( (1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) )
(1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) ≥ f_X(a_i)/p_i
(1 + ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) ≥ f_X(C_{i−1}) + f_X(a_i)/p_i
(1 + ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) ≥ f_X(C_i)
(1 + ε) f_X(A_i) + ε f_{ϕ_i}(A_i) ≥ f_X(C_i).

Note that if a_i is not sampled, the right-hand side above only becomes smaller. The above analysis shows that the upper barrier claim in the lemma holds for the i-th point as well. Next, for the lower barrier we use the definition of l_li:

1 > l_li ≥ f_X(a_i) / ( f_X(C_{i−1}) − (1 − ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) )
f_X(C_{i−1}) − (1 − ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) ≥ f_X(a_i)
f_X(C_{i−1}) ≥ (1 − ε) f_X(A_{i−1}) − ε f_{ϕ_i}(A_i) + f_X(a_i)
f_X(C_{i−1}) ≥ (1 − ε) f_X(A_i) − ε f_{ϕ_i}(A_i)
f_X(C_i) ≥ (1 − ε) f_X(A_i) − ε f_{ϕ_i}(A_i).

Note that the above is for the case when the point a_i is not sampled into the coreset; if a_i is sampled, then the left-hand side only becomes bigger. Hence the above analysis shows that the lower barrier claim in the lemma holds for the i-th point as well.

The above lemma ensures a deterministic guarantee. It is important to note that, due to the barrier function based sampling, the guarantee holds for all X, and we do not require knowledge of the pseudo-dimension of the query space. This is similar to the deterministic spectral sparsification claim of [7]. Now we discuss a supporting lemma which we use to bound the expected sample size.
Lemma 5.2. Given scalars q, r, s, u, v and w, where q, r, s and w are positive, define a random variable t as

t = q − u·r with probability p, and t = q − v·r with probability 1 − p.

Then, if r/(q + w) = 1, we get

E[ s/(t + w) − s/(q + w) ] = ( (pu + (1 − p)v − uv) / ((1 − u)(1 − v)) ) · s/(q + w).

Proof. The proof is fairly straightforward. Using simple algebra (similar to [28]) we have

1/(q + w − ur) = 1/(q + w) + u / ((1 − u)(q + w)),
1/(q + w − vr) = 1/(q + w) + v / ((1 − v)(q + w)).

So we get

E[ s/(t + w) − s/(q + w) ] = ( (pu + (1 − p)v − uv) / ((1 − u)(1 − v)) ) · s/(q + w).

Let C_{i−1} be the coreset at point i − 1. Let π_{i−1} denote the sampling/no-sampling choices that DeterministicFilter made while creating C_{i−1}. Let the upper and lower barrier sensitivity scores be l_ui and l_li respectively; they depend on the coreset maintained so far.

E_{π_{i−1}}[l_ui] = E_{π_{i−1}}[ sup_X f_X(a_i) / ( (1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) ) ]    (8)

E_{π_{i−1}}[l_li] = E_{π_{i−1}}[ sup_X f_X(a_i) / ( f_X(C_{i−1}) − (1 − ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) ) ]    (9)
Lemma 5.3. For all X ∈ R^{k×d} and for all i ∈ [n], we have

E_{π_{i−1}}[l_ui] ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 12/( µ_i ε (i−1) ),
E_{π_{i−1}}[l_li] ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 12/( µ_i ε (i−1) ),

where M_i and µ_i are defined for A_i and the specific Bregman divergence, such that for all x ∈ R^d and a ∈ A_i we have µ_i f^{M_i}_x(a) ≤ f_x(a) ≤ f^{M_i}_x(a).

Proof. Let A_i represent the first i points; throughout this proof X ∈ R^{k×d}. For a fixed X we define two scalars (ζ_X)^u_{i,j} and (ζ_X)^l_{i,j} as follows:

(ζ_X)^u_{i,j} = (ε/2) f_X(A_i) + (1 + ε/2) f_X(A_j),
(ζ_X)^l_{i,j} = −(ε/2) f_X(A_i) + (1 − ε/2) f_X(A_j).

So we have (ζ_X)^u_{i,i} = (1 + ε) f_X(A_i) and (ζ_X)^l_{i,i} = (1 − ε) f_X(A_i). It is clear that for j ≤ i − 1 we have (ζ_X)^u_{i−1,j} ≥ (ζ_X)^u_{j,j} and (ζ_X)^l_{i−1,j} ≤ (ζ_X)^l_{j,j}. Two further scalars (γ_X)^u_{i,j} and (γ_X)^l_{i,j} are defined as

(γ_X)^u_{i,j} = (ζ_X)^u_{i,j} − f_X(C_j),
(γ_X)^l_{i,j} = f_X(C_j) − (ζ_X)^l_{i,j}.

Note that (γ_X)^u_{i,i} = (1 + ε) f_X(A_i) − f_X(C_i) and (γ_X)^l_{i,i} = f_X(C_i) − (1 − ε) f_X(A_i). For j ≤ i − 1 we get (γ_X)^u_{i−1,j} ≥ (γ_X)^u_{j,j} and (γ_X)^l_{i−1,j} ≥ (γ_X)^l_{j,j}. Let (d_X)_{j+1} = f_X(a_{j+1})/p_{j+1}. If p_{j+1} < 1, then we have p_{j+1} ≥ c_u l^u_{j+1}, and hence the following holds for the upper barrier:

p_{j+1} ≥ c_u f_X(a_{j+1}) / ( (γ_X)^u_{j,j} + ε f_{ϕ_{j+1}}(A_{j+1}) ) ≥ c_u f_X(a_{j+1}) / ( (γ_X)^u_{i−1,j} + ε f_{ϕ_{j+1}}(A_{j+1}) ) ≥ c_u f_X(a_{j+1}) / ( (γ_X)^u_{i−1,j} + ε f_{ϕ_i}(A_i) ).

Let (h_X)^u_{j+1} = (d_X)_{j+1} / ( (γ_X)^u_{i−1,j} + ε f_{ϕ_i}(A_i) ), which by the above is bounded by 1/c_u. Similarly, for the lower barrier we have

p_{j+1} ≥ c_l f_X(a_{j+1}) / ( (γ_X)^l_{j,j} + ε f_{ϕ_{j+1}}(A_{j+1}) ) ≥ c_l f_X(a_{j+1}) / ( (γ_X)^l_{i−1,j} + ε f_{ϕ_i}(A_i) ),

and we let (h_X)^l_{j+1} = (d_X)_{j+1} / ( (γ_X)^l_{i−1,j} + ε f_{ϕ_i}(A_i) ), which is bounded by 1/c_l. Next we apply Lemma 5.2 to get upper bounds on the sensitivity scores, i.e., on

sup_X f_X(a_i) / ( (1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) )   and   sup_X f_X(a_i) / ( f_X(C_{i−1}) − (1 − ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) ).

We apply Lemma 5.2 with q = (γ_X)^u_{i−1,j}, r = (d_X)_{j+1}/(h_X)^u_{j+1}, s = f_X(a_i) and w = ε f_{ϕ_i}(A_i). Further, let u = (h_X)^u_{j+1}(1 − p_{j+1}(1 + ε/2)), v = −(h_X)^u_{j+1} p_{j+1}(1 + ε/2) and p = p_{j+1}. With this substitution we have r/(q + w) = 1 and t = (γ_X)^u_{i−1,j+1}. Further, with c_u ≥ 2/ε + 1, the right-hand side of Lemma 5.2 satisfies (pu + (1 − p)v − uv)/((1 − u)(1 − v)) ≤ 0. So we have

E_{π_{j+1}}[ f_X(a_i) / ( (γ_X)^u_{i−1,j+1} + ε f_{ϕ_i}(A_i) ) ] ≤ E_{π_j}[ f_X(a_i) / ( (γ_X)^u_{i−1,j} + ε f_{ϕ_i}(A_i) ) ].

Let (1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) = f_X(A^u_{i−1}), where f_X(A^u_{i−1}) = Σ_{j≤i−1} f_X(a^u_j). Here each term f_X(a^u_j) = (1 + ε − p_j^{−1}) f_X(a_j) if a_j is present in C_{i−1}, and f_X(a^u_j) = (1 + ε) f_X(a_j) otherwise. Now the upper barrier sensitivity score with respect to X ∈ R^{k×d} can be bounded as follows:

E_{π_{i−1}}[ f_X(a_i) / ( f_X(A^u_{i−1}) + ε f_{ϕ_i}(A_i) ) ]
 (i) ≤ E_{π_{i−1}}[ f^{M_i}_X(a_i) / ( f_X(A^u_{i−1}) + ε f_{ϕ_i}(A_i) ) ]
 (ii) ≤ E_{π_{i−1}}[ ( 2 f^{M_i}_{ϕ_i}(a_i) + (4/(i−1)) Σ_{a_j∈A_{i−1}} [ f^{M_i}_{ϕ_i}(a_j) + f^{M_i}_X(a_j) ] ) / ( (1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) ) ]
 (iii) = E_{π_{i−1}}[ ( 2 f^{M_i}_{ϕ_i}(a_i) + (4/(i−1)) f^{M_i}_{ϕ_i}(A_{i−1}) ) / ( (γ_X)^u_{i−1,i−1} + ε f_{ϕ_i}(A_i) ) ] + E_{π_{i−1}}[ (4/(i−1)) f^{M_i}_X(A_{i−1}) / ( (γ_X)^u_{i−1,i−1} + ε f_{ϕ_i}(A_i) ) ]
 (iv) ≤ E_{π_{i−2}}[ ( 2 f^{M_i}_{ϕ_i}(a_i) + (4/(i−1)) f^{M_i}_{ϕ_i}(A_{i−1}) ) / ( (γ_X)^u_{i−1,i−2} + ε f_{ϕ_i}(A_i) ) ] + E_{π_{i−2}}[ (4/(i−1)) f^{M_i}_X(A_{i−1}) / ( (γ_X)^u_{i−1,i−2} + ε f_{ϕ_i}(A_i) ) ]
 (v) ≤ ( 2 f^{M_i}_{ϕ_i}(a_i) + (4/(i−1)) f^{M_i}_{ϕ_i}(A_{i−1}) ) / ( (ε/2) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) ) + (4/(i−1)) f^{M_i}_X(A_{i−1}) / ( (ε/2) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) )
 (vi) ≤ ( 2 f^{M_i}_{ϕ_i}(a_i) + (4/(i−1)) f^{M_i}_{ϕ_i}(A_{i−1}) ) / ( µ_i ε ( 0.5 f^{M_i}_X(A_{i−1}) + f^{M_i}_{ϕ_i}(A_i) ) ) + (4/(i−1)) f^{M_i}_X(A_{i−1}) / ( µ_i ε ( 0.5 f^{M_i}_X(A_{i−1}) + f^{M_i}_{ϕ_i}(A_i) ) )
 ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε f^{M_i}_{ϕ_i}(A_i) ) + 4 f^{M_i}_{ϕ_i}(A_{i−1}) / ( µ_i ε (i−1) f^{M_i}_{ϕ_i}(A_i) ) + 8 f^{M_i}_X(A_{i−1}) / ( µ_i ε (i−1) f^{M_i}_X(A_{i−1}) )
 (vii) ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε f^{M_i}_{ϕ_i}(A_i) ) + 12/( µ_i ε (i−1) )
 ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 12/( µ_i ε (i−1) ).

Inequality (i) follows by upper bounding the Bregman divergence by the squared Mahalanobis distance. Inequality (ii) follows by applying the (approximate) triangle inequality to the numerator. Equality (iii) rewrites the denominator using the definition of (γ_X)^u_{i−1,i−1}. Inequality (iv) follows by applying the supporting Lemma 5.2. By recursively applying Lemma 5.2 we get inequality (v), which is independent of the random choices made by DeterministicFilter. Inequality (vi) uses the µ_i-similarity lower bound on the denominator. Inequality (vii) upper bounds the second and the third term. In the final inequality we use the fact that for any µ-similar Bregman divergence from [9, 24] we have M_j ⪯ M_i for j ≤ i; further, by the property of the Bregman divergence, f_{ϕ_i}(A_{i−1}) ≥ f_{ϕ_{i−1}}(A_{i−1}), and hence f^{M_i}_{ϕ_i}(A_i) ≥ Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j). Note that this upper bound holds for all X ∈ R^{k×d} and is independent of k. So we have

E_{π_{i−1}}[ f_X(a_i) / ( (1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) ) ] ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 12/( µ_i ε (i−1) ).

For the lower barrier, we first apply Lemma 5.2 with q = (γ_X)^l_{i−1,j}, r = (d_X)_{j+1}/(h_X)^l_{j+1}, s = f_X(a_i) and w = ε f_{ϕ_i}(A_i), and let u = −(h_X)^l_{j+1}(1 − p_{j+1}(1 − ε/2)), v = (h_X)^l_{j+1} p_{j+1}(1 − ε/2) and p = p_{j+1}. We get

E_{π_{j+1}}[ f_X(a_i) / ( (γ_X)^l_{i−1,j+1} + ε f_{ϕ_i}(A_i) ) ] ≤ E_{π_j}[ f_X(a_i) / ( (γ_X)^l_{i−1,j} + ε f_{ϕ_i}(A_i) ) ].

Let f_X(C_{i−1}) − (1 − ε) f_X(A_{i−1}) = f_X(A^l_{i−1}), where f_X(A^l_{i−1}) = Σ_{j≤i−1} f_X(a^l_j). Here each term f_X(a^l_j) = (p_j^{−1} − 1 + ε) f_X(a_j) if a_j is present in C_{i−1}, and f_X(a^l_j) = (−1 + ε) f_X(a_j) otherwise. Repeating the chain of inequalities (i) to (vii) above with (γ_X)^l in place of (γ_X)^u, each step holds for the same reason as in the upper barrier case, and the resulting expected upper bound is again independent of k. So we also have

E_{π_{i−1}}[ f_X(a_i) / ( f_X(C_{i−1}) − (1 − ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) ) ] ≤ 2 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 12/( µ_i ε (i−1) ).

It is worth noting that the expected upper bounds on l_ui and l_li do not use any bicriteria approximation, nor do they depend on k. In order to have these upper bounds valid for any X with at most k centres in R^d, where k ≤ n, we take a union bound over all k ∈ [n]. By Lemma 5.1 and Lemma 5.3 we claim that our coresets are non-parametric in nature. Next we bound the expected sample size.
Lemma 5.4. For the above setup, the term Σ_{i≤n} (c_u l_ui + c_l l_li) is, in expectation, O( (1/(µε²)) ( log n + log(f^M_ϕ(A)) − log(f^M_ϕ(a_1)) ) ).

Proof. In order to bound the expected sample size we first bound the expected sampling probability, i.e., E_{π_{i−1}}[p_i], for all X ∈ R^{k×d}:

E_{π_{i−1}}[p_i] ≤ c_u E_{π_{i−1}}[l_ui] + c_l E_{π_{i−1}}[l_li]
 ≤ 2 c_u f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 12 c_u/( µ_i ε (i−1) ) + 2 c_l f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 12 c_l/( µ_i ε (i−1) )
 ≤ 8 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε² Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 48/( µ_i ε² (i−1) ).

Now we bound the total expected sample size:

Σ_{1≤i≤n} E[p_i] ≤ Σ_{1≤i≤n} ( 8 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε² Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ) + 48/( µ_i ε² (i−1) ) ) ≤ 48 log n/( µ ε² ) + Σ_{1≤i≤n} 8 f^{M_i}_{ϕ_i}(a_i) / ( µ_i ε² Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ).

Let q_i = f^{M_i}_{ϕ_i}(a_i) / Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ≤ 1. In the following we bound Σ_{i≤n} q_i. Consider the term Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j):

Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) = Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j) ( 1 + f^{M_i}_{ϕ_i}(a_i)/Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j) ) ≥ Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j) ( 1 + q_i ) ≥ exp(q_i/2) Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j),

i.e., exp(q_i/2) ≤ Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) / Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j). Since Σ_{j≤i} f^{M_j}_{ϕ_j}(a_j) ≥ Σ_{j≤i−1} f^{M_j}_{ϕ_j}(a_j), the following product telescopes and we get

Π_{2≤i≤n} exp(q_i/2) ≤ Σ_{j≤n} f^{M_j}_{ϕ_j}(a_j) / f^M_ϕ(a_1) ≤ f^M_ϕ(A) / f^M_ϕ(a_1).

Taking logarithms on both sides we get Σ_{2≤i≤n} q_i ≤ 2 log(f^M_ϕ(A)) − 2 log(f^M_ϕ(a_1)). Hence

Σ_{i≤n} E[p_i] ≤ O( (1/(µ ε²)) ( log n + log(f^M_ϕ(A)) − log(f^M_ϕ(a_1)) ) ).

Here we use that for all i ∈ [n] we have µ = µ_n ≤ µ_i and M = M_n ⪰ M_i for A.

The above sum is independent of k and d. Now, to ensure a non-parametric coreset, we have to ensure this event (bounded sampling complexity) for all X with at most n centres in R^d. Hence, by taking a union bound over k ∈ [n], the expected sample size of a non-parametric coreset from DeterministicFilter, which ensures a deterministic guarantee, is O( (log n/(µ ε²)) ( log n + log(f^M_ϕ(A)) − log(f^M_ϕ(a_1)) ) ). Note that to get such a non-parametric coreset it is necessary to have an oracle which upper bounds the upper and lower barrier sensitivity scores.

Now we present an algorithmic version of DeterministicFilter under a certain assumption. Let X_j for j ∈ [n] represent the query space where each query X ∈ X_j has j centres in R^d. Now, for each point a_i and X_j, we consider the following two random variables:

r^u_{(i,j)}(X) = f_X(a_i) / ( (1 + ε) f_X(A_{i−1}) − f_X(C_{i−1}) + ε f_{ϕ_i}(A_i) )
r^l_{(i,j)}(X) = f_X(a_i) / ( f_X(C_{i−1}) − (1 − ε) f_X(A_{i−1}) + ε f_{ϕ_i}(A_i) )

Here the randomness is over X ∈ X_j. We assume that both r^u_{(i,j)} and r^l_{(i,j)} follow the bounded CDF assumption of [29]. The assumption below is stated for r^u_{(i,j)} for some appropriate query space X_j; we consider that a similar assumption also holds for r^l_{(i,j)}.
There is a pair of universal constants $K$ and $K'$ such that, for each pair $(i,j)$ with $i \in [n]$ and $\mathcal{X}_j \in \cup_l \mathcal{X}_l$, the CDF of the random variable $r^u_{(i,j)}(X)$ for $X \in \mathcal{X}_j$, denoted by $G_{(i,j)}(\cdot)$, satisfies $G_{(i,j)}(x^*/K) \leq \exp(-1/K')$, where $x^* = \min\{y \in [0,1] : G_{(i,j)}(y) = 1\}$.

We consider the above assumption to be true for all pairs $(i,j)$, where $i$ corresponds to the point $a_i$ and $j$ corresponds to the query space $\mathcal{X}_j$. The following two lemmas are similar to Lemmas 6 and 7 in [29]; we state them here for completeness. Lemma 5.5 is stated for all pairs $(i,j)$ such that $i \leq n$ and $\mathcal{X}_j \in \cup_{l \leq n} \mathcal{X}_l$.

Lemma 5.5.
Let $K, K' > 0$ be universal constants and let $\mathcal{X}_j$ be the query space as defined above, with CDF $G_{(i,j)}(\cdot)$ satisfying $G_{(i,j)}(x^*/K) \leq \exp(-1/K')$, where $x^* = \min\{y \in [0,1] : G_{(i,j)}(y) = 1\}$. Let $\mathcal{Y}_j = \{X_1, X_2, \ldots, X_m\}$ be a set of $m = |\mathcal{Y}_j|$ i.i.d. samples, each drawn from $\mathcal{X}_j$, and let $X_{m+1} \sim \mathcal{X}_j$ be an i.i.d. sample. Then
$$\mathbb{P}\Big(K \max_{X \in \mathcal{Y}_j} r^u_{(i,j)}(X) \leq r^u_{(i,j)}(X_{m+1})\Big) \leq \exp(-m/K'), \qquad
\mathbb{P}\Big(K \max_{X \in \mathcal{Y}_j} r^l_{(i,j)}(X) \leq r^l_{(i,j)}(X_{m+1})\Big) \leq \exp(-m/K').$$
Proof.
Let $X_{\max} = \arg\max_{X \in \mathcal{Y}_j} r^u_{(i,j)}(X)$. Then
\begin{align*}
\mathbb{P}\Big(K \max_{X \in \mathcal{Y}_j} r^u_{(i,j)}(X) \leq r^u_{(i,j)}(X_{m+1})\Big)
&= \int_0^{x^*} \mathbb{P}\big(K\, r^u_{(i,j)}(X_{\max}) \leq y \,\big|\, r^u_{(i,j)}(X_{m+1}) = y\big)\, d\mathbb{P}(y)\\
&\overset{(i)}{=} \int_0^{x^*} \mathbb{P}\big(r^u_{(i,j)}(X) \leq y/K\big)^m\, d\mathbb{P}(y)
\leq \int_0^{x^*} G_{(i,j)}(y/K)^m\, d\mathbb{P}(y)\\
&\overset{(ii)}{\leq} G_{(i,j)}(x^*/K)^m \int_0^{x^*} d\mathbb{P}(y)
= G_{(i,j)}(x^*/K)^m \leq \exp(-m/K').
\end{align*}
Here $(i)$ is because $\{X_1, X_2, \ldots, X_m\}$ are i.i.d. from $\mathcal{X}_j$, and $(ii)$ is due to Assumption 5.1. The statement for $r^l_{(i,j)}$ is proved similarly.

For all $j \leq n$, let there be a finite set $\mathcal{Y}_j \subset \mathcal{X}_j$, and define the empirical sensitivity scores $\tilde{l}^u_i = \max_j \tilde{l}^u_{(i,j)}$ and $\tilde{l}^l_i = \max_j \tilde{l}^l_{(i,j)}$, where $\tilde{l}^u_{(i,j)}$ and $\tilde{l}^l_{(i,j)}$ are defined as follows,
$$\tilde{l}^u_{(i,j)} = \max_{X \in \mathcal{Y}_j} \frac{f_X(a_i)}{(1+\epsilon) f_X(A_{i-1}) - f_X(C_{i-1}) + \epsilon f_{\varphi_i}(A_i)}, \qquad
\tilde{l}^l_{(i,j)} = \max_{X \in \mathcal{Y}_j} \frac{f_X(a_i)}{f_X(C_{i-1}) - (1+\epsilon) f_X(A_{i-1}) + \epsilon f_{\varphi_i}(A_i)}.$$
In the following lemma we establish that the empirical sensitivity scores are good approximations to the true sensitivity scores. It is also used to decide the size of each finite set $\mathcal{Y}_j$.
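To make the role of these empirical scores concrete, the following is a minimal sketch (not part of the paper's algorithm) of how $\tilde{l}^u_{(i,j)}$ could be computed for the squared-Euclidean case, given a sampled query set $\mathcal{Y}_j$. The function names are ours, and the sketch assumes the denominator stays positive, as guaranteed by the upper barrier in the analysis.

import numpy as np

def f_X(points, centers, weights=None):
    """Clustering cost: (weighted) sum of squared distances to the closest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
    return d2.sum() if weights is None else (weights * d2).sum()

def empirical_upper_sensitivity(a_i, A_prev, C_prev, C_wts, A_i, phi_i, Y_j, eps):
    """~l^u_{(i,j)}: max over the sampled query set Y_j of r^u_{(i,j)}(X),
    specialised to the squared-Euclidean (k-means) case."""
    f_phi_Ai = ((A_i - phi_i) ** 2).sum()        # f_{phi_i}(A_i)
    best = 0.0
    for X in Y_j:                                # each X: (j, d) array of candidate centers
        num = f_X(a_i[None, :], X)
        den = (1 + eps) * f_X(A_prev, X) - f_X(C_prev, X, C_wts) + eps * f_phi_Ai
        best = max(best, num / den)              # assumes den > 0 (upper barrier)
    return best

In the algorithmic version, the same maximization is carried out for the lower score $\tilde{l}^l_{(i,j)}$ with the corresponding denominator.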
Lemma 5.6. Let $\delta \in (0,1)$ and consider a set $\mathcal{Y}_j \subset \mathcal{X}_j$ of size $|\mathcal{Y}_j| \geq \lceil K' \log(n/\delta) \rceil$. Then
$$\mathbb{P}_{X \in \mathcal{X}_j}\big(\exists\, i \in [n] : K \tilde{l}^u_{(i,j)} \leq r^u_{(i,j)}(X)\big) \leq \delta, \qquad
\mathbb{P}_{X \in \mathcal{X}_j}\big(\exists\, i \in [n] : K \tilde{l}^l_{(i,j)} \leq r^l_{(i,j)}(X)\big) \leq \delta.$$
Proof.
The proof follows mainly from Lemma 5.5. For the event $E_i$,
$$\mathbb{P}(E_i) : \mathbb{P}_{X \in \mathcal{X}_j}\Big(K \max_{X' \in \mathcal{Y}_j} r^u_{(i,j)}(X') \leq r^u_{(i,j)}(X)\Big) \leq \exp(-|\mathcal{Y}_j|/K').$$
Next, to ensure that the probability of this event holding for any $i \in [n]$ is at most $\delta$, we take a union bound over all $i \in [n]$ and get $|\mathcal{Y}_j| \geq \lceil K' \log(n/\delta) \rceil$. The statement for $\tilde{l}^l_{(i,j)}$ is proved similarly.

Now, based on the above assumption, we present Algorithm 3, which returns a coreset for clustering via Bregman divergences. Note that without Assumption 5.1 our algorithm acts as a heuristic. In this algorithm, instead of getting the $l^u_i$ and $l^l_i$ values from an oracle, we use $(\tilde{l}^u_i, \tilde{l}^l_i)$ and the expected upper bounds as in Lemma 5.3. The algorithm requires $\{\mathcal{Y}_1, \mathcal{Y}_2, \ldots, \mathcal{Y}_n\}$, where each $\mathcal{Y}_j$ has $O(\log(n/\delta))$ queries. Although the algorithm uses knowledge of an upper bound on the number of cluster centres through the query sets $\{\mathcal{Y}_1, \ldots, \mathcal{Y}_n\}$, the expected coreset size is independent of the number of cluster centres, mainly by Lemmas 5.3 and 5.4. In practice one might have knowledge of an upper bound on the number of cluster centres, e.g., $B$ such that $B \ll n$.

Further, as we also use the expected upper bounds while taking the sampling decision, we maintain $t$ coresets instead of one and apply an additional reweighting of $1/t$ to each sampled point. We maintain $t$ coresets in order to improve the chance that the expected upper bound from Lemma 5.3 actually upper bounds the true sensitivity scores, i.e.,
$$\frac{1}{t} \sum_{j \leq t} \sup_{X \in \mathcal{X}} \frac{f_X(a_i)}{(1+\epsilon) f_X(A_{i-1}) - f_X(C^j_{i-1}) + \epsilon f_{\varphi_i}(A_i)} \leq \frac{f_{\varphi_i}(a_i)}{\epsilon \mu_i S} + \frac{12}{\epsilon \mu_i (i-1)},$$
$$\frac{1}{t} \sum_{j \leq t} \sup_{X \in \mathcal{X}} \frac{f_X(a_i)}{f_X(C^j_{i-1}) - (1-\epsilon) f_X(A_{i-1}) + \epsilon f_{\varphi_i}(A_i)} \leq \frac{f_{\varphi_i}(a_i)}{\epsilon \mu_i S} + \frac{12}{\epsilon \mu_i (i-1)}.$$
An important point to note is that, unlike DeterministicFilter, the coreset from NonParametricFilter only ensures the guarantee with high probability. This is due to Lemma 5.6, where the empirical sensitivity scores only approximate the true sensitivity scores with some probability, and also because the above upper bound holds only in expectation. We propose our algorithm as NonParametricFilter.

Note that NonParametricFilter is computationally expensive due to the computation of the terms $\tilde{l}^u_i$ and $\tilde{l}^l_i$. If Assumption 5.1 on $G_{(i,j)}$ does not hold, then the algorithm becomes a heuristic.
Algorithm 3 NonParametricFilter
Require: Points $a_i$, $i = 1, \ldots, n$; $t \geq 1$; $\epsilon \in (0,1)$; $\mathcal{Y} = \cup_{j \leq n} \mathcal{Y}_j$
Ensure: {Coresets, Weights}: $\{(C^1, C^2, \ldots, C^t), (\Omega^1, \Omega^2, \ldots, \Omega^t)\}$
  $c_u = 2/\epsilon + 1$; $c_l = 2/\epsilon - 1$
  $\varphi_0 = \emptyset$; $S = 0$; $C^1 = \ldots = C^t = \Omega^1 = \ldots = \Omega^t = \emptyset$
  $\lambda = \|a_1\|_{\min}$; $\nu = \|a_1\|_{\max}$
  while $i \leq n$ do
    $\lambda = \min\{\lambda, \|a_i\|_{\min}\}$; $\nu = \max\{\nu, \|a_i\|_{\max}\}$
    Update $M_i$; $\mu_i = \lambda/\nu$
    $\varphi_i = ((i-1)\varphi_{i-1} + a_i)/i$; $S = S + f^{M_i}_{\varphi_i}(a_i)$
    if $i = 1$ then
      $p_i = 1$
    else
      $\hat{l}^u_i = \frac{f^{M_i}_{\varphi_i}(a_i)}{\epsilon \mu_i S} + \frac{12}{\epsilon \mu_i (i-1)}$; $\quad \hat{l}^l_i = \frac{f^{M_i}_{\varphi_i}(a_i)}{\epsilon \mu_i S} + \frac{12}{\epsilon \mu_i (i-1)}$
      $\tilde{l}^u_i = \max_{X \in \mathcal{Y}} \frac{f^{M_i}_X(a_i)}{\mu_i \big((1+\epsilon) f^{M_i}_X(A_{i-1}) - \sum_{k \leq t} f^{M_i}_X(C^k_{i-1}) + \epsilon f^{M_i}_{\varphi_i}(A_i)\big)}$
      $\tilde{l}^l_i = \max_{X \in \mathcal{Y}} \frac{f^{M_i}_X(a_i)}{\mu_i \big(\sum_{k \leq t} f^{M_i}_X(C^k_{i-1}) - (1+\epsilon) f^{M_i}_X(A_{i-1}) + \epsilon f^{M_i}_{\varphi_i}(A_i)\big)}$
      $p_i = \min\big\{1, c_u(\hat{l}^u_i + \tilde{l}^u_i) + c_l(\hat{l}^l_i + \tilde{l}^l_i)\big\}$
    end if
    for $j = 1, \ldots, t$ do
      Set $c^j_i$ and $\omega^j_i$ as $a_i$ and $1/(t p_i)$ with probability $p_i$, and as $\emptyset$ and $0$ otherwise
      $(C^j_i, \Omega^j_i) = (C^j_{i-1}, \Omega^j_{i-1}) \cup (c^j_i, \omega^j_i)$
    end for
  end while
  Return $\{(C^1, C^2, \ldots, C^t), (\Omega^1, \Omega^2, \ldots, \Omega^t)\}$

The possible query spaces $\mathcal{X}_j$ can have a query $X$ whose centres are taken from the set of input points, or centres chosen randomly from $\{\mu - R, \mu + R\}^d$, where $\mu$ is the mean of the input points and the farthest point from $\mu$ is at distance $R$. We discuss these in detail in our revised version, where we also present appropriate empirical results for the non-parametric coreset.
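For concreteness, the following is a small sketch of the oracle-free part of this sampler for the squared-Euclidean case ($M_i = I$, $\mu_i = 1$): it maintains the running mean $\varphi_i$ and the sum $S$, uses only the bounds $\hat{l}^u_i, \hat{l}^l_i$ from Lemma 5.3 (the query-set terms $\tilde{l}$ are dropped), and keeps $t$ coresets with weights $1/(t\, p_i)$. It is a simplified illustration under these assumptions, not the full Algorithm 3.

import numpy as np

def heuristic_filter(stream, eps, t=3, seed=0):
    """Online sampler in the spirit of NonParametricFilter, squared-Euclidean case,
    with the empirical query-set terms dropped (heuristic version)."""
    rng = np.random.default_rng(seed)
    c_u, c_l = 2.0 / eps + 1.0, 2.0 / eps - 1.0
    coresets = [([], []) for _ in range(t)]        # t (points, weights) pairs
    phi, S = None, 0.0
    for i, a in enumerate(stream, start=1):
        phi = a.copy() if i == 1 else ((i - 1) * phi + a) / i   # online mean phi_i
        cost_i = ((a - phi) ** 2).sum()                         # f_{phi_i}(a_i)
        S += cost_i                                             # running sum
        if i == 1:
            p = 1.0
        else:
            l_hat = cost_i / (eps * S) + 12.0 / (eps * (i - 1)) # ^l^u_i = ^l^l_i here
            p = min(1.0, (c_u + c_l) * l_hat)
        for pts, wts in coresets:                               # t independent coresets
            if rng.random() < p:
                pts.append(a)
                wts.append(1.0 / (t * p))
    return coresets

On a stream of points given as NumPy arrays, the returned point and weight lists can then be fed to any weighted clustering routine.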
Again, in the case of k-means clustering, the algorithm DeterministicFilter yields an existential coreset $C$ which is non-parametric in nature. In the following corollary we state the guarantee that DeterministicFilter ensures in the case of k-means clustering.

Corollary 5.1. Let $A \in \mathbb{R}^{n \times d}$ whose points are fed to DeterministicFilter. It returns a set of coresets $C$ which ensures the guarantee as in equation (1) for any $X$ with at most $n$ centres in $\mathbb{R}^d$. The returned coreset has expected sample size $O\Big(\frac{\log n}{\epsilon}\big(\log n + \log\big(f_{\varphi}(A)\big) - \log\big(f_{\varphi}(a_1)\big)\big)\Big)$.
Proof.
We prove it using Lemmas 5.2, 5.3 and 5.4. For k-means clustering we have $M_i = I_d$ and $\mu_i = 1$ for each $i \leq n$, hence for all $i \in [n]$ we have,
$$l_i = \frac{f_{\varphi_i}(a_i)}{\epsilon \sum_{j \leq i} f_{\varphi_j}(a_j)} + \frac{12}{\epsilon (i-1)}.$$
This can be verified by an analysis similar to the proofs of Lemmas 5.2 and 5.3. The rest of the lemmas' proofs follow unchanged and we get the required guarantee, i.e., a non-parametric coreset for k-means clustering with $k \leq n$. The coreset returned by DeterministicFilter has expected sample size $O\Big(\frac{\log n}{\epsilon}\big(\log n + \log\big(f_{\varphi}(A)\big) - \log\big(f_{\varphi}(a_1)\big)\big)\Big)$.

As this is just an existential result, we present a heuristic algorithm for the same problem in section 6.2. Further, we show that the coreset returned by our heuristic algorithm captures the non-parametric nature of the coreset and performs well on real world data.

Here we discuss how our existential non-parametric coresets from algorithm 2 can also be used to approximate DP-Means clustering [2], based on squared Euclidean distance. We define a slightly different cost,
$$\mathrm{cost}_{DP}(A, X) = \sum_{a_i \in A} f_X(a_i) + |X|\, \lambda.$$
Here $f_X(a_i)$ is the cost based on some $d_\Phi$ Bregman divergence as introduced earlier. It is not difficult to see that the coreset from DeterministicFilter ensures an additive error approximation for this definition of Bregman divergence based DP-Means clustering.
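As an illustration of this cost for the squared-Euclidean case, a weighted version can be evaluated either on the full data or on a coreset; the helper name and signature below are ours, a minimal sketch rather than the paper's implementation.

import numpy as np

def cost_dp(points, centers, lam, weights=None):
    """DP-Means objective: clustering cost plus lam times the number of centers.
    Passing coreset weights evaluates cost_DP on a weighted coreset."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
    cost = d2.sum() if weights is None else (weights * d2).sum()
    return cost + lam * len(centers)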
Lemma 5.7.
The non-parametric coreset $C$ from DeterministicFilter ensures the following for all $X$ with at most $n$ centres in $\mathbb{R}^d$:
$$\big|\mathrm{cost}_{DP}(C, X) - \mathrm{cost}_{DP}(A, X)\big| \leq \epsilon \big(f_X(A) + f_{\varphi}(A)\big).$$
Proof.
Note that for any $|X| \geq n$, the cost is at least $\lambda |X| \geq n\lambda$. Hence, without loss of generality, we can restrict to $|X| \leq n$, since the optimum will be in this range. We know that for a parameter $\lambda$,
$$\mathrm{cost}_{DP}(A, X) = f_X(A) + |X|\, \lambda.$$
Now, applying DP-Means on the coreset from DeterministicFilter, we get the following,
$$\big|\mathrm{cost}_{DP}(C, X) - \mathrm{cost}_{DP}(A, X)\big| = \big|f_X(C) - f_X(A)\big| \leq \epsilon \big(f_X(A) + f_{\varphi}(A)\big).$$
The last inequality is by Theorem 5.2.

Now we claim that by allowing a small additive error approximation, our coreset size improves significantly upon coresets giving a relative error approximation for DP-Means clustering. Unlike [2], our coresets are existential, but they are much smaller, as in practice $O(d^d) \gg O(\log n)$.

Theorem 5.3.
For $\epsilon \in (0,1)$, let $C$ be the existential non-parametric coreset for $A$ (Theorem 5.2), and let $X_C$ and $X_A$ be the optimal cluster centers for DP-Means clustering on $C$ and $A$ respectively. Then $C$ ensures the following,
$$\mathrm{cost}_{DP}(A, X_C) \leq \mathrm{cost}_{DP}(A, X_A) + \epsilon \big(f_{X_C}(A) + f_{X_A}(A) + 2 f_{\varphi}(A)\big).$$
The expected size of such an existential coreset $C$ is $O\Big(\frac{\log n}{\mu\epsilon}\big(\log n + \log\big(f^{M}_{\varphi}(A)\big) - \log\big(f^{M}_{\varphi}(a_1)\big)\big)\Big)$.

Proof. Let $X_C$ and $X_A$ be the optimal cluster centres for DP-Means clustering on $C$ and $A$ respectively. Now we know that,
$$\mathrm{cost}_{DP}(A, X_C) - \epsilon \big(f_{X_C}(A) + f_{\varphi}(A)\big) \leq \mathrm{cost}_{DP}(C, X_C) \leq \mathrm{cost}_{DP}(C, X_A) \leq \mathrm{cost}_{DP}(A, X_A) + \epsilon \big(f_{X_A}(A) + f_{\varphi}(A)\big),$$
which gives
$$\mathrm{cost}_{DP}(A, X_C) \leq \mathrm{cost}_{DP}(A, X_A) + \epsilon \big(f_{X_C}(A) + f_{X_A}(A) + 2 f_{\varphi}(A)\big).$$
Here the first inequality is due to Lemma 5.7. The remaining inequalities are due to the strong coreset guarantee of $C$.
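A small end-to-end sketch of how this theorem would be used in practice, for the squared-Euclidean case: run a weighted DP-Means heuristic on the coreset, then evaluate the resulting centers on the full data with the cost_dp helper sketched above. The sequential center-opening rule below is the standard DP-Means heuristic, written by us for illustration, not the paper's algorithm.

import numpy as np

def dp_means(points, weights, lam, n_iter=10):
    """Weighted DP-Means heuristic: a point opens a new center when its squared
    distance to every existing center exceeds lam; centers are weighted means."""
    centers = [points[0].copy()]
    for _ in range(n_iter):
        assign = np.zeros(len(points), dtype=int)
        for i, p in enumerate(points):                      # sequential assignment
            d2 = ((np.asarray(centers) - p) ** 2).sum(1)
            j = int(d2.argmin())
            if d2[j] > lam:                                 # open a new center
                centers.append(p.copy())
                j = len(centers) - 1
            assign[i] = j
        centers = [np.average(points[assign == c], axis=0, weights=weights[assign == c])
                   for c in range(len(centers)) if (assign == c).any()]
    return np.asarray(centers)

# X_C = dp_means(coreset_points, coreset_weights, lam) gives centers from the coreset;
# cost_dp(A, X_C, lam) can then be compared against cost_dp(A, X_A, lam), as in Theorem 5.3.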
In this section we demonstrate the performance of our algorithm ParametricFilter. We also demonstrate the heuristic algorithm NonParametricFilter, for which we empirically show that the returned coreset captures the non-parametric nature.

ParametricFilter

Here we empirically show that the coresets constructed using our proposed online algorithms outperform the baseline coreset construction algorithms. We compare the performance of our algorithm with other baselines, such as Uniform and TwoPass, in solving the clustering problem. We consider that each of the algorithms described below receives data in a streaming fashion.
1) ParaFilter: Our ParametricFilter (Algorithm 1).
2) TwoPass: This is similar to our ParametricFilter algorithm, but has the knowledge of $\varphi$, i.e., substituting $\varphi_i = \varphi$, $\forall i$, in Algorithm 1.
3) Uniform: Each arriving point is sampled with probability $r/n$, where $r$ is a parameter used to control the expected number of samples.

We compare the performance of the above described algorithms on the following datasets:
1) KDD (BIO-TRAIN): samples with features.
2) SONGS: songs from the Million Song dataset with features.
For these datasets we consider $k = 100$ and $k = 200$, and use squared Euclidean distance as the Bregman divergence (see Figure 1).
3) MNIST: the $28 \times 28$ dimension digits dataset. We consider $k = 5$ and $k = 10$ on this dataset and relative entropy as the Bregman divergence (see Figure 2).
Figure 1: Relative error v/s coreset size. Squared Euclidean distance as Bregman divergence. (Panels: KDD: BIO-TRAIN (K=100), KDD: BIO-TRAIN (K=200), SONGS (K=100), SONGS (K=200); x-axis: % of data; y-axis: relative error; legend: Uniform, TwoPass, ParaFilter.)

Using each of the above described algorithms, we subsample coresets of different sizes. Once a coreset is generated, we run weighted k-means++ [30] on the coreset to obtain the centers. We then use these centers and compute the quantization error ($C_s$) on the full data set. Additionally, we compute the quantization error by running k-means++ on the full data set ($C_F$). We then compare the algorithms based on the relative error ($\eta$) defined as $|C_s - C_F|/C_F$. The relative error reported in Figures 1 and 2 is averaged over 10 runs of each of the algorithms. The sample sizes are in expectation.
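The following is a minimal sketch of this evaluation protocol. The function names are ours; scikit-learn's KMeans is assumed to be available, and its sample_weight argument is used for the weighted fit on the coreset. The pairwise-distance computation is written naively for brevity; for large datasets it would be done in blocks.

import numpy as np
from sklearn.cluster import KMeans

def quantization_error(A, centers):
    """Sum of squared distances of every point in A to its closest center."""
    return ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1).sum()

def relative_error(A, coreset_pts, coreset_wts, k, seed=0):
    """eta = |C_s - C_F| / C_F, both quantization errors measured on the full data A."""
    km_s = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
    km_s.fit(coreset_pts, sample_weight=coreset_wts)        # weighted k-means++ on coreset
    C_s = quantization_error(A, km_s.cluster_centers_)
    km_f = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed).fit(A)
    C_F = quantization_error(A, km_f.cluster_centers_)
    return abs(C_s - C_F) / C_F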
Figure 1 highlights the change in $\eta$ with the increase in coreset size for $K = 100$, on the KDD: BIO-TRAIN and SONGS datasets, when squared Euclidean distance is considered as the Bregman divergence. As the coreset size increases, the relative error decreases for all the algorithms. However, our algorithm ParametricFilter (ParaFilter) outperforms Uniform, and performs on par with TwoPass, across all the datasets. Additionally, we also compared the performance of our algorithm with the Lightweight-Coreset construction algorithm [1], which is an offline algorithm. The Lightweight-Coreset algorithm performs better than ParaFilter, but the difference is tiny.

Figure 2: Relative error v/s coreset size. Relative entropy as Bregman divergence.

Similarly, Figure 2 shows the performance of the algorithms when relative entropy is used as the Bregman divergence, on the MNIST dataset. Our algorithm ParaFilter outperforms Uniform and performs on par with TwoPass.

NonParametricFilter
Here we empirically show that the coreset returned by the heuristic algorithm captures the non-parametric nature. In this algorithm, instead of getting the $l^u_i$ and $l^l_i$ values from an oracle, we use the expected upper bounds as in Lemma 5.3. Note that without any assumption on the query space (Assumption 5.1) and without using empirical sensitivity scores, NonParametricFilter is a heuristic method which, for each point $a_i$, only needs to update $\varphi_i$, $\mu_i$ and $M_i$ to decide the sampling probability. In this case NonParametricFilter is an online algorithm, which takes the decision about sampling a point $a_i$ before processing $a_{i+1}$.

We run our heuristic algorithm with the Bregman divergence taken as squared Euclidean distance, i.e., the k-means clustering problem. We run NonParametricFilter along with other baseline coreset creation algorithms, namely Uniform, Offline and TwoPass, create coresets for various $\epsilon$ values (1.00, 0.75, 0.50, 0.25), and compare their performances. The following is a brief summary of the algorithms used for coreset creation:
1. Uniform: The algorithm has the knowledge of $n$ and samples each point with probability $r/n$, where $r$ controls the expected number of samples.
2. Offline: We run the offline version of lightweight coresets [1].
3. TwoPass (Filter-2-Pass): The algorithm has the knowledge of $\varphi$, i.e., the mean point of $A$. The algorithm runs NonParametricFilter, where instead of $\varphi_i$ it uses $\varphi$.
4. Filter-NP: We run NonParametricFilter (algorithm 2), our proposed non-parametric algorithm.

To ensure that the number of points sampled by each algorithm is equivalent for a fixed value of $\epsilon$, we first run our NonParametricFilter algorithm, and then use the number of points sampled by NonParametricFilter as the expected number of points to be sampled by the other algorithms.

We evaluate the performance of the above mentioned algorithms on the KDD (BIO-TRAIN) dataset described above. Once the coreset is obtained from each of the sampling methods, we run weighted k-means++ clustering [30] on it for various values of $k$ (such as $50, \ldots$) and get the centers. These centers are used as the initial centres while running k-means clustering on the coreset, which finally yields the centres. Once the centers are obtained, we compute the quantization error on the entire dataset with respect to the corresponding centers, i.e., $C_S(A)$, where $S$ is the set of centers returned from the coreset. We also run k-means clustering on the entire data for these values of $k$ to get the quantization error $C_F(A)$, where $F$ is the set of centers obtained by running k-means++ on the entire set of points $A$. Finally, we report the relative error $\eta$, i.e., $\eta = \frac{|C_S(A) - C_F(A)|}{C_F(A)}$.
Figure 3: Change in relative error $\eta$ with respect to the number of centers $k$ for various values of $\epsilon$. (Panels: $\epsilon = 1.00, 0.75, 0.50, 0.25$; x-axis: $k$ = number of centers, from 50 to 300; y-axis: relative error; legend: Uniform, Offline, Filter-2-Pass, Filter-NP.)

For each of the algorithms mentioned above and each value of $\epsilon$, we run random instances, compute $\eta = \frac{|C_S(A) - C_F(A)|}{C_F(A)}$ for each of the instances, and report the median of the $\eta$ values. We consider the values of $\epsilon$ in $\{1.00, 0.75, 0.50, 0.25\}$, and record the approximate number of points sampled for each value of $\epsilon$. Note that to capture the notion of the non-parametric nature of the coreset, we run k-means clustering on a fixed coreset for different values of $k$.

Figure 3 shows the change in the value of the relative error $\eta$ with respect to the change in the number of centers $k$, for various values of $\epsilon$. With the decrease in the value of $\epsilon$, we note that the value of $\eta$ overall decreases for all the algorithms. This is as expected, because with the decrease in $\epsilon$, the additive error part decreases and the coreset size increases. We can also observe that, for each value of $\epsilon$, with the change in the value of $k$ the relative error $\eta$ remains almost constant for all the algorithms except Uniform, which shows the non-parametric nature of the importance sampling algorithms. Also, we can note that even though
Offline beats our algorithm NonParametricFilter in terms of the relative error, the difference is small. The above empirical results provide evidence that such a coreset can also be used to learn extreme clustering [8].
In this work we present the online algorithm ParametricFilter, which returns a coreset for clustering based on Bregman divergences. With DeterministicFilter we further show that non-parametric coresets with additive error exist, but finding such a coreset efficiently in practice remains an open question. We also show that the existential coreset can be used to obtain a well-approximated solution for the DP-Means problem. Without any assumption, the existential coreset relies on an oracle whose implementation is not known. The work related to NonParametricFilter is in progress. In our next revision we plan to discuss Assumption 5.1 in greater detail. Further, we also plan to provide appropriate empirical results demonstrating NonParametricFilter.

We are grateful to the anonymous reviewers for their helpful feedback. This project has received funding from the Engineering and Physical Sciences Research Council, UK (EPSRC) under Grant Ref: EP/S03353X/1. Anirban acknowledges the kind support of the N. Rama Rao Chair Professorship at IIT Gandhinagar, the Google India AI/ML award (2020), Google Faculty Award (2015), and CISCO University Research Grant (2016).
References

[1] Olivier Bachem, Mario Lucic, and Andreas Krause. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1119–1127, 2018.
[2] Olivier Bachem, Mario Lucic, and Andreas Krause. Coresets for nonparametric estimation - the case of DP-means. In ICML, pages 209–217, 2015.
[3] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
[4] Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in Euclidean spaces: Importance sampling is nearly optimal, 2020.
[5] Michael B Cohen, Cameron Musco, and Jakub Pachocki. Online row sampling. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
[6] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569–578. ACM, 2011.
[7] Joshua Batson, Daniel A Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721, 2012.
[8] Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. A hierarchical algorithm for extreme clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 255–264, 2017.
[9] Marcel R Ackermann and Johannes Blömer. Coresets and approximate clustering for Bregman divergences. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 1088–1097. SIAM, 2009.
[10] Michael Langberg and Leonard J Schulman. Universal ε-approximators for integrals. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 598–607. SIAM, 2010.
[11] Devdatt P Dubhashi and Alessandro Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009.
[12] Rachit Chhaya, Anirban Dasgupta, and Supratim Shit. On coresets for regularized regression. arXiv preprint arXiv:2006.05440, 2020.
[13] Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. Approximating extent measures of points. Journal of the ACM (JACM), 51(4):606–635, 2004.
[14] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. stat, 1050:4, 2017.
[15] David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
[16] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 291–300. ACM, 2004.
[17] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163–172, 2015.
[18] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.
[19] Artem Barger and Dan Feldman. Deterministic coresets for k-means of big sparse data. Algorithms, 13(4):92, 2020.
[20] Dan Feldman, Mikhail Volkov, and Daniela Rus. Dimensionality reduction of massive sparse datasets using coresets. In Advances in Neural Information Processing Systems, pages 2766–2774, 2016.
[21] Olivier Bachem, Mario Lucic, and Silvio Lattanzi. One-shot coresets: The case of k-clustering. In International conference on artificial intelligence and statistics, pages 784–792, 2018.
[22] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3):601–657, 2020.
[23] Christos Boutsidis and Malik Magdon-Ismail. Deterministic feature selection for k-means clustering. IEEE Transactions on Information Theory, 59(9):6099–6110, 2013.
[24] Mario Lucic, Olivier Bachem, and Andreas Krause. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In Artificial intelligence and statistics, pages 1–9, 2016.
[25] Edo Liberty, Ram Sriharsha, and Maxim Sviridenko. An algorithm for online k-means clustering. In , pages 81–89. SIAM, 2016.
[26] Silvio Lattanzi and Sergei Vassilvitskii. Consistent k-clustering. In International Conference on Machine Learning, pages 1975–1984, 2017.
[27] Aditya Bhaskara and Aravinda Kanchana Rwanpathirana. Robust algorithms for online k-means clustering. In Algorithmic Learning Theory, pages 148–173, 2020.
[28] Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
[29] Cenk Baykal, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. Data-dependent coresets for compressing neural networks with applications to generalization bounds. In International Conference on Learning Representations, 2018.
[30] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007.