Machine Learning Friendly Set Version of Johnson-Lindenstrauss Lemma
Mieczysław A. Kłopotek ([email protected])
Institute of Computer Science of the Polish Academy of Sciences
ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
November 10, 2017
Abstract
In this paper we make a novel use of the Johnson-Lindenstrauss Lemma. The Lemma has an existential form saying that there exists a JL transformation f of the data points into a lower dimensional space such that all of them fall into a predefined error range δ.

We formulate in this paper a theorem stating that we can choose the target dimensionality in a random projection type JL linear transformation in such a way that, with probability 1 − ε, all of them fall into a predefined error range δ, for any user-predefined failure probability ε.

This result is important for applications such as data clustering, where we want an a priori dimensionality reducing transformation instead of trying out a (large) number of them, as with the traditional Johnson-Lindenstrauss Lemma. In particular, we take a closer look at the k-means algorithm and prove that a good solution in the projected space is also a good solution in the original space. Furthermore, under proper assumptions local optima in the original space are also local optima in the projected space. We also define conditions under which the clusterability property of the original space is transmitted to the projected space, so that special case algorithms for the original space are also applicable in the projected space.

Keywords:
Johnson-Lindenstrauss Lemma, random projection, sample distortion, dimensionality reduction, linear JL transform, k-means algorithm, clusterability retention

Introduction
Dimensionality reduction plays an important role in many areas of data processing, and especially in machine learning (cluster analysis, classifier learning, model validation, data visualisation etc.).

Usually it is associated with manifold learning, that is, a belief that the data lie in fact in a low dimensional subspace that needs to be identified, and the data projected onto it, so that the number of degrees of freedom is reduced and, as a consequence, sample sizes can also be smaller without loss of reliability. Techniques like reduced k-means [17], PCA (Principal Component Analysis), Kernel PCA, LLE (Locally Linear Embedding), LEM (Laplacian Eigenmaps), MDS (Metric Multidimensional Scaling), Isomap and SDE (Semidefinite Embedding), just to mention a few, follow this approach.

But there exists still another possibility of approaching the dimensionality reduction problem, in particular when such an intrinsic subspace where the data are located cannot be identified. The problem of the choice of the subspace has been surpassed by several authors by so-called random projection, applicable in particularly highly dimensional spaces (tens of thousands of dimensions) and correspondingly large data sets (of at least hundreds of points).

The starting point here is the Johnson-Lindenstrauss Lemma [13]. Roughly speaking, it states that there exists a linear mapping from a higher dimensional space into a sufficiently high dimensional subspace that will approximately preserve the distances between points, as needed e.g. by the k-means algorithm [4]. (The JL Lemma speaks about a general transformation, but many researchers look just for linear ones.)

To be more formal, consider a set Q of m objects, Q = {1, . . . , m}. An object i ∈ Q may have a representation x_i ∈ R^n. The set of these representations will be denoted by Q. An object i ∈ Q may have a representation x'_i ∈ R^{n'}, in a different space. The set of these representations will be denoted by Q'.

With this notation let us state:

Theorem 1. (Johnson-Lindenstrauss) Let δ ∈ (0, 1). Let Q be a set of m objects and Q a set of points representing them in R^n, and let n' ≥ C ln m / δ², where C is a sufficiently large constant (e.g. 20). There exists a Lipschitz mapping f : R^n → R^{n'} such that for all u, v ∈ Q

(1 − δ)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + δ)‖u − v‖²   (1)

A number of proofs and applications of this theorem have been proposed which in fact do not prove the theorem as such but rather create a probabilistic construction: n' vectors with random coordinates are sampled from the original n-dimensional space and one uses them as a coordinate system in the n'-dimensional subspace, which is a much simpler process. One hopes that the sampled vectors will be orthogonal (and hence the coordinate system will be orthogonal), which in the case of vectors with thousands of coordinates is reasonable. That means we create a matrix M of n' rows and n columns as follows: for each row i we sample n numbers from N(0, 1), forming a row vector a_i^T. We normalize it, obtaining the row vector b_i^T = a_i^T · (a_i^T a_i)^{-1/2}. This becomes the i-th row of the matrix M.
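A minimal NumPy sketch of this construction (an illustration following the description above, not the author's code; the function name is ours) samples each row from N(0, 1) and normalizes it to unit length:

```python
import numpy as np

def random_projection_matrix(n, n_prime, seed=None):
    """Return an n' x n matrix M whose i-th row is b_i^T = a_i^T (a_i^T a_i)^{-1/2},
    where a_i has independent N(0, 1) coordinates; for large n the rows are
    nearly orthogonal, so they can serve as a coordinate system of the subspace."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_prime, n))                  # rows a_i^T
    return A / np.linalg.norm(A, axis=1, keepdims=True)    # rows b_i^T
```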
Then for any data point x in the original space its random projection is obtained as x' = Mx. The mapping we seek is then this projection multiplied by a suitable factor.

It is claimed afterwards that this mapping is distance-preserving not only for a single vector, but also for large sets of points, with some, usually very small, probability, as Dasgupta and Gupta [11] maintain. Via applying the above process many times one can finally get the mapping f that is needed. That is, each time we sample a subspace from the space of subspaces and check whether the condition expressed by equation (1) holds for all the points; if not, we sample again, with the reasonable hope that we will get the subspace of interest after a finite number of steps, with the probability that we assume.

In this paper we explore the following flaw of the mentioned approach: if we want to apply, for example, a k-means clustering algorithm, we are in fact not interested in resampling the subspaces in order to find a convenient one in which the distances are sufficiently preserved. Computation over and over again of m²/2 distances between the points in the projected space may turn out to be much more expensive than computing O(mk) distances during k-means clustering (if m ≫ k) in the original space. In fact we are primarily interested in clustering the data. But we do not have any criterion for the k-means algorithm that would say that this particular subspace is the right one, via e.g. minimization of the k-means criterion (and in fact the same holds for any other clustering algorithm).

Therefore, we rather seek a scheme that will allow us to say that by a certain random sampling we have already found the subspace that we sought, with a sufficiently high probability. As far as we know, this is the first time such a problem has been posed.

To formulate claims concerning k-means, we need to introduce additional notation. Let us denote with C a partition of Q into k clusters {C_1, . . . , C_k}. For any i ∈ Q let C(i) denote the cluster C_j to which i belongs. For any set of objects C_j let µ(C_j) = (1/|C_j|) Σ_{i∈C_j} x_i and µ'(C_j) = (1/|C_j|) Σ_{i∈C_j} x'_i.

Under this notation the k-means cost function may be written as

J(Q, C) = Σ_{i∈Q} ‖x_i − µ(C(i))‖²   (2)

J(Q', C) = Σ_{i∈Q} ‖x'_i − µ'(C(i))‖²   (3)

for the sets Q, Q'.

Our contribution is as follows:

• We formulate and prove a set version of the JL Lemma - see Theorem 6.

• Based on it we demonstrate that a good solution to the k-means problem in the projected space is also a good one in the original space - see Theorem 2.

• We show that local k-means minima in the original and the projected spaces match under proper conditions - see Theorems 3, 4.

• We demonstrate that a perfect k-means algorithm in the projected space is a constant factor approximation of the global optimum in the original space - see Theorem 5.

• We prove that the projection preserves several clusterability properties - see Theorems 7, 8, 9, 10 and 11.

For k-means in particular we make the following claim:

Theorem 2.
Let Q be a set of m representatives of objects from Q in an n-dimensional orthogonal coordinate system C_n. Let δ ∈ (0, 1), ε ∈ (0, 1), and let

n' ≥ 2 (−ln ε + 2 ln m) / (−ln(1 + δ) + δ)   (4)

Let C_{n'} be a randomly selected (via sampling from a normal distribution) n'-dimensional orthogonal coordinate system. Let the set Q' consist of m objects such that for each i ∈ Q, x'_i ∈ Q' is the projection of x_i ∈ Q onto C_{n'}. If C is a partition of Q, then

(1 − δ) J(Q, C) ≤ (n/n') J(Q', C) ≤ (1 + δ) J(Q, C)   (5)

holds with probability of at least 1 − ε.

Note that the inequality (5) can be rewritten as

(1/(1 + δ)) J(Q', C) ≤ (n'/n) J(Q, C) ≤ (1/(1 − δ)) J(Q', C)

Furthermore
Theorem 3.
Under the assumptions and notation of Theorem 2, if the partition C* constitutes a local minimum of J(Q, C) over C (in the original space), if for any two clusters the gap between these clusters is g = 2(1 − α) times half of the distance between their centres, where α ∈ [0, 1), and if

δ ≤ (1 − (1 − g/2)²) / ((1 − g/2)² + (1 + 2p))   (6)

(p to be defined later by inequality (14)), then this same partition is (in the projected space) also a local minimum of J(Q', C) over C, with probability of at least 1 − ε.

Theorem 4.
Under the assumptions and notation of Theorem 2, if the clustering C'* constitutes a local minimum of J(Q', C) over C (in the projected space), if for any two clusters the gap between these clusters is (1 − α) times the distance between their centres, where α ∈ [0, 1), and if

δ / (1 − δ) ≤ (1 − α²) / ((1 + 2p) + α²)   (7)

then the very same partition C'* is also (in the original space) a local minimum of J(Q, C) over C, with probability of at least 1 − ε.

Theorem 5.
Under the assumptions and notation of Theorem 2, if C_G denotes the clustering reaching the global optimum in the original space, and C'_G denotes the clustering reaching the global optimum in the projected space, then

(n/n') J(Q', C'_G) ≤ (1 + δ) J(Q, C_G)   (8)

with probability of at least 1 − ε. That is, the perfect k-means algorithm in the projected space is a constant factor approximation of the k-means optimum in the original space.

We postpone the proofs of Theorems 2-5 till Section 3, as we first need to derive the basic Theorem 6 in Section 2, which is essentially based on the results reported by Dasgupta and Gupta [11].

Let us however stress at this point the significance of these theorems. Earlier forms of the JL Lemma required sampling of the coordinates over and over again, with quite a low success rate, until a mapping is found fitting the error constraints. In our theorems, we need only one sampling in order to achieve the required success probability of selecting a suitable subspace to perform k-means. (Though in passing a similar result is claimed in Lemma 5.3 of http://math.mit.edu/~bandeira/2015_18.S096_5_Johnson_Lindenstrauss.pdf, though without an explicit proof.) In Section 5 we illustrate this advantage by some numerical simulation results, showing at the same time the impact of various parameters of the Johnson-Lindenstrauss Lemma on the dimensionality of the projected space. In Section 6 we recall the corresponding results of other authors. In Section 4 we demonstrate an additional advantage of our version of the JL Lemma, consisting in the preservation of various clusterability criteria. Section 7 contains some concluding remarks.

Let us present the process of seeking the mapping f from Theorem 1 in a more detailed manner, so that we can then switch to our target of selecting the size of the subspace guaranteeing that the projected distances preserve their proportionality in the required range.

Let us consider first a single vector x = (x_1, x_2, ..., x_n) of n independent random variables drawn from the normal distribution N(0, 1) with mean 0 and variance 1. Let x' = (x_1, x_2, ..., x_{n'}), where n' < n, be its projection onto the first n' coordinates. Dasgupta and Gupta [11] in their Lemma 2.2 demonstrated that for a positive β:

• if β < 1 then

Pr(‖x'‖² ≤ β (n'/n) ‖x‖²) ≤ β^{n'/2} (1 + n'(1 − β)/(n − n'))^{(n − n')/2}   (9)

• if β > 1 then

Pr(‖x'‖² ≥ β (n'/n) ‖x‖²) ≤ β^{n'/2} (1 + n'(1 − β)/(n − n'))^{(n − n')/2}   (10)

Now imagine we want to keep the error of the squared length of x bounded within a range of ±δ (relative error) upon projection, where δ ∈ (0, 1). Then we get the probability

Pr((1 − δ)‖x‖² ≤ (n/n')‖x'‖² ≤ (1 + δ)‖x‖²)
≥ 1 − (1 − δ)^{n'/2} (1 + n'δ/(n − n'))^{(n − n')/2} − (1 + δ)^{n'/2} (1 − n'δ/(n − n'))^{(n − n')/2}

This implies
Pr((1 − δ)‖x‖² ≤ (n/n')‖x'‖² ≤ (1 + δ)‖x‖²)
≥ 1 − 2 max{ (1 − δ)^{n'/2} (1 + n'δ/(n − n'))^{(n − n')/2}, (1 + δ)^{n'/2} (1 − n'δ/(n − n'))^{(n − n')/2} }
= 1 − 2 max_{δ* ∈ {−δ, +δ}} (1 − δ*)^{n'/2} (1 + δ* n'/(n − n'))^{(n − n')/2}

The same holds if we scale the vector x.

Now if we have a sample consisting of m points in space, without however a guarantee that the coordinates are independent between the vectors, then we want the probability that the squared distances between all of them lie within the relative range ±δ to be higher than

1 − ε ≤ 1 − (m(m − 1)/2) (1 − Pr((1 − δ)‖x‖² ≤ (n/n')‖x'‖² ≤ (1 + δ)‖x‖²))   (11)

for some failure probability term ε ∈ (0, 1). (We speak about a success if all the projected data points lie within the range defined by formula (1); otherwise we speak about a failure, even if only one data point lies outside this range.)

To achieve this, it is sufficient that the following holds:

ε ≥ (m(m − 1)/2) · 2 max_{δ* ∈ {−δ, +δ}} (1 − δ*)^{n'/2} (1 + δ* n'/(n − n'))^{(n − n')/2}

Taking the logarithm,

ln ε ≥ ln(m(m − 1)) + max_{δ* ∈ {−δ, +δ}} ( (n'/2) ln(1 − δ*) + ((n − n')/2) ln(1 + δ* n'/(n − n')) )

ln ε − ln(m(m − 1)) ≥ max_{δ* ∈ {−δ, +δ}} ( (n'/2) ln(1 − δ*) + ((n − n')/2) ln(1 + δ* n'/(n − n')) )

We know that ln(1 + x) < x for x > −1 and x ≠ 0, hence the above holds if

ln ε − ln(m(m − 1)) ≥ max_{δ* ∈ {−δ, +δ}} ( (n'/2) ln(1 − δ*) + ((n − n')/2) · δ* n'/(n − n') )

ln ε − ln(m(m − 1)) ≥ max_{δ* ∈ {−δ, +δ}} ( (n'/2) ln(1 − δ*) + (1/2) δ* n' ) = (n'/2) max_{δ* ∈ {−δ, +δ}} (ln(1 − δ*) + δ*)

(Recall at this point the Taylor expansion ln(1 + x) = x − x²/2 + x³/3 − x⁴/4 + . . ., which converges in the range (−1, 1) and hence implies ln(1 + x) < x for x ∈ (−1, 0) ∪ (0, 1); we will refer to it when discussing the difference to JL theorems of other authors.)

Recall also that ln(1 − x) + x < 0 for x < 1 and x ≠ 0, therefore

max_{δ* ∈ {−δ, +δ}} ( (ln ε − ln(m(m − 1))) / (ln(1 − δ*) + δ*) ) ≤ n'/2
So finally, realizing that −ln(1 − δ) − δ ≥ −ln(1 + δ) + δ > 0, and that ln(m(m − 1)) < 2 ln m, we get as a sufficient condition

n' ≥ 2 (−ln ε + 2 ln m) / (−ln(1 + δ) + δ)

(We substituted the denominator with a smaller positive number and the numerator with a larger positive number, so that the value of the fraction increases and a higher n' will be required than actually needed.)

Note that this expression does not depend on n, that is, the number of dimensions in the projection is chosen independently of the original number of dimensions.

So we are ready to formulate the major finding of this paper.

Theorem 6.
Let δ ∈ (0, 1), ε ∈ (0, 1). Let Q ⊂ R^n be a set of m points in an n-dimensional orthogonal coordinate system C_n and let (as in formula (4))

n' ≥ 2 (−ln ε + 2 ln m) / (−ln(1 + δ) + δ)

Let C_{n'} be a randomly selected (via sampling from a normal distribution) n'-dimensional orthogonal coordinate system. For each v ∈ Q let v' be its projection onto C_{n'}. Then for all pairs u, v ∈ Q

(1 − δ)‖u − v‖² ≤ (n/n')‖u' − v'‖² ≤ (1 + δ)‖u − v‖²   (12)

holds with probability of at least 1 − ε.

(Though in passing a similar result is claimed in Lemma 5.3 of http://math.mit.edu/~bandeira/2015_18.S096_5_Johnson_Lindenstrauss.pdf, though without an explicit proof. There it is proposed that n' ≥ (2 + r) · 2 ln(m) / (−ln(1 + δ) + δ) in order to get a failure rate below m^{−r}. In fact, when we substitute ε = m^{−r}, both formulas are the same. However, usage of ε allows for control of the failure rate in the other theorems of this paper, while r does not make this possibility obvious. Also, fixing r versus fixing ε impacts disadvantageously the growth rate of n' with m.)

The permissible error δ will surely depend on the target application. Let us consider the context of k-means. First we claim for k-means that the JL Lemma applies not only to data points but also to cluster centres.

Lemma 1. Let δ ∈ (0, 1), ε ∈ (0, 1). Let Q ⊂ R^n be a set of m representatives of elements of Q in an n-dimensional orthogonal coordinate system C_n and let the inequality (4) hold. Let C_{n'} be a randomly selected (via sampling from a normal distribution) n'-dimensional orthogonal coordinate system. For each x_i ∈ Q let x'_i ∈ Q' be its projection onto C_{n'}. Let C be a partition of Q. Then for all data points x_i ∈ Q

(1 − δ)‖x_i − µ(C(i))‖² ≤ (n/n')‖x'_i − µ'(C(i))‖² ≤ (1 + δ)‖x_i − µ(C(i))‖²   (13)

holds with probability of at least 1 − ε.

Proof. As we know, data points under k-means are assigned to the cluster having the closest cluster centre. On the other hand, the cluster centre µ is the average of all the data point representatives in the cluster. Hence the cluster element i has the squared distance to its cluster centre µ(C(i)) amounting to

‖x_i − µ(C(i))‖² = (1/|C(i)|²) ‖ Σ_{j∈C(i)} (x_i − x_j) ‖²

But according to Theorem 6

(1 − δ) Σ_{j∈C(i)} ‖x_i − x_j‖² ≤ (n/n') Σ_{j∈C(i)} ‖x'_i − x'_j‖² ≤ (1 + δ) Σ_{j∈C(i)} ‖x_i − x_j‖²

Hence

(1 − δ)‖x_i − µ(C(i))‖² ≤ (n/n')‖x'_i − µ'(C(i))‖² ≤ (1 + δ)‖x_i − µ(C(i))‖²

Note that here µ'(C(i)) is not the projective image of µ(C(i)), but rather the centre of the projected images of the cluster elements.
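As a quick illustration of Lemma 1 (a sketch only; the sizes below are arbitrary and far smaller than formula (4) would demand), one can project a synthetic data set with a random matrix built as in the Introduction and compare the per-point squared distances to the cluster centres in both spaces:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, n_prime, delta = 200, 2000, 400, 0.2      # illustrative values, not from the paper

X = rng.standard_normal((m, n))                  # representations Q
labels = rng.integers(0, 3, size=m)              # an arbitrary partition C into k = 3 clusters

A = rng.standard_normal((n_prime, n))            # random projection matrix M, as in the Introduction
M = A / np.linalg.norm(A, axis=1, keepdims=True)
Xp = X @ M.T                                     # projected representations Q'

def sq_dist_to_own_centre(Y, labels):
    """||y_i - centre of y_i's cluster||^2, with the centre computed in Y's own space."""
    out = np.empty(len(Y))
    for c in np.unique(labels):
        idx = labels == c
        out[idx] = ((Y[idx] - Y[idx].mean(axis=0)) ** 2).sum(axis=1)
    return out

d = sq_dist_to_own_centre(X, labels)                    # ||x_i - mu(C(i))||^2
dp = sq_dist_to_own_centre(Xp, labels) * n / n_prime    # (n/n') ||x'_i - mu'(C(i))||^2

ok = ((1 - delta) * d <= dp) & (dp <= (1 + delta) * d)
print("fraction of points satisfying (13):", ok.mean())
print("cost ratio (n/n') J(Q',C) / J(Q,C):", dp.sum() / d.sum())
```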
Lemma 1 permits us to prove Theorem 2.

Proof. (Theorem 2) According to formula (13):

(1 − δ)‖x_i − µ(C(i))‖² ≤ (n/n')‖x'_i − µ'(C(i))‖² ≤ (1 + δ)‖x_i − µ(C(i))‖²

Hence

Σ_{i∈Q} (1 − δ)‖x_i − µ(C(i))‖² ≤ Σ_{i∈Q} (n/n')‖x'_i − µ'(C(i))‖² ≤ Σ_{i∈Q} (1 + δ)‖x_i − µ(C(i))‖²

(1 − δ) Σ_{i∈Q} ‖x_i − µ(C(i))‖² ≤ (n/n') Σ_{i∈Q} ‖x'_i − µ'(C(i))‖² ≤ (1 + δ) Σ_{i∈Q} ‖x_i − µ(C(i))‖²

Based on the defining equations (2) and (3) we get the formula (5):

(1 − δ) J(Q, C) ≤ (n/n') J(Q', C) ≤ (1 + δ) J(Q, C)

Let us now investigate the distance between the centres of two clusters, say C_1, C_2. Let their cardinalities amount to m_1, m_2 respectively. Denote C_{12} = C_1 ∪ C_2; consequently m_{12} = |C_{12}| = m_1 + m_2. For a set C_j let VAR(C_j) = (1/|C_j|) Σ_{i∈C_j} ‖x_i − µ(C_j)‖² and VAR'(C_j) = (1/|C_j|) Σ_{i∈C_j} ‖x'_i − µ'(C_j)‖². Therefore

VAR(C_{12}) = (1/|C_{12}|) Σ_{i∈C_{12}} ‖x_i − µ(C_{12})‖²
= (1/|C_{12}|) [ Σ_{i∈C_1} ‖x_i − µ(C_{12})‖² + Σ_{i∈C_2} ‖x_i − µ(C_{12})‖² ]

By inserting a zero,

= (1/|C_{12}|) [ Σ_{i∈C_1} ‖x_i − µ(C_1) + µ(C_1) − µ(C_{12})‖² + Σ_{i∈C_2} ‖x_i − µ(C_{12})‖² ]

= (1/|C_{12}|) [ Σ_{i∈C_1} ( ‖x_i − µ(C_1)‖² + ‖µ(C_1) − µ(C_{12})‖² + 2(x_i − µ(C_1))·(µ(C_1) − µ(C_{12})) ) + Σ_{i∈C_2} ‖x_i − µ(C_{12})‖² ]

= (1/|C_{12}|) [ Σ_{i∈C_1} ‖x_i − µ(C_1)‖² + |C_1| ‖µ(C_1) − µ(C_{12})‖² + 2( Σ_{i∈C_1} x_i − |C_1| µ(C_1) )·(µ(C_1) − µ(C_{12})) + Σ_{i∈C_2} ‖x_i − µ(C_2) + µ(C_2) − µ(C_{12})‖² ]

= (1/|C_{12}|) [ ( VAR(C_1)|C_1| + |C_1| ‖µ(C_1) − µ(C_{12})‖² ) + Σ_{i∈C_2} ‖x_i − µ(C_{12})‖² ]

since Σ_{i∈C_1} x_i = |C_1| µ(C_1), so the mixed term vanishes. Via the same reasoning applied to C_2 we get

= (1/|C_{12}|) [ ( VAR(C_1)|C_1| + |C_1| ‖µ(C_1) − µ(C_{12})‖² ) + ( VAR(C_2)|C_2| + |C_2| ‖µ(C_2) − µ(C_{12})‖² ) ]

As apparently

µ(C_{12}) = (1/|C_{12}|) Σ_{i∈C_{12}} x_i = (1/|C_{12}|) ( Σ_{i∈C_1} x_i + Σ_{i∈C_2} x_i ) = (1/|C_{12}|) ( |C_1| µ(C_1) + |C_2| µ(C_2) )

that is µ(C_{12}) = (|C_1|/|C_{12}|) µ(C_1) + (|C_2|/|C_{12}|) µ(C_2), we get

= (1/|C_{12}|) [ VAR(C_1)|C_1| + VAR(C_2)|C_2| + |C_1| ‖ µ(C_1) − (|C_1|/|C_{12}|)µ(C_1) − (|C_2|/|C_{12}|)µ(C_2) ‖² + |C_2| ‖ µ(C_2) − (|C_1|/|C_{12}|)µ(C_1) − (|C_2|/|C_{12}|)µ(C_2) ‖² ]

= (1/|C_{12}|) [ VAR(C_1)|C_1| + VAR(C_2)|C_2| + |C_1| ‖ (|C_2|/|C_{12}|)(µ(C_1) − µ(C_2)) ‖² + |C_2| ‖ (|C_1|/|C_{12}|)(µ(C_2) − µ(C_1)) ‖² ]

= (1/|C_{12}|) [ VAR(C_1)|C_1| + VAR(C_2)|C_2| + ((|C_1||C_2|² + |C_2||C_1|²)/|C_{12}|²) ‖µ(C_1) − µ(C_2)‖² ]
VAR(C_{12}) = (1/|C_{12}|) [ VAR(C_1)|C_1| + VAR(C_2)|C_2| + (|C_1||C_2|/|C_{12}|) ‖µ(C_1) − µ(C_2)‖² ]

This leads immediately to
VAR(C_{12}) · m_{12} = VAR(C_1) · m_1 + VAR(C_2) · m_2 + (m_1 m_2 / m_{12}) ‖µ(C_1) − µ(C_2)‖²

which implies

VAR(C_{12}) · m_{12}² / (m_1 m_2) = VAR(C_1) · m_{12}/m_2 + VAR(C_2) · m_{12}/m_1 + ‖µ(C_1) − µ(C_2)‖²

According to Lemma 1, applied to the set C_{12} as a cluster,

(1 − δ) ( VAR(C_1) · m_{12}/m_2 + VAR(C_2) · m_{12}/m_1 + ‖µ(C_1) − µ(C_2)‖² )
≤ (n/n') ( VAR'(C_1) · m_{12}/m_2 + VAR'(C_2) · m_{12}/m_1 + ‖µ'(C_1) − µ'(C_2)‖² )
≤ (1 + δ) ( VAR(C_1) · m_{12}/m_2 + VAR(C_2) · m_{12}/m_1 + ‖µ(C_1) − µ(C_2)‖² )

and, with respect to C_1, C_2 combined,

(1 − δ) ( VAR(C_1) · m_{12}/m_2 + VAR(C_2) · m_{12}/m_1 )
≤ (n/n') ( VAR'(C_1) · m_{12}/m_2 + VAR'(C_2) · m_{12}/m_1 )
≤ (1 + δ) ( VAR(C_1) · m_{12}/m_2 + VAR(C_2) · m_{12}/m_1 )

These two last equations mean that

−2δ ( VAR(C_1) · m_{12}/m_2 + VAR(C_2) · m_{12}/m_1 ) + (1 − δ)‖µ(C_1) − µ(C_2)‖²
≤ (n/n') ‖µ'(C_1) − µ'(C_2)‖²
≤ 2δ ( VAR(C_1) · m_{12}/m_2 + VAR(C_2) · m_{12}/m_1 ) + (1 + δ)‖µ(C_1) − µ(C_2)‖²

Let us assume that the quotient
( VAR(C_1) · m_{12}/m_2 + VAR(C_2) · m_{12}/m_1 ) / ‖µ(C_1) − µ(C_2)‖² ≤ p   (14)

where p is some positive number. So we have in effect

(1 − δ(1 + 2p))‖µ(C_1) − µ(C_2)‖² ≤ (n/n')‖µ'(C_1) − µ'(C_2)‖² ≤ (1 + δ(1 + 2p))‖µ(C_1) − µ(C_2)‖²

Under balanced ball-shaped clusters p does not exceed 1. So we have shown the lemma:

Lemma 2. Under the assumptions of the preceding lemmas, for any two clusters C_1, C_2,

(1 − δ(1 + 2p))‖µ(C_1) − µ(C_2)‖² ≤ (n/n')‖µ'(C_1) − µ'(C_2)‖² ≤ (1 + δ(1 + 2p))‖µ(C_1) − µ(C_2)‖²   (15)

where p depends on the degree of balance between the clusters and on the cluster shape, holds with probability at least 1 − ε.

Now let us consider the choice of δ in such a way that with high probability no data point will be classified into some other cluster. We claim the following

Lemma 3.
Consider two clusters C_1, C_2. Let δ ∈ (0, 1), ε ∈ (0, 1). Let Q ⊂ R^n be a set of m points in an n-dimensional orthogonal coordinate system C_n and let the inequality (4) hold. Let C_{n'} be a randomly selected (via sampling from a normal distribution) n'-dimensional orthogonal coordinate system. For each x_i ∈ Q let x'_i be its projection onto C_{n'}. For the two clusters C_1, C_2, obtained via k-means in the original space, let µ_1, µ_2 be their centres and µ'_1, µ'_2 be the centres of the corresponding sets of projected cluster members. Furthermore, let d be the distance of the first cluster centre to the common border of both clusters, and let the closest point of the first cluster to this border be at the distance αd from its cluster centre, as projected on the line connecting both cluster centres, where α ∈ (0, 1].

Then all projected points of the first cluster are (each) closer to the centre of the set of projected points of the first cluster than to the centre of the set of projected points of the second if

δ ≤ (1 − (1 − g/2)²) / ((1 − g/2)² + (1 + 2p)) = (1 − α²) / ((1 + 2p) + α²)   (16)

where g = 2(1 − α), with probability of at least 1 − ε.

Proof. Consider a data point x "close" to the border between the two neighbouring clusters, on the line connecting the cluster centres, belonging to the first cluster, at a distance αd from its cluster centre, where d is the distance of the first cluster centre to the border and α ∈ (0, 1]. The squared distance between cluster centres, under projection, can be "reduced" by the factor 1 − δ (beside the factor n/n', which is common to all the points), whereas the squared distance of x to its cluster centre may be "increased" by the factor 1 + δ. This implies a relationship between the factor α and the error δ.

If x' should not cross the border between the clusters, the following needs to hold:

‖x' − µ'_1‖² ≤ (1/4)‖µ'_1 − µ'_2‖²   (17)

which implies

(n/n')‖x' − µ'_1‖² ≤ (n/n')(1/4)‖µ'_1 − µ'_2‖²

As (see Lemma 1)

(n/n')‖x' − µ'_1‖² ≤ (1 + δ)‖x − µ_1‖² = (1 + δ)(αd)²

and (see Lemma 2)

(n/n')(1/4)‖µ'_1 − µ'_2‖² ≥ (1 − δ(1 + 2p))(1/4)‖µ_1 − µ_2‖² = (1 − δ(1 + 2p)) d²

we know that, for inequality (17) to hold, it is sufficient that

(1 + δ)(αd)² ≤ (1 − δ(1 + 2p)) d²

that is

α ≤ √( (1 − δ(1 + 2p)) / (1 + δ) )

But 2(1 − α)d, or 2(1 − α), can be viewed as the absolute or the relative gap between the clusters. So if we expect a relative gap g = 2(1 − α) between the clusters, we have to choose δ in such a way that

1 − g/2 ≤ √( (1 − δ(1 + 2p)) / (1 + δ) )

Therefore

δ ≤ (1 − (1 − g/2)²) / ((1 − g/2)² + (1 + 2p))   (18)

So we see that the decision on the permitted error depends on the size of the gap between the clusters that we hope to observe.
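To get a feel for condition (18), the following small Python helper (an illustration, not from the paper) computes the largest admissible δ for a given relative gap g and quotient p; it reproduces the qualitative behaviour plotted later in Figure 6.

```python
def max_delta(g, p):
    """Largest error range delta satisfying (18) for relative gap g = 2(1 - alpha) and quotient p."""
    a2 = (1.0 - g / 2.0) ** 2          # alpha^2 = (1 - g/2)^2
    return (1.0 - a2) / (a2 + 1.0 + 2.0 * p)

# e.g. with balanced ball-shaped clusters (p = 1) and a 20% relative gap:
print(max_delta(0.2, 1.0))   # roughly 0.05, i.e. only small distortions are tolerable
```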
Lemma 3 allows us to prove Theorem 3 in a straightforward manner.

Proof. (Theorem 3) Observe that in this theorem we impose the condition of this lemma on each cluster. So all projected points are closer to their set centres than to any other centre. So the k-means algorithm would get stuck at this clustering, and hence we are at a local minimum.

Lemma 4. Let δ ∈ (0, 1), ε ∈ (0, 1). Let Q ⊂ R^n be a set of m points in an n-dimensional orthogonal coordinate system C_n and let the inequality (4) hold. Let C_{n'} be a randomly selected (via sampling from a normal distribution) n'-dimensional orthogonal coordinate system. For each x_i ∈ Q let x'_i be its projection onto C_{n'}. For any two k-means clusters C_1, C_2 in the projected space let µ'_1, µ'_2 be their centres in the projected space and µ_1, µ_2 be the centres of the corresponding sets of cluster members in the original space. Furthermore, let d be the distance of the first cluster centre to the common border of both clusters in the projected space, and let the closest point of the first cluster to this border in that space be at the distance αd from its cluster centre, where α ∈ [0, 1).

Then all points of the first cluster in the original space are (each) closer to the centre of the set of points of the first cluster than to the centre of the set of points of the second cluster in the original space if

δ ≤ (1 − (1 − g/2)²) / ((1 − g/2)² + (1 + 2p)) = (1 − α²) / ((1 + 2p) + α²)   (19)

where g = 2(1 − α), with probability of at least 1 − ε.

Proof. Consider a data point x' "close" to the border between the two neighbouring clusters in the projected space, on the line connecting the cluster centres, belonging to the first cluster, at a distance αd from its cluster centre, where d is the distance of the first cluster centre to the border and α ∈ (0, 1]. The squared distance between cluster centres, in the original space, can be "reduced" by the factor (1 + δ)^{-1} (beside the factor n'/n, which is common to all the points), whereas the squared distance of x to its cluster centre may be "increased" by the factor (1 − δ)^{-1}. This implies a relationship between the factor α and the error δ.

If x (in the original space) should not cross the border between the clusters, the following needs to hold:

‖x − µ_1‖² ≤ (1/4)‖µ_1 − µ_2‖²   (20)

which implies

(n'/n)‖x − µ_1‖² ≤ (n'/n)(1/4)‖µ_1 − µ_2‖²

As (see Lemma 1)

(n'/n)‖x − µ_1‖² ≤ (1 − δ)^{-1}‖x' − µ'_1‖² = (1 − δ)^{-1}(αd)²

and (see Lemma 2)

(n'/n)(1/4)‖µ_1 − µ_2‖² ≥ (1 + δ(1 + 2p))^{-1}(1/4)‖µ'_1 − µ'_2‖² = (1 + δ(1 + 2p))^{-1} d²

we know that, for inequality (20) to hold, it is sufficient that

(1 − δ)^{-1}(αd)² ≤ (1 + δ(1 + 2p))^{-1} d²

that is

α ≤ √( (1 − δ) / (1 + δ(1 + 2p)) )

But 2(1 − α)d, or 2(1 − α), can be viewed as the absolute or the relative gap between the clusters. So if we want to have a relative gap g = 2(1 − α) between the clusters, we have to choose δ in such a way that

1 − g/2 ≤ √( (1 − δ) / (1 + δ(1 + 2p)) )

Therefore

δ ≤ (1 − (1 − g/2)²) / ((1 − g/2)² + (1 + 2p))   (21)

Lemma 4 allows us to prove Theorem 4 in a straightforward manner.

Proof. (Theorem 4)
Observe that in this theorem we impose the condition of this lemma on each cluster. So all original-space points are closer to their set centres than to any other centre. So the k-means algorithm would get stuck at this clustering, and hence we are at a local minimum.

Having these results, we can go over to the proof of Theorem 5.

Proof. (Theorem 5)
Let C_G denote the clustering reaching the global optimum in the original space. Let C'_G denote the clustering reaching the global optimum in the projected space. From Theorem 2 we have that

(1 − δ) J(Q, C_G) ≤ (n/n') J(Q', C_G) ≤ (1 + δ) J(Q, C_G)

On the other hand

(1 + δ)^{-1} J(Q', C'_G) ≤ (n'/n) J(Q, C'_G) ≤ (1 − δ)^{-1} J(Q', C'_G)

C'_G is the global minimum in the projected space, hence J(Q', C'_G) ≤ J(Q', C_G). So

(n/n') J(Q', C'_G) ≤ (n/n') J(Q', C_G) ≤ (1 + δ) J(Q, C_G)

So

(n/n') J(Q', C'_G) ≤ (1 + δ) J(Q, C_G)

Note that analogously, C_G is the global minimum in the original space, hence J(Q, C_G) ≤ J(Q, C'_G) and

(n'/n) J(Q, C_G) ≤ (n'/n) J(Q, C'_G) ≤ (1 − δ)^{-1} J(Q', C'_G)   (22)

In the literature a number of notions of so-called clusterability have been introduced. Under these notions of clusterability, algorithms have been developed that cluster the data nearly optimally in polynomial time, when some constraints are matched by the clusterability parameters.

It seems therefore worthwhile to have a look at whether the aforementioned projection technique would affect the clusterability property of the data sets. Let us consider, as representatives, the following notions of clusterability present in the literature:

• Perturbation Robustness, meaning that small perturbations of distances / positions in space of set elements do not result in a change of the optimal clustering for that data set. Two brands may be distinguished: additive [2] and multiplicative ones [9] (the limit of perturbation is upper-bounded either by an absolute value or by a coefficient).

The s-Multiplicative Perturbation Robustness (0 < s < 1) holds for a data set with d being its distance function if the following holds. Let C be an optimal clustering of data points for this distance. Let d_1 be any distance function over the same set of points such that for any two points u, v, s · d(u, v) < d_1(u, v) < (1/s) · d(u, v). Then the same clustering C is optimal under the distance function d_1.

The s-Additive Perturbation Robustness (0 < s < 1) holds for a data set with d being its distance function if the following holds. Let C be an optimal clustering of data points for this distance. Let d_1 be any distance function over the same set of points such that for any two points u, v, d(u, v) − s < d_1(u, v) < d(u, v) + s. Then the same clustering C is optimal under the distance function d_1.

Subsequently we are interested only in the multiplicative version.

• σ-Separatedness [16], meaning that the cost J(Q, C_k) of the optimal clustering C_k of the data set Q into k clusters is less than σ² (0 < σ < 1) times the cost J(Q, C_{k−1}) of the optimal clustering C_{k−1} into k − 1 clusters: J(Q, C_k) < σ² J(Q, C_{k−1}).

• (c, σ)-Approximation-Stability [7], meaning that if the cost function values of two partitions C_a, C_b differ by at most the factor c > 1 (that is, c · J(Q, C_a) ≥ J(Q, C_b) and c · J(Q, C_b) ≥ J(Q, C_a)), then the distance (in some space) between the partitions is at most σ (d(C_a, C_b) < σ for some distance function d between partitions). As Ben-David [8] recalls, this implies the uniqueness of the optimal solution.
• β-Centre Stability [6], meaning, for any centric clustering, that the distance of an element to its cluster centre is β > 1 times smaller than the distance to any other cluster centre under the optimal clustering.

• (1 + β)-Weak Deletion Stability [5] (β > 0), meaning that, given the optimal cost function value OPT for k centric clusters, the cost function of a clustering obtained by deleting one of the cluster centres and assigning the elements of that cluster to one of the remaining clusters should be bigger than (1 + β) · OPT.

Let us first have a look at σ-Separatedness. Let C_{G,k} denote an optimal clustering into k clusters in the original space and C'_{G,k} in the projected space. From the properties of k-means we know that J(Q, C_{G,k}) ≤ J(Q, C_{G,k−1}) and J(Q', C'_{G,k}) ≤ J(Q', C'_{G,k−1}). From Theorem 5 we know that

(n/n') J(Q', C'_{G,k}) ≤ (1 + δ) J(Q, C_{G,k})

and

J(Q, C_{G,k−1}) ≤ (n/n')(1 − δ)^{-1} J(Q', C'_{G,k−1})

σ-Separatedness implies that

σ² ≥ J(Q, C_{G,k}) / J(Q, C_{G,k−1}) ≥ ( (n/n')(1 + δ)^{-1} J(Q', C'_{G,k}) ) / J(Q, C_{G,k−1})
≥ ( (n/n')(1 + δ)^{-1} J(Q', C'_{G,k}) ) / ( (n/n')(1 − δ)^{-1} J(Q', C'_{G,k−1}) )
= ( (1 − δ) J(Q', C'_{G,k}) ) / ( (1 + δ) J(Q', C'_{G,k−1}) )

This implies

σ² (1 + δ)/(1 − δ) ≥ J(Q', C'_{G,k}) / J(Q', C'_{G,k−1})

So we claim
Theorem 7.
Under the assumptions and notation of Theorem 2, if the data set Q has the property of σ-Separatedness in the original space, then with probability at least 1 − ε it has the property of σ·√((1 + δ)/(1 − δ))-Separatedness in the projected space.

The fact that this Separatedness increases is of course a deficiency, because clustering algorithms require as low a Separatedness as possible (because the clusters are then better separated).

Let us turn to the (c, σ)-Approximation-Stability. We can reformulate it as follows: if the distance (in some space) between the partitions is more than σ, then the cost function values of the two partitions differ by at least the factor c > 1. Consider now two partitions C_a, C_b with distance over σ in some abstract partition space, not related to the embedding spaces. Then in the original space the following must hold:

J(Q, C_b) ≥ c · J(Q, C_a)

Under the projection we get

(1 − δ)^{-1} (n/n') J(Q', C_b) ≥ c · (1 + δ)^{-1} (n/n') J(Q', C_a)

J(Q', C_b) ≥ c · ((1 − δ)/(1 + δ)) J(Q', C_a)

This result means that
Theorem 8.
Under the assumptions and notation of Theorem 2, if the data set Q has the property of (c, σ)-Approximation-Stability in the original space, then with probability at least 1 − ε it has the property of (c · (1 − δ)/(1 + δ), σ)-Approximation-Stability in the projected space.

Let us now consider s-Multiplicative Perturbation Robustness. We claim that

Lemma 5. If the data set Q has the property of √s-Multiplicative Perturbation Robustness under the distance √d, and the set Q_p is its perturbation with distance √d_p such that ν d ≤ d_p ≤ (1/ν) d, and s = ν · s_p, where 0 < ν, s_p < 1, then the set Q_p has the property of √s_p-Multiplicative Perturbation Robustness.

Proof. Apparently Q_p is a perturbation of Q such that both share the same optimal clustering. Let Q_q be a perturbation of Q_p, with distance √d_q, such that s_p d_p ≤ d_q ≤ (1/s_p) d_p. Then

s d = s_p ν d ≤ s_p d_p ≤ d_q ≤ (1/s_p) d_p ≤ (1/(s_p ν)) d = (1/s) d

that is, Q_q is a perturbation of Q such that both share the same optimal clustering. So Q_p and Q_q share a common optimal clustering, hence Q_p has the property of √s_p-Multiplicative Perturbation Robustness.

We claim that

Lemma 6.
Under the assumptions and notation of Theorem 2, if the data set Q has the property of √s-Multiplicative Perturbation Robustness with s < 1 − δ, and if C_G is the global optimum of k-means in Q, then it is also the global optimum in Q' with probability at least 1 − ε.

Proof.
Assume the contrary, that is, that in Q' some other clustering C'_G is the global optimum. Let us define the distances √(d(i, j)) = ‖x_i − x_j‖ and √(d_1(i, j)) = √(n/n') ‖x'_i − x'_j‖. The distance √d_1 is a realistic distance in the coordinate system C_{n'}, as we assume n > n'. As the k-means optimum does not change under rescaling, C'_G is also an optimal solution for the clustering task under d_1. But

s d(i, j) < (1 − δ) d(i, j) ≤ d_1(i, j) ≤ (1 + δ) d(i, j) < (1 − δ)^{-1} d(i, j) < s^{-1} d(i, j)

hence the distance √d_1 is a perturbation of √d, and hence C_G should be optimal under √d_1 also. We get a contradiction. So the claim of the lemma must be true.

This implies that

Theorem 9.
Under the assumptions and notation of Theorem 2, if the data set Q has the property of √s-Multiplicative Perturbation Robustness with factor s < s_p ν (1 − δ)/(1 + δ) (0 < s_p, ν < 1) in the original space, then with probability at least 1 − ε it has the property of √s_p-Multiplicative Perturbation Robustness in the projected space.

Proof. Lemma 6 implies that the global optima of the original and of the projected space are identical. So assume that in the original space, for the distance √(d(i, j)) = ‖x_i − x_j‖, C_G is the optimal clustering. Then under the projection √(d'(i, j)) = ‖x'_i − x'_j‖ we have the same optimal clustering. For a perturbation with factor s_p in the projected space define the distance √(d'_1(i, j)) = ‖y'_i − y'_j‖, where for any i, y'_i is a perturbation of x'_i. We will be done if we can demonstrate that √d'_1 yields the same optimum in the projected space as √d' does. For any i let y_i be some point in the original space such that y'_i is its projection onto the projected space. We will treat y_i as an image of x_i and will subsequently show that the set of these points y_i can be treated as a perturbation of the x_i with the factor √s.

For each counterpart √(d_1(i, j)) = ‖y_i − y_j‖ of √d'_1 in the original space,

(1 + δ)^{-1} (n/n') d'_1(i, j) ≤ d_1(i, j) ≤ (1 − δ)^{-1} (n/n') d'_1(i, j)

holds. As s_p d'(i, j) ≤ d'_1(i, j) ≤ s_p^{-1} d'(i, j) and (1 − δ) d(i, j) ≤ (n/n') d'(i, j) ≤ (1 + δ) d(i, j), we obtain

s d(i, j) < (1 + δ)^{-1} s_p (1 − δ) d(i, j) ≤ (1 + δ)^{-1} s_p (n/n') d'(i, j) ≤ (1 + δ)^{-1} (n/n') d'_1(i, j) ≤ d_1(i, j)
≤ (1 − δ)^{-1} (n/n') d'_1(i, j) ≤ (1 − δ)^{-1} s_p^{-1} (n/n') d'(i, j) ≤ (1 − δ)^{-1} s_p^{-1} (1 + δ) d(i, j) < s^{-1} d(i, j)

So √d_1 is a perturbation of √d with the factor √s. √d is √s-multiplicative perturbation robust, therefore both have the same optimal solution C_G. Furthermore, √d_1 has the property of √(ν(1 − δ))-Multiplicative Perturbation Robustness (see Lemma 5). Therefore its counterpart √d'_1 has the same optimum clustering C_G as √d_1 (see Lemma 6), hence as √d, hence as √d'.

Recall that √d'_1 was selected as an arbitrary perturbation of √d' with factor √s_p, and it turned out that it yields the same optimal solution as √d'. So with high probability (the factor 2 is taken as we deal with two data sets, comprising the points x_i and y_i), √d' possesses √s_p-Multiplicative Perturbation Robustness in the projected space.

We claim

Theorem 10.
Under the assumptions and notation of Theorem 2, if the data set Q has both the property of β-Centre Stability and of √s-Multiplicative Perturbation Robustness with s < 1 − δ in the original space, then with probability at least 1 − ε it has the property of β·√((1 − δ)/(1 + δ))-Centre Stability in the projected space.

Proof. The √s-Multiplicative Perturbation Robustness ensures that both the original and the projected space share the same optimal clustering C.

Consider a data point x_i and a cluster C ∈ C not containing i. Then x_i, µ(C) and µ(C ∪ {i}) are collinear. So are x'_i, µ'(C) and µ'(C ∪ {i}), that is, the respective (linear) projections. Furthermore ‖x_i − µ(C ∪ {i})‖ / ‖µ(C) − µ(C ∪ {i})‖ = |C|, hence ‖x_i − µ(C)‖ = ((|C| + 1)/|C|)‖x_i − µ(C ∪ {i})‖. Likewise ‖x'_i − µ'(C ∪ {i})‖ / ‖µ'(C) − µ'(C ∪ {i})‖ = |C|.

Upon projection, the distance to the own cluster centre can increase relatively by √(1 + δ), and the distance to the C ∪ {i} centre can decrease by √(1 − δ), see Lemma 1. That means ‖x'_i − µ'(C(i))‖² ≤ (1 + δ)(n'/n)‖x_i − µ(C(i))‖² and (1 − δ)^{-1}‖x'_i − µ'(C ∪ {i})‖² ≥ (n'/n)‖x_i − µ(C ∪ {i})‖². Due to the aforementioned relations, ‖x'_i − µ'(C)‖² ≥ (1 − δ)(n'/n)‖x_i − µ(C)‖². Due to β-Centre Stability in the original space we had β²‖x_i − µ(C(i))‖² < ‖x_i − µ(C)‖². Due to the aforementioned relations we have

‖x'_i − µ'(C)‖² ≥ (1 − δ)(n'/n)‖x_i − µ(C)‖² > β²(1 − δ)(n'/n)‖x_i − µ(C(i))‖² ≥ β²((1 − δ)/(1 + δ))‖x'_i − µ'(C(i))‖²

That is, ‖x'_i − µ'(C)‖² > β²((1 − δ)/(1 + δ))‖x'_i − µ'(C(i))‖². Hence the data centre stability can drop to β·√((1 − δ)/(1 + δ)).

We claim

Theorem 11.
Under the assumptions and notation of Theorem 2, if the data set Q has both the property of (1 + β)-Weak Deletion Stability and of √s-Multiplicative Perturbation Robustness with s < 1 − δ in the original space, then with probability at least 1 − ε it has the property of (1 + β)((1 − δ)/(1 + δ))-Weak Deletion Stability in the projected space.

Proof.
The √s-Multiplicative Perturbation Robustness ensures that both the original and the projected space share the same optimal clustering. Let this optimal clustering be called C_o. By C denote any clustering obtained from C_o by deleting one of the cluster centres and assigning the elements of that cluster to one of the remaining clusters. By the assumption of (1 + β)-Weak Deletion Stability, (1 + β) J(Q, C_o) ≤ J(Q, C).

Theorem 2 implies that (1 − δ)(n'/n) J(Q, C) ≤ J(Q', C) and (1 + δ)^{-1} J(Q', C_o) ≤ (n'/n) J(Q, C_o). Therefore

J(Q', C) ≥ (1 − δ)(n'/n) J(Q, C) ≥ (1 + β)(1 − δ)(n'/n) J(Q, C_o) ≥ (1 + β)(1 − δ)(1 + δ)^{-1} J(Q', C_o)

which implies the claim.

Table 1: Dependence of reduced dimensionality n' on sample size m. Other parameters fixed at ε = 0.01, δ = 0.05, n = 5e+05.

m        n' explicit   n' implicit   explicit/implicit
10       15226         14209         1.07
20       17518         16389         1.07
50       20547         19191         1.07
100      22839         21269         1.07
200      25131         23323         1.08
500      28160         26016         1.08
1000     30452         28030         1.09
2000     32744         30027         1.09
5000     35773         32648         1.1
10000    38065         34609         1.1
20000    40357         36554         1.1
50000    43386         39097         1.11
1e+05    45678         41017         1.11
2e+05    47970         42910         1.12
5e+05    50999         45392         1.12
1e+06    53291         47250         1.13
2e+06    55582         49099         1.13
5e+06    58612         51515         1.14
2e+07    63195         55127         1.15
5e+07    66225         57480         1.15
1e+08    68516         59243         1.16

Table 2: Dependence of reduced dimensionality n' on failure prob. ε. Other parameters fixed at m = 2e+06, δ = 0.05, n = 5e+05.

ε        n' explicit   n' implicit   explicit/implicit
0.1      51776         46020         1.13
0.05     52922         46955         1.13
0.02     54437         48180         1.13
0.01     55582         49099         1.13
0.005    56728         50014         1.13
0.002    58243         51221         1.14
0.001    59389         52134         1.14

Figure 1: Dependence of reduced dimensionality n' on sample size m. Other parameters fixed at ε = 0.01, δ = 0.05, n = 5e+05 (n': black = explicit, green = implicit).

Figure 2: Dependence of reduced dimensionality n' on failure prob. ε. Other parameters fixed at m = 2e+06, δ = 0.05, n = 5e+05 (n': black = explicit, green = implicit).

Table 3: Dependence of reduced dimensionality n' on error range δ. Other parameters fixed at m = 2e+06, ε = 0.01, n = 5e+05.

δ        n' explicit   n' implicit   explicit/implicit
0.5      712           697           1.02
0.4      1059          1032          1.03
0.3      1787          1745          1.02
0.2      3804          3692          1.03
0.1      14339         13640         1.05
0.09     17593         16631         1.06
0.08     22128         20742         1.07
0.07     28721         26604         1.08
0.06     38846         35329         1.1
0.05     55582         49099         1.13
0.04     86291         72387         1.19
0.03     152415        115298        1.32
0.02     340701        201059        1.69
0.01     1353858       1353859       1

Table 4: Dependence of reduced dimensionality n' on original dimensionality n. Other parameters fixed at m = 2e+06, ε = 0.01, δ = 0.05.

n        n' explicit   n' implicit   explicit/implicit
4e+05    55582         47891         1.16
5e+05    55582         49099         1.13
6e+05    55582         49933         1.11
7e+05    55582         50551         1.1
8e+05    55582         51025         1.09
9e+05    55582         51399         1.08
1e+06    55582         51703         1.08

Figure 3: Dependence of reduced dimensionality n' on error range δ.
Other parameters fixed at m = 2e+06, ε = 0.01, n = 5e+05 (n': black = explicit, green = implicit).

Figure 4: Dependence of reduced dimensionality n' on original dimensionality n. Other parameters fixed at m = 2e+06, ε = 0.01, δ = 0.05 (n': black = explicit, green = implicit).

Figure 5: Discrepancy between projected and original squared distances between points in the sample, expressed as their quotient adjusted by n/n'. Parameters fixed at m = 5000, ε = 0.1, δ = 0.2, n = 5000, n' = 2188.

Numerical Experiments on Some Aspects of Our Approach
Note that we have two formulas for computing the reduced space dimensionality n': the formula (11) and the formula (4). The latter does not engage the original dimensionality n, and it is explicit in n'. The value of n' in the former depends on n, and n' can only be computed from it iteratively.

Let us investigate the differences between the n' computations in both cases. Let us check the impact of the following parameters: n, the original dimensionality (see Table 4 and Figure 4); δ, the limitation of the deviation of the distances between data points in the original and the reduced space (see Table 3 and Figure 3); m, the sample size (see Table 1 and Figure 1); as well as ε, the maximum failure probability of the "JL" transformation (see Table 2 and Figure 2). Note that in all figures the X-axis is on a log scale.

As visible in Figure 4, the value of n' from the explicit formula does not depend on the original dimensionality n. The value computed from the implicit formula approaches the explicit value quite quickly with growing dimensionality n.

On the other hand, the implicit n' departs from the explicit one with growing sample size m, as visible in Figure 1. Both grow with increasing m.

In Figure 2 we see that when we increase the acceptable failure rate ε, the requested dimensionality n' drops, whereby the implicit one approaches the explicit one.

Figure 3 shows that the requested dimensionality drops quite quickly with increased relative error range δ, till a kind of saturation is achieved. At the extreme ends of δ the implicit and explicit n' formulas converge to one another.

The behaviour of the explicit n' is not surprising, as it is visible directly from the formula (4). The important insight here is, however, the required dimensionality of the projected data: hundreds of thousands of dimensions for realistic ε, δ. So the random projection via the Johnson-Lindenstrauss Lemma is not yet another dimensionality reduction technique. It is suitable for cases where techniques like PCA are not feasible computationally.

The behaviour of the implicit n' for the case of increasing original dimensionality n is as expected: the explicit n' reflects the "in the limit" behaviour of the implicit formulation. The convergence for extreme values of δ is intriguing. The discrepancy for ε and the divergence for growing m indicate that there is still space for better explicit formulas for n'. It is especially worth investigating for increasing m, as the processing becomes more expensive in the original space when m is increasing.

In order to give an impression of how effective the random projection is, see Figure 5. It illustrates the distribution of discrepancies between squared distances in the projected and in the original spaces. The discrepancies are expressed as the quotient ‖f(u) − f(v)‖² / ‖u − v‖², adjusted by the factor n/n'. One can see that they correspond quite well to the imposed constraints.

Figure 6: Permissible error range δ under various assumed gaps between the clusters.

As for the application to k-means clustering, we see in Figure 6 that the bigger the relative gap between clusters, the larger the error value δ that is permitted, if class membership shall not be distorted by the projection.
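For reference, the two ways of computing n' compared above can be sketched in a few lines of Python (illustrative code, not the script used to produce the tables; small rounding differences with respect to the tabulated values are to be expected):

```python
import math

def n_prime_explicit(m, eps, delta):
    """Explicit bound (4): n' >= 2(-ln eps + 2 ln m) / (delta - ln(1 + delta))."""
    return math.ceil(2.0 * (-math.log(eps) + 2.0 * math.log(m)) / (delta - math.log1p(delta)))

def n_prime_implicit(m, eps, delta, n):
    """Smallest n' for which m(m-1) * max over delta* in {-delta, +delta} of
    (1 - delta*)^(n'/2) (1 + delta* n'/(n - n'))^((n - n')/2) drops below eps
    (the n-dependent bound behind (11)), found by a simple linear scan."""
    log_target = math.log(eps) - math.log(m) - math.log(m - 1)
    for np_ in range(2, int(n)):
        worst = max(
            0.5 * np_ * math.log(1.0 - ds) + 0.5 * (n - np_) * math.log1p(ds * np_ / (n - np_))
            for ds in (-delta, delta)
        )
        if worst <= log_target:
            return np_
    return None

print(n_prime_explicit(2e6, 0.01, 0.05))        # about 5.6e4, cf. the explicit column of Tables 1-2
print(n_prime_implicit(2e6, 0.01, 0.05, 5e5))   # about 4.9e4, cf. the implicit column
```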
Note that if we set ε (close) to 1 and expand by the Taylor method the ln function in the denominator of the inequality (4) up to three terms, then we obtain n' from equation (2.1) of the paper [11]:

n' ≥ 4 ln m / (δ²/2 − δ³/3)

Table 5: Comparison of the effort needed for k-means under our dimensionality reduction approach and that of Dasgupta and Gupta [11], depending on sample size m. Other parameters fixed at ε = 0.01, δ = 0.05, n = 5e+05.

m        n' explicit   n' implicit   Gupta n'   Their repetitions   Our n' to their n'
10       15226         14209         3879       44                  3.7
20       17518         16389         5046       90                  3.3
50       20547         19191         6589       228                 3
100      22839         21269         7757       459                 2.8
200      25131         23323         8924       919                 2.7
500      28160         26016         10467      2301                2.5
1000     30452         28030         11635      4603                2.5
2000     32744         30027         12802      9209                2.4
5000     35773         32648         14345      23024               2.3
10000    38065         34609         15513      46050               2.3
20000    40357         36554         16680      92102               2.2
50000    43386         39097         18223      230257              2.2
1e+05    45678         41017         19391      460515              2.2
2e+05    47970         42910         20558      921032              2.1
5e+05    50999         45392         22101      2302583             2.1
1e+06    53291         47250         23269      4605168             2.1
2e+06    55582         49099         24436      9210339             2.1
5e+06    58612         51515         25979      23025849            2
2e+07    63195         55127         28314      92103402            2
5e+07    66225         57480         29857      230258508           2
1e+08    68516         59243         31025      460517014           2

Table 6: Comparison of the effort needed for k-means under our dimensionality reduction approach and that of Dasgupta and Gupta [11], depending on failure prob. ε. Other parameters fixed at m = 2e+06, δ = 0.05, n = 5e+05.

ε        n' explicit   n' implicit   Gupta n'   Their repetitions   Our n' to their n'

Table 7: Comparison of the effort needed for k-means under our dimensionality reduction approach and that of Dasgupta and Gupta [11], depending on error range δ. Other parameters fixed at m = 2e+06, ε = 0.01, n = 5e+05.

δ        n' explicit   n' implicit   Gupta n'   Their repetitions   Our n' to their n'

Table 8: Comparison of the effort needed for k-means under our dimensionality reduction approach and that of Dasgupta and Gupta [11], depending on original dimensionality n. Other parameters fixed at m = 2e+06, ε = 0.01, δ = 0.05.

n        n' explicit   n' implicit   Gupta n'   Their repetitions   Our n' to their n'

Note, however, that setting ε to a value close to 1 does not make sense, as we want to keep rare the event that the data do not fit the interval we are imposing.

Though one may be tempted to view our results as formally similar to those of Dasgupta and Gupta, there is one major difference. Let us first recall that the original proof of Johnson and Lindenstrauss [13] is probabilistic, showing that projecting the m-point subset onto a random subspace of O(ln m / δ²) dimensions only changes the (squared) distances between points by a factor of at most 1 ± δ, with positive probability. Dasgupta and Gupta showed that this probability is at least 1/m, which is not much indeed. In order to get the failure probability ε below, say, 0.05%, one needs to repeat the random projection and the checking of distances r times, with such r that ε > (1 − 1/m)^r. In the case of m = 1,000 this means several thousand repetitions, and with m = 1,000,000 several million repetitions.

In this paper we have shown that this success probability can be raised to 1 − ε for an ε given in advance. Hereby the increase of target dimensionality is small enough, compared to the Dasgupta and Gupta formula, that our random projection method is orders of magnitude more efficient. A detailed comparison is contained in Tables 5, 6, 7 and 8. In these tables we present n' computed using our formulas alongside the values proposed by Dasgupta and Gupta, as well as the required number of repetitions of the projection onto sampled subspaces needed in order to obtain faithful distance discrepancies with reasonable probability. Dasgupta and Gupta generally obtain a several times lower number of dimensions.
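The repetition counts reported in Table 5 follow directly from the 1/m per-sample success guarantee recalled above; a small illustrative sketch (not the paper's code):

```python
import math

def dg_repetitions(m, eps):
    """Smallest r with (1 - 1/m)^r <= eps, i.e. the number of resamplings of the
    projection needed so that at least one of them preserves all pairwise
    distances with probability at least 1 - eps, given a 1/m success chance
    per sample. Each repetition additionally requires checking ~m^2/2 distances."""
    return math.ceil(math.log(eps) / math.log(1.0 - 1.0 / m))

print(dg_repetitions(10, 0.01))    # 44, cf. Table 5
print(dg_repetitions(2e6, 0.01))   # about 9.2 million, cf. Table 5
```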
However, as stated in the introduction, the number of repeated samplings annihilates this advantage, and in fact a much higher burden is to be expected when clustering.

Note that the choice of n' has been estimated by [1] as

n' ≥ (4 + 2γ) ln m / (δ²/2 − δ³/3)

where γ is some positive number. They propose a projection based on two or three discrete values randomly assigned, instead of values drawn from the normal distribution. With the quantity γ they control the probability that a single element of the set Q leaves the predefined interval ±δ. They do not bother about controlling the probability that none of the elements leaves the interval of interest. Rather, they derive expected values of various moments.

Larsen and Nelson [14] concentrate on finding the highest value of n' for which the Johnson-Lindenstrauss Lemma does not hold, demonstrating that the value they found is the tightest even for non-linear mappings f. Though not directly related to our research, they discuss the other side of the coin, that is, the dimensionality below which at least one point of the data set has to violate the constraints.

In this paper we investigated a novel aspect of the well known and widely explored and exploited Johnson-Lindenstrauss Lemma on the possibility of dimensionality reduction by projection onto a random subspace. The original formulation means in practice that we have to check whether or not we have found a proper transformation f leading to error bounds within the required range for all pairs of points, and if necessary (and it is theoretically necessary very frequently), to repeat the random projection process over and over again.

We have shown here that it is possible to determine in advance the choice of dimensionality in the random projection process so as to assure, with the desired certainty, that none of the points of the data set violates the restrictions on the error bounds. This new formulation can be of importance for many data mining applications, like clustering, where the distortion of distances influences the results in a subtle way (e.g. k-means clustering).

Via some numerical examples we have pointed at the real application areas of this kind of projections, that is, problems with a high number of dimensions, starting with dozens of thousands and hundreds of thousands of dimensions.

Additionally, our reformulation of the JL Lemma permits the preservation of some well-known clusterability properties under the projection.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins.
J. Comput. Syst. Sci., 66(4):671-687, June 2003.

[2] Margareta Ackerman and Shai Ben-David. Clusterability: A theoretical study. In David van Dyk and Max Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 1-8, Hilton Clearwater Beach Resort, Clearwater Beach, Florida, USA, 16-18 Apr 2009. PMLR.

[3] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Thirty-eighth Annual ACM Symposium on Theory of Computing, STOC '06, pages 557-563, New York, NY, USA, 2006. ACM.

[4] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In N. Bansal, K. Pruhs, and C. Stein, editors, Proc. of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, pages 1027-1035, New Orleans, Louisiana, USA, 7-9 Jan. 2007. SIAM.

[5] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Stability yields a PTAS for k-median and k-means clustering. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, pages 309-318, Washington, DC, USA, 2010. IEEE Computer Society.

[6] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Inf. Process. Lett., 112(1-2):49-54, January 2012.

[7] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009, pages 1068-1077, 2009.

[8] Shai Ben-David. Computational feasibility of clustering under clusterability assumptions. https://arxiv.org/abs/1501.00437, 2015.

[9] Yonatan Bilu and Nathan Linial. Are stable instances easy? Comb. Probab. Comput., 21(5):643-660, September 2012.

[10] Khai X. Chiong and Matthew Shum. Random projection estimation of discrete-choice models with large choice sets. arXiv:1604.06036, 2016.

[11] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60-65, January 2003.

[12] Piotr Indyk and Assaf Naor. Nearest-neighbor-preserving embeddings. ACM Trans. Algorithms, 3(3), August 2007.

[13] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982). Also appeared in volume 26 of Contemp. Math., pages 189-206. Amer. Math. Soc., Providence, RI, 1984.

[14] Kasper Green Larsen and Jelani Nelson. Optimality of the Johnson-Lindenstrauss lemma. CoRR, abs/1609.02094, 2016.

[15] Jiří Matoušek. On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms, 33(2):142-156, 2008.

[16] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem.