New bounds for k-means and information k-means
By Gautier Appert and Olivier Catoni

SAMM, Université Paris 1 Panthéon-Sorbonne, France, [email protected]

CNRS – CREST, UMR 9194, Université Paris Saclay, France, [email protected]
In this paper, we derive a new dimension-free non-asymptotic upper bound for the quadratic $k$-means excess risk related to the quantization of an i.i.d. sample in a separable Hilbert space. We improve the bound of order $\mathcal{O}\bigl(k/\sqrt{n}\bigr)$ of Biau, Devroye and Lugosi, recovering the rate $\sqrt{k/n}$ that has already been proved by Fefferman, Mitter, and Narayanan and by Klochkov, Kroshnin and Zhivotovskiy, but with worse log factors and constants. More precisely, we bound the mean excess risk of an empirical minimizer by the explicit upper bound $16 B^2 \log(n/k) \sqrt{k \log(k)/n}$ in the bounded case when $P\bigl(\|X\| \le B\bigr) = 1$. This is essentially optimal up to logarithmic factors, since a lower bound of order $\mathcal{O}\bigl(\sqrt{k^{1-4/d}/n}\bigr)$ is known in dimension $d$. Our technique of proof is based on the linearization of the $k$-means criterion through a kernel trick and on PAC-Bayesian inequalities. To get a $1/\sqrt{n}$ speed, we introduce a new PAC-Bayesian chaining method replacing the concept of $\delta$-net with the perturbation of the parameter by an infinite-dimensional Gaussian process.

In the meantime, we embed the usual $k$-means criterion into a broader family built upon the Kullback divergence and its underlying properties. This results in a new algorithm that we named information $k$-means, well suited to the clustering of bags of words. Based on considerations from information theory, we also introduce a new bounded $k$-means criterion that uses a scale parameter but satisfies a generalization bound that does not require any boundedness or even integrability conditions on the sample. We describe the counterpart of Lloyd's algorithm and prove generalization bounds for these new $k$-means criteria.

General notation.
We will use the following notation throughout this document. On some measurable probability space $\Omega$, we will consider various random variables $X : \Omega \to \mathcal{X}$, $Y : \Omega \to \mathcal{Y}$, etc., that are nothing but measurable functions. We will also consider several probability measures on $\Omega$, and typically two measures $P$ and $Q \in \mathcal{M}(\Omega)$, where $P$ describes the usually unknown data distribution and $Q$ describes an estimation of $P$. Then we will use the short notation $P_X$ for the push-forward measure $P \circ X^{-1}$, that is the law of $X$. Similarly we will let $Q_X = Q \circ X^{-1}$. In the same way $P_{X,Y} \in \mathcal{M}(\mathcal{X} \times \mathcal{Y})$ will be the joint distribution of the couple $(X, Y)$ under $P$, and $P_{Y \mid X}$ the corresponding regular conditional probability measure of $Y$ knowing $X$, when it exists. We will always work under sufficient hypotheses to ensure that the decomposition

(1) $P_{X,Y} = P_X \, P_{Y \mid X}$

is valid, meaning that for any bounded measurable function $f(X, Y)$,
\[ \int f \, \mathrm{d}P_{X,Y} = \int \biggl( \int f \, \mathrm{d}P_{Y \mid X} \biggr) \mathrm{d}P_X. \]

MSC2020 subject classifications:
Primary 62H30, 62H20, 62C99; secondary 62B10.
Keywords and phrases: $k$-means criterion, vector quantization, PAC-Bayesian bounds, chaining, dimension-free bounds, empirical risk minimization, Hilbert space, kernel trick, centroid, Kullback-Leibler divergence, information theory.

Moreover, we will use the short notation $\int f \, \mathrm{d}P_{X,Y} = P_{X,Y}(f)$, so that the previous formula becomes
\[ P_{X,Y}(f) = P_X \bigl[ P_{Y \mid X}(f) \bigr]. \]
We will often use the Kullback-Leibler divergence
\[ \mathcal{K}(Q, P) = \begin{cases} Q \Bigl[ \log \Bigl( \dfrac{\mathrm{d}Q}{\mathrm{d}P} \Bigr) \Bigr] & \text{when } Q \ll P, \\ +\infty & \text{otherwise.} \end{cases} \]
We will always be in this article in a situation where the decomposition stated in the following lemma is valid.

LEMMA 1.
\[ \mathcal{K}\bigl(Q_{X,Y}, P_{X,Y}\bigr) = \mathcal{K}\bigl(Q_X, P_X\bigr) + Q_X \bigl[ \mathcal{K}\bigl(Q_{Y \mid X}, P_{Y \mid X}\bigr) \bigr] = \mathcal{K}\bigl(Q_Y, P_Y\bigr) + Q_Y \bigl[ \mathcal{K}\bigl(Q_{X \mid Y}, P_{X \mid Y}\bigr) \bigr]. \]

PROOF. It follows from the decomposition (1). A precise statement and a rigorous proof dealing with measurability issues can be found in [9, Appendix section 1.7 page 50].
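As a concrete illustration of Lemma 1 (not part of the paper's argument), the chain rule for the Kullback divergence can be checked numerically on finite spaces. The function and the two example joint distributions below are ours.

```python
import numpy as np

def kl(q, p):
    """Kullback divergence K(q, p) between finite distributions, in nats."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    support = q > 0
    if np.any(p[support] == 0):
        return np.inf          # q is not absolutely continuous with respect to p
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

# Two joint distributions on a 2 x 3 product space (rows: values of X, columns: values of Y).
P = np.array([[0.20, 0.10, 0.10],
              [0.05, 0.25, 0.30]])
Q = np.array([[0.10, 0.15, 0.15],
              [0.10, 0.20, 0.30]])

P_X, Q_X = P.sum(axis=1), Q.sum(axis=1)                 # marginals of X
P_cond, Q_cond = P / P_X[:, None], Q / Q_X[:, None]     # conditional laws of Y given X = x

lhs = kl(Q.ravel(), P.ravel())                          # K(Q_{X,Y}, P_{X,Y})
rhs = kl(Q_X, P_X) + sum(Q_X[x] * kl(Q_cond[x], P_cond[x]) for x in range(len(Q_X)))
print(lhs, rhs)                                         # the two quantities coincide (Lemma 1)
```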
1. Introduction.
This paper is about the most widely used loss function for vector quantization, the $k$-means criterion. We will be interested in the statistical setting where the problem is to minimize the criterion for a random vector whose distribution is unknown but can be estimated through an i.i.d. random sample. Our main contribution will be to prove a new dimension-free non-asymptotic generalization bound with a better dependence on $k$ and $n$, where $k$ is the number of centers used for vector quantization and $n$ is the size of the statistical sample. We will also give an interpretation of the $k$-means criterion in terms of the Kullback divergence and use it to embed it in a broader family of criteria with interesting properties. This will provide a specific algorithm for the quantization of conditional probability distributions ranging in an exponential family that can be used in particular to analyse bag of words models. This generalization will also provide a new robust criterion for the quantization of unbounded random vectors.

Our general setting is the following. Given a random variable $X \in H$ ranging in a separable Hilbert space $H$, we are interested in minimizing the risk function
\[ R(c_1, \dots, c_k) = P_X \Bigl( \min_{j \in \llbracket 1, k \rrbracket} \|X - c_j\|^2 \Bigr), \qquad (c_1, \dots, c_k) \in H^k. \]
We will assume that the statistician does not know the distribution $P_X$, but has access instead to a sample $(X_1, \dots, X_n)$ made of $n$ independent copies of $X$. If $\overline{P}_X = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ is the empirical measure of the sample, the empirical risk, or empirical $k$-means criterion, is defined as
\[ \overline{R}(c) = \overline{P}_X \Bigl( \min_{j \in \llbracket 1, k \rrbracket} \|X - c_j\|^2 \Bigr) = \frac{1}{n} \sum_{i=1}^n \Bigl( \min_{j \in \llbracket 1, k \rrbracket} \|X_i - c_j\|^2 \Bigr), \qquad c \in H^k. \]

We will first consider the bounded case. Given a ball $\mathcal{B} = \bigl\{ x \in H : \|x\| \le B \bigr\}$, we will assume that $P\bigl( X \in \mathcal{B} \bigr) = 1$. We will study the upper deviations of the random variable
\[ \sup_{c \in \mathcal{B}^k} \bigl[ R(c) - \overline{R}(c) \bigr]. \]
This will provide an observable upper bound for the risk that is uniform with respect to the choice of centers $c \in \mathcal{B}^k$. Being uniform with respect to $c$ covers the case where the centers have been computed from the observed sample through some algorithm. In order to study the excess risk of estimators, we will also study the upper deviations of
\[ \sup_{c \in \mathcal{B}^k} \bigl[ R(c) - R(c^*) - \overline{R}(c) + \overline{R}(c^*) \bigr]. \]
This random variable compares, uniformly with respect to $c$, the excess risk of $c$ with respect to a non-random reference $c^*$ and the corresponding empirical excess risk. In particular, the non-random reference $c^*$ can be chosen to be a minimizer, or more generally an $\varepsilon$-minimizer, of the risk $R$.

Indeed, we will consider $\widehat{c} \in \mathcal{B}^k$, depending on the sample, such that
\[ \overline{R}(\widehat{c}) \le \inf_{c \in \mathcal{B}^k} \overline{R}(c) + \varepsilon, \]
and provide a bound for the excess risk $R(\widehat{c}) - \inf_{c \in \mathcal{B}^k} R(c)$. To complement deviation bounds, we will also provide corresponding bounds in expectation.

Regarding the sample size $n$, we obtain a speed of order $\mathcal{O}\bigl(1/\sqrt{n}\bigr)$ as in [6], [16] and [17]. However, we get a better dependence in $k$, with a rate of convergence of $\mathcal{O}\bigl(\sqrt{k/n}\bigr)$ up to log factors. This is essentially optimal up to log factors, at least in infinite dimension, since minimax lower bounds for the excess $k$-means risk are of order $\mathcal{O}\bigl(\sqrt{k^{1-4/d}/n}\bigr)$ in dimension $d$, see [4] and [1].
We should mention that the speed $\sqrt{k/n}$ has already been established in [15] (see Lemma 6), [18] and [21], but with worse log factors and less explicit constants. These bounds will be obtained using PAC-Bayesian inequalities combined with a kernel trick and a new kind of PAC-Bayesian chaining method that we developed. In particular, borrowing ideas from the construction of the isonormal Gaussian process [26, section 3.5], we will use the distribution of an infinite sequence of shifted Gaussian random variables both for the prior and the posterior parameter distribution. We will also use some arguments from the proofs of [11] and [12], concerning the estimation of the mean of a random vector. Furthermore, we take inspiration from the classical chaining procedure for bounding the expected suprema of sub-Gaussian processes (see section 13.1 in [7]). We create a PAC-Bayesian version of chaining in which the concept of $\delta$-net and $\delta$-covering is replaced by the use of a sequence of Gaussian perturbations parametrized by a variance ranging on a logarithmic grid. We combine this PAC-Bayesian chaining with the use of the influence function $\psi$ described in [10] to decompose the excess risk into a sub-Gaussian part and another part representing extreme values. It is worth mentioning that we will work with weak hypotheses and will in particular not consider the kind of margin assumptions that are necessary to get bounds decreasing faster than $\sqrt{1/n}$ for a given value of $k$, as in [22], [23], [24], and [25].
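Purely as an illustration of the objects introduced in this section (not part of the paper), the empirical $k$-means criterion $\overline{R}(c)$ can be computed as follows for a finite-dimensional sample; the function name is ours.

```python
import numpy as np

def empirical_k_means_risk(X, centers):
    """Empirical k-means criterion: (1/n) * sum_i min_j ||X_i - c_j||^2."""
    # X has shape (n, d); centers has shape (k, d), one row per center c_j.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, k) matrix
    return float(sq_dists.min(axis=1).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # i.i.d. sample of size n = 500 in dimension d = 10
c = rng.normal(size=(3, 10))          # k = 3 candidate centers
print(empirical_k_means_risk(X, c))
```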
2. Extensions of the $k$-means criterion. Before proving generalization bounds, let us embed the $k$-means criterion in a broader family of risk functions.

We will do this while considering the square of the Euclidean distance, or more generally in possibly infinite dimension the square of the Hilbert norm, as the Kullback divergence between two Gaussian measures. In this interpretation, vector quantization according to the $k$-means criterion will appear as a special case of conditional probability measure quantization according to an entropy criterion.

To describe things at a more technical level, we need first to define the classification function underlying vector quantization. To a set of centers $c \in H^k$ indeed corresponds a classification function $\ell : H \to \llbracket 1, k \rrbracket$ into Voronoï cells, defined as
\[ \ell(x) = \arg\min_{j \in \llbracket 1, k \rrbracket} \|x - c_j\|^2. \]
This definition may not be unique if the minimum is reached more than once, in which case we make an arbitrary choice, as for instance
\[ \ell(x) = \min \Bigl\{ \arg\min_{j \in \llbracket 1, k \rrbracket} \|x - c_j\|^2 \Bigr\}. \]
The corresponding vector quantization function is $f(x) = c_{\ell(x)}$, $x \in H$.

To study the quality of the quantization of a random variable $X \in H$ in terms of conditional probability distributions, we introduce another random variable $Y \in \mathbb{R}^{\mathbb{N}}$ and consider on some probability space $\Omega$ a realization $(X, Y) : \Omega \to H \times \mathbb{R}^{\mathbb{N}}$ of the couple of random variables $(X, Y)$. We can for instance take $\Omega = H \times \mathbb{R}^{\mathbb{N}}$ and let $(X, Y)$ be the identity. Introduce now a probability measure $P \in \mathcal{M}(\Omega)$ such that $P_X$ is the law of $X$ and such that $P_{Y \mid X}$ is the law of the independent sequence $\langle X, e_i \rangle + \sigma \varepsilon_i$, $i \in \mathbb{N}$, where $(\varepsilon_i, i \in \mathbb{N})$ is an i.i.d. sequence of standard normal random variables, where $(e_i, i \in \mathbb{N})$ is an orthonormal basis of $H$ and where $\sigma > 0$ is a standard deviation parameter. In other words, let
\[ P_{Y \mid X} = \bigotimes_{i \in \mathbb{N}} \mathcal{N}\bigl( \langle X, e_i \rangle, \sigma^2 \bigr). \]
Let us also introduce a probability measure $Q^{(c)} \in \mathcal{M}(\Omega)$ such that $Q^{(c)}_X = P_X$ and
\[ Q^{(c)}_{Y \mid X} = \bigotimes_{i \in \mathbb{N}} \mathcal{N}\bigl( \langle c_{\ell(X)}, e_i \rangle, \sigma^2 \bigr), \]
where $\ell$ is defined from $c$ as explained above. In other words, $Q^{(c)}_{Y \mid X}$ is the distribution of the random sequence $\langle c_{\ell(X)}, e_i \rangle + \sigma \varepsilon_i$, $i \in \mathbb{N}$. We see that $Q^{(c)}_{Y \mid X}$ is a quantization of $P_{Y \mid X}$ that takes $k$ values, in the same way as $f(X) = c_{\ell(X)}$ is a quantization of $X$ itself.

PROPOSITION 2. The $k$-means criterion $R$ can be expressed as

(2) \[ R(c) = 2\sigma^2 \, P_X \bigl[ \mathcal{K}\bigl( Q^{(c)}_{Y \mid X}, P_{Y \mid X} \bigr) \bigr] = 2\sigma^2 \, \mathcal{K}\bigl( Q^{(c)}_{X,Y}, P_{X,Y} \bigr). \]

PROOF. The first equality comes from the fact that
\[ \mathcal{K}\bigl( Q^{(c)}_{Y \mid X}, P_{Y \mid X} \bigr) = \sum_{i \in \mathbb{N}} \mathcal{K}\bigl( \mathcal{N}( \langle c_{\ell(X)}, e_i \rangle, \sigma^2 ), \mathcal{N}( \langle X, e_i \rangle, \sigma^2 ) \bigr) = \sum_{i \in \mathbb{N}} \frac{1}{2\sigma^2} \bigl( \langle c_{\ell(X)}, e_i \rangle - \langle X, e_i \rangle \bigr)^2 = \frac{1}{2\sigma^2} \bigl\| X - c_{\ell(X)} \bigr\|^2 = \frac{1}{2\sigma^2} \min_{j \in \llbracket 1, k \rrbracket} \|X - c_j\|^2. \]
The second equality is a consequence of the decomposition stated in Lemma 1 on page 2, that says that
\[ \mathcal{K}\bigl( Q^{(c)}_{X,Y}, P_{X,Y} \bigr) = \mathcal{K}\bigl( Q^{(c)}_X, P_X \bigr) + Q^{(c)}_X \bigl[ \mathcal{K}\bigl( Q^{(c)}_{Y \mid X}, P_{Y \mid X} \bigr) \bigr], \]
and of the fact that $Q^{(c)}_X = P_X$ by definition of $Q^{(c)}$.

The two equalities of Proposition 2 will be interesting to extend the $k$-means criterion. Let us draw the consequences of the first one first. Introduce
\[ \mu^{(c)}_j = \bigotimes_{i \in \mathbb{N}} \mathcal{N}\bigl( \langle c_j, e_i \rangle, \sigma^2 \bigr) \in \mathcal{M}(\mathbb{R}^{\mathbb{N}}), \qquad j \in \llbracket 1, k \rrbracket. \]
The first part of equation (2) can be written as
\[ R(c) = 2\sigma^2 \, P_X \bigl[ \mathcal{K}\bigl( \mu^{(c)}_{\ell(X)}, P_{Y \mid X} \bigr) \bigr] = 2\sigma^2 \, P_X \Bigl[ \min_{j \in \llbracket 1, k \rrbracket} \mathcal{K}\bigl( \mu^{(c)}_j, P_{Y \mid X} \bigr) \Bigr], \]
since $2\sigma^2 \, \mathcal{K}\bigl( \mu^{(c)}_j, P_{Y \mid X} \bigr) = \|X - c_j\|^2$. Note that we could also have used $2\sigma^2 \, \mathcal{K}\bigl( P_{Y \mid X}, \mu^{(c)}_j \bigr) = \|X - c_j\|^2$, since the Kullback divergence between Gaussian measures with the same covariance is symmetric. This would have led to another interpretation of the $k$-means criterion in a space of conditional probability measures. The choice we made is quite unusual, but is justified by the following property.

PROPOSITION 3. The minimization of $R(c)$ seen as a function of $\mu^{(c)}$ can be extended to the larger set $\mathcal{M}(\mathbb{R}^{\mathbb{N}})^k$. In other words, if we put
\[ \widetilde{R}(\mu) = 2\sigma^2 \, P_X \Bigl[ \min_{j \in \llbracket 1, k \rrbracket} \mathcal{K}\bigl( \mu_j, P_{Y \mid X} \bigr) \Bigr], \qquad \mu \in \mathcal{M}(\mathbb{R}^{\mathbb{N}})^k, \]
we can see that $R(c) = \widetilde{R}(\mu^{(c)})$ and the minimum of $\widetilde{R}$ coincides with the minimum of $R$ in the sense that
\[ \inf_{c \in H^k} R(c) = \inf_{c \in H^k} \widetilde{R}(\mu^{(c)}) = \inf_{\mu \in \mathcal{M}(\mathbb{R}^{\mathbb{N}})^k} \widetilde{R}(\mu). \]

PROOF. Given $\mu \in \mathcal{M}(\mathbb{R}^{\mathbb{N}})^k$, we have to find $c \in H^k$ such that $\widetilde{R}(\mu) \ge R(c)$. This will prove that
\[ \inf_{c \in H^k} R(c) = \inf_{c \in H^k} \widetilde{R}(\mu^{(c)}) \le \inf_{\mu \in \mathcal{M}(\mathbb{R}^{\mathbb{N}})^k} \widetilde{R}(\mu), \]
and since the reverse inequality is obvious from the fact that we take the infimum on a larger set, this will prove the proposition.

Consider then $\ell(X) = \min \bigl\{ \arg\min_{j \in \llbracket 1, k \rrbracket} \mathcal{K}\bigl( \mu_j, P_{Y \mid X} \bigr) \bigr\}$ and $c_j = P_{X \mid \ell(X) = j}(X)$, $j \in \llbracket 1, k \rrbracket$. It is easy to check that the centers $c_j$ are such that
\[ \frac{\mathrm{d}\mu^{(c)}_j}{\mathrm{d}\mu^{(0)}_j} = Z_j^{-1} \exp \biggl\{ P_{X \mid \ell(X) = j} \biggl[ \log \biggl( \frac{\mathrm{d}P_{Y \mid X}}{\mathrm{d}\mu^{(0)}_j} \biggr) \biggr] \biggr\}, \]
where $Z_j$ is a normalizing constant and $\mu^{(0)}$ is $\mu^{(c)}$ with $c = 0 \in H^k$, the centered Gaussian measure. Indeed, Gaussian measures with the same covariance form an exponential family indexed by their means. Taking the arithmetic mean of the parameter in an exponential family results in taking the geometric mean of the probability measures. Thus, $\mu^{(c)}_j$ is the geometric mean of $P_{Y \mid X}$ with weights $P_{X \mid \ell(X) = j}$. As a consequence, for any $j \in \llbracket 1, k \rrbracket$,
\[ P_{X \mid \ell(X) = j} \bigl[ \mathcal{K}\bigl( \mu_{\ell(X)}, P_{Y \mid X} \bigr) \bigr] = P_{X \mid \ell(X) = j} \biggl\{ \mu_j \biggl[ \log \biggl( \frac{\mathrm{d}\mu_j}{\mathrm{d}P_{Y \mid X}} \biggr) \biggr] \biggr\} = \mu_j \biggl[ \log \biggl( \frac{\mathrm{d}\mu_j}{\mathrm{d}\mu^{(0)}_j} \biggr) \biggr] - \mu_j \biggl\{ P_{X \mid \ell(X) = j} \biggl[ \log \biggl( \frac{\mathrm{d}P_{Y \mid X}}{\mathrm{d}\mu^{(0)}_j} \biggr) \biggr] \biggr\} = \mu_j \biggl[ \log \biggl( \frac{\mathrm{d}\mu_j}{\mathrm{d}\mu^{(0)}_j} \biggr) \biggr] - \mu_j \biggl[ \log \biggl( \frac{\mathrm{d}\mu^{(c)}_j}{\mathrm{d}\mu^{(0)}_j} \biggr) \biggr] - \log(Z_j) = \mu_j \biggl[ \log \biggl( \frac{\mathrm{d}\mu_j}{\mathrm{d}\mu^{(c)}_j} \biggr) \biggr] - \log(Z_j). \]
Moreover, considering the case when $\mu = \mu^{(c)}$, we see that
\[ P_{X \mid \ell(X) = j} \bigl[ \mathcal{K}\bigl( \mu^{(c)}_{\ell(X)}, P_{Y \mid X} \bigr) \bigr] = - \log(Z_j). \]
Therefore
\[ P_X \Bigl[ \min_{j \in \llbracket 1, k \rrbracket} \mathcal{K}\bigl( \mu_j, P_{Y \mid X} \bigr) \Bigr] = P_X \bigl[ \mathcal{K}\bigl( \mu_{\ell(X)}, P_{Y \mid X} \bigr) \bigr] = P_{\ell(X)} P_{X \mid \ell(X)} \bigl[ \mathcal{K}\bigl( \mu_{\ell(X)}, P_{Y \mid X} \bigr) \bigr] = P_X \bigl[ \mathcal{K}\bigl( \mu_{\ell(X)}, \mu^{(c)}_{\ell(X)} \bigr) \bigr] + P_X \bigl[ \mathcal{K}\bigl( \mu^{(c)}_{\ell(X)}, P_{Y \mid X} \bigr) \bigr] \ge P_X \bigl[ \mathcal{K}\bigl( \mu^{(c)}_{\ell(X)}, P_{Y \mid X} \bigr) \bigr] \ge P_X \Bigl[ \min_{j \in \llbracket 1, k \rrbracket} \mathcal{K}\bigl( \mu^{(c)}_j, P_{Y \mid X} \bigr) \Bigr], \]
showing that $\widetilde{R}(\mu) \ge \widetilde{R}\bigl( \mu^{(c)} \bigr)$.

So Proposition 3 shows that the $k$-means algorithm also solves a quantization problem for Gaussian conditional probability measures $P_{Y \mid X}$. This is an invitation to study more generally the quantization problem for conditional probability measures, using what we will call the information $k$-means criterion $\widetilde{R}(\mu)$. This is what will be done in section 4 on page 8.

Let us now come back to the second equality of equation (2) on page 4. It relates the minimization of the $k$-means criterion to the estimation of the joint probability measure $P_{X,Y}$. Instead of considering the single distribution $Q^{(c)}_{X,Y}$ we can optimize the value of $Q^{(c)}_X$, considering the model
\[ \mathcal{Q}^{(c)} = \bigl\{ Q \in \mathcal{M}(\Omega) : Q_{Y \mid X} = \mu^{(c)}_{\ell(X)} \bigr\} \ni Q^{(c)}_{X,Y}. \]
In order to get a better approximation of $P_{X,Y}$, it is natural to consider, instead of $R(c)$, the criterion
\[ C(c) = 2\sigma^2 \inf_{Q \in \mathcal{Q}^{(c)}} \mathcal{K}\bigl( Q_{X,Y}, P_{X,Y} \bigr) \le R(c) = 2\sigma^2 \, \mathcal{K}\bigl( Q^{(c)}_{X,Y}, P_{X,Y} \bigr). \]
It turns out that this infimum can be computed.
PROPOSITION 4. Consider the classification function

(3) \[ \ell_c(x) = \min \Bigl\{ \arg\min_{j \in \llbracket 1, k \rrbracket} \|x - c_j\|^2 \Bigr\}. \]

The above criterion is equal to
\[ C(c) = 2\sigma^2 \inf_{Q \in \mathcal{Q}^{(c)}} \mathcal{K}\bigl( Q_{X,Y}, P_{X,Y} \bigr) = -2\sigma^2 \log P_X \Bigl\{ \exp\bigl[ -\mathcal{K}\bigl( \mu^{(c)}_{\ell_c(X)}, P_{Y \mid X} \bigr) \bigr] \Bigr\} = -2\sigma^2 \log P_X \Bigl\{ \exp\Bigl[ -\frac{1}{2\sigma^2} \bigl\| X - c_{\ell_c(X)} \bigr\|^2 \Bigr] \Bigr\} = -2\sigma^2 \log P_X \Bigl\{ \exp\Bigl[ -\frac{1}{2\sigma^2} \min_{j \in \llbracket 1, k \rrbracket} \bigl\| X - c_j \bigr\|^2 \Bigr] \Bigr\}. \]

PROOF. For any $Q \in \mathcal{Q}^{(c)}$, use the decomposition stated in Lemma 1 on page 2 to obtain that

(4) \[ \mathcal{K}\bigl( Q_{X,Y}, P_{X,Y} \bigr) = \mathcal{K}\bigl( Q_X, P_X \bigr) + Q_X \bigl[ \mathcal{K}\bigl( Q_{Y \mid X}, P_{Y \mid X} \bigr) \bigr] = \mathcal{K}\bigl( Q_X, P_X \bigr) + Q_X \bigl[ \mathcal{K}\bigl( \mu^{(c)}_{\ell_c(X)}, P_{Y \mid X} \bigr) \bigr]. \]

Minimizing this last expression with respect to $Q_X \in \mathcal{M}(H)$ according to forthcoming Lemma 7 on page 10 gives the first equality of the proposition, the others being obvious.

The criterion $C(c)$ is not a risk function in the sense that it is not the expectation of a loss function, but it is closely related to one. Indeed we can introduce

(5) \[ \check{R}(c) = 2\sigma^2 \Bigl[ 1 - \exp\Bigl( -\frac{1}{2\sigma^2} C(c) \Bigr) \Bigr] \le C(c) \le R(c), \]

that is equal to
\[ \check{R}(c) = 2\sigma^2 \, P_X \Bigl[ 1 - \exp\Bigl( -\frac{1}{2\sigma^2} \min_{j \in \llbracket 1, k \rrbracket} \|X - c_j\|^2 \Bigr) \Bigr] \]
according to the previous proposition. We see that the risk $\check{R}$ is a natural modification of the risk $R$ when we relate $R$ to the estimation of $P_{X,Y}$. This new risk $\check{R}$ is smaller, meaning that it should be easier to minimize, and indeed, as it is the expectation of a bounded loss function, we will get a generalization bound under weaker hypotheses than what we will ask for $R$. More specifically, we will no longer assume that the sample is bounded.
3. Study of the robust quadratic $k$-means criterion. We can find a local minimum of the usual quadratic $k$-means criterion using Lloyd's algorithm, that updates the centers and the classification function alternately. In this section, we will describe a similar algorithm for the robust criterion of equation (5). According to this equation, $\check{R}(c)$ is an increasing function of $C(c)$, so that we can as well study the minimization of $C(c)$. The discussion will also cover the minimization of the corresponding empirical criteria, replacing the law of $X$, $P_X$, by the empirical measure $\overline{P}_X$.

According to the decomposition (4),
\[ \frac{1}{2\sigma^2} C(c) = \inf_{Q_X \in \mathcal{M}(H)} \mathcal{K}\bigl( Q_X, P_X \bigr) + Q_X \Bigl( \frac{1}{2\sigma^2} \bigl\| X - c_{\ell_c(X)} \bigr\|^2 \Bigr). \]
Moreover the infimum in $Q_X$ is reached at $Q^*_X \ll P_X$ defined by its density
\[ \frac{\mathrm{d}Q^*_X}{\mathrm{d}P_X} = Z^{-1} \exp\Bigl( -\frac{1}{2\sigma^2} \bigl\| X - c_{\ell_c(X)} \bigr\|^2 \Bigr). \]
This proves

PROPOSITION 5 (robust $k$-means criterion). For any $c \in H^k$, consider the updated centers $c' \in H^k$ defined as
\[ c'_j = Q^*_{X \mid \ell_c(X) = j}(X) = \frac{ P_{X \mid \ell_c(X) = j}\Bigl[ X \exp\Bigl( -\tfrac{1}{2\sigma^2} \|X - c_j\|^2 \Bigr) \Bigr] }{ P_{X \mid \ell_c(X) = j}\Bigl[ \exp\Bigl( -\tfrac{1}{2\sigma^2} \|X - c_j\|^2 \Bigr) \Bigr] }, \]
where $\ell_c$ is defined by equation (3) on page 6. Then
\[ C(c') \le C(c) - Q^*_X\bigl( \| c_{\ell_c(X)} - c'_{\ell_c(X)} \|^2 \bigr) \le C(c). \]
Accordingly $\check{R}(c') \le \check{R}(c)$.

So the update of the classification function is the same as in the usual case, and the update of the centers performs a conditional mean with exponential weights instead of the conditional mean used in the original Lloyd's algorithm.
PROOF. We can see that
\[ \frac{1}{2\sigma^2} C(c) = \mathcal{K}\bigl( Q^*_X, P_X \bigr) + Q^*_X\Bigl( \frac{1}{2\sigma^2} \bigl\| X - c_{\ell_c(X)} \bigr\|^2 \Bigr) = \mathcal{K}\bigl( Q^*_X, P_X \bigr) + Q^*_X\Bigl( \frac{1}{2\sigma^2} \bigl\| X - c'_{\ell_c(X)} \bigr\|^2 \Bigr) + Q^*_X\Bigl( \frac{1}{2\sigma^2} \bigl\| c_{\ell_c(X)} - c'_{\ell_c(X)} \bigr\|^2 \Bigr) \ge \mathcal{K}\bigl( Q^*_X, P_X \bigr) + Q^*_X\Bigl( \frac{1}{2\sigma^2} \bigl\| X - c'_{\ell_{c'}(X)} \bigr\|^2 \Bigr) + Q^*_X\Bigl( \frac{1}{2\sigma^2} \bigl\| c_{\ell_c(X)} - c'_{\ell_c(X)} \bigr\|^2 \Bigr) \ge \frac{1}{2\sigma^2} C(c') + Q^*_X\Bigl( \frac{1}{2\sigma^2} \bigl\| c_{\ell_c(X)} - c'_{\ell_c(X)} \bigr\|^2 \Bigr), \]
keeping in mind that $Q^*_X$ depends on $c$.
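The proposition above suggests the following Lloyd-type iteration for the empirical robust criterion: the classification step is the usual nearest-center assignment, while the center update is a conditional mean with exponential weights $\exp\bigl(-\|X_i - c_j\|^2/(2\sigma^2)\bigr)$. The sketch below is only an illustration under our own naming, initialization and stopping choices, not the authors' implementation.

```python
import numpy as np

def robust_lloyd(X, k, sigma, n_iter=50, seed=0):
    """Lloyd-type iteration for the robust quadratic k-means criterion."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # ||X_i - c_j||^2
        labels = sq.argmin(axis=1)                                       # classification ell_c
        for j in range(k):
            cell = labels == j
            if not cell.any():
                continue                                                 # leave empty cells unchanged
            w = np.exp(-sq[cell, j] / (2.0 * sigma ** 2))                # exponential weights
            centers[j] = (w[:, None] * X[cell]).sum(axis=0) / w.sum()    # weighted conditional mean
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(200, 2)) for m in (-4.0, 0.0, 4.0)])
centers, labels = robust_lloyd(X, k=3, sigma=2.0)
print(np.round(centers, 2))
```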
4. Study of the information $k$-means criterion. In this section, we will study the information $k$-means criterion $\widetilde{R}(\mu)$ of Proposition 3 on page 5 for more general models of regular conditional probability measures $P_{Y \mid X}$.

Consider a couple of random variables $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are complete separable metric spaces, so that we can define regular conditional probability measures. Suppose there exists a reference measure $\nu \in \mathcal{M}(\mathcal{Y})$ such that $P\bigl( P_{Y \mid X} \ll \nu \bigr) = 1$. Define $p_X = \frac{\mathrm{d}P_{Y \mid X}}{\mathrm{d}\nu}$. We are interested in the case where $P_{Y \mid X}$ is known, therefore providing a bag of words model. This means that each random sample $X$ is described by a random probability measure $P_{Y \mid X}$. In the original bag of words model, $\mathcal{Y}$ is a set of words, and $P_{Y \mid X}$ is the distribution of words in a text $X$ drawn at random from some corpus of texts. Here we include the case where $\mathcal{X}$ and $\mathcal{Y}$ can be more general measurable spaces.

We introduce the following generalization of the criterion $\widetilde{R}$ of Proposition 3 on page 5, that we will name the information $k$-means criterion:
\[ \inf_{q \in (L^1_+(\nu))^k} P_X \Bigl( \min_{j \in \llbracket 1, k \rrbracket} K(q_j, p_X) \Bigr), \]
where $\llbracket 1, k \rrbracket = \{1, \dots, k\}$, $L^1_+(\nu) = \bigl\{ q \in L^1(\nu) : q \ge 0, \int q \, \mathrm{d}\nu = 1 \bigr\}$ and
\[ K(q_j, p_X) = \begin{cases} \displaystyle\int q_j \log\bigl( q_j / p_X \bigr) \, \mathrm{d}\nu, & \displaystyle\int q_j \, \mathbf{1}\bigl( p_X = 0 \bigr) \, \mathrm{d}\nu = 0, \\ +\infty, & \text{otherwise} \end{cases} \]
is the Kullback divergence between densities. The purpose of this section is to discuss the general properties of the information $k$-means problem and to build a mathematical framework and algorithms to perform the minimization. As we have seen in the previous section, we chose to study this algorithm rather than the better known $k$-means divergence algorithm
\[ \inf_{q \in (L^1_+(\nu))^k} P_X \Bigl( \min_{j \in \llbracket 1, k \rrbracket} K\bigl( p_X, q_j \bigr) \Bigr) \]
because of Proposition 3 on page 5, showing that our proposal contains the classical Euclidean $k$-means as a special case. More generally, using the divergence in the way we do when the conditional probability measures $P_{Y \mid X}$ belong to an exponential family ensures that the optimal centers for a given classification function $\ell$ belong to that same exponential family.

We should point out that clustering histograms or more generally probability distributions based on the Kullback divergence or other information criteria is not a new subject. It has been extensively used in text categorization and image indexing, especially in word clustering to extract features or reduce the original space dimension, see [31], [33], [32], [14], [8], [35], and [20]. The clustering is essentially performed using the aforementioned $k$-means divergence algorithm. However, in the information $k$-means framework we follow a different route, since the grouping step is done by minimizing the Kullback divergence with respect to its first argument instead of its second one. This leads to very different centroids, computed as geometric means of distributions instead of arithmetic means, see [5] and [34]. This follows from the fact that the Kullback divergence is asymmetric. Nevertheless, symmetric extensions of the Kullback divergence built upon averaged symmetrizations have been studied. In particular, centroids and $k$-means type algorithms derived from symmetrized divergence functions are analyzed in [34], [27], [30] and [28].

Besides, following the set-up provided by the typical $k$-means divergence, [3] presents a general $k$-means framework based on the Bregman divergence.
The authors show that suchcriteria can be minimized iteratively using a k -means centroid-based algorithm. The Bregmandistance encompasses many traditional similarity measures such as the Euclidean distance,the Kullback divergence, the logistic loss and many others. However, in the Kullback case,the minimization is performed with respect to the second argument, and not the first as in ourproposal. Nevertheless, the study of a symmetrized version of the Bregman divergence, andespecially the resulting centroids coming from it, is undertaken in [29].Our contribution in this paper is to provide a mathematical framework for the information k -means criterion. In particular, we will prove generalization bounds and deal with the infinitedimension case.Let us state some version of the Bayes rule that will be useful in the following discussion.L EMMA Let P X,Y be a joint distribution defined on the product of two Polish spaces.The following statements are equivalent: There exists a measure µ such that P Y | X ≪ µ , P X almost surely; P Y | X ≪ P Y , P X almost surely; P X,Y ≪ P X ⊗ P Y ; P X | Y ≪ P X , P Y almost surely.Moreover, they imply the following identities between Radon–Nikodym derivatives: d P X,Y d (cid:0) P X ⊗ P Y (cid:1) = d P Y | X d P Y = d P X | Y d P X . P ROOF . To prove that implies , it is sufficient to show that P Y | X (cid:16) d P Y d µ = 0 (cid:17) = 0 , P X almost surely. But when is true P Y | X d P Y d µ = 0 ! = Z (cid:18) d P Y d µ = 0 (cid:19) d P Y | X d µ d µ. Thus by the Tonelli-Fubini theorem P X P Y | X d P Y d µ = 0 !! = P X Z (cid:18) d P Y d µ = 0 (cid:19) d P Y | X d µ d µ ! = Z (cid:18) d P Y d µ = 0 (cid:19) P X (cid:20) d P Y | X d µ (cid:21) d µ = Z (cid:18) d P Y d µ = 0 (cid:19) d P Y d µ d µ = 0 . Therefore P Y | X (cid:16) d P Y d µ = 0 (cid:17) = 0 , P X almost surely. Obviously implies with µ = P Y .Now let us show that implies Let f be a bounded measurable function, we have byFubini’s theorem Z f d P X,Y = Z (cid:18)Z f d P Y | X (cid:19) d P X = Z (cid:18)Z f d P Y | X d P Y d P Y (cid:19) d P X = Z f d P Y | X d P Y d (cid:0) P Y ⊗ d P X (cid:1) , implying and that P X almost surely d P Y | X d P Y = d P X,Y d (cid:0) P X ⊗ P Y (cid:1) . We will show now that implies Let f be a bounded measurable function, we have byFubini’s theorem Z f d P X,Y = Z f d P X,Y d (cid:0) P X ⊗ P Y (cid:1) d (cid:0) P X ⊗ d P Y (cid:1) = Z (cid:18)Z f d P X,Y d (cid:0) P X ⊗ P Y (cid:1) d P Y (cid:19) d P X = Z (cid:18)Z f d P Y | X (cid:19) d P X , showing that P X almost surely P Y | X ≪ P Y and d P Y | X d P Y = d P X,Y d (cid:0) P X ⊗ P Y (cid:1) . The equivalence between and is immediate by interchanging the roles of X and Y .The following lemma will be useful to optimize the information k -means criterion and isrelated to the Donsker Varadhan representation. EW BOUNDS FOR K -MEANS L EMMA Let π ∈ M (Ω) be a probability measure on the measurable space Ω . Let h : Ω → R ∪ { + ∞} be a measurable function such that Z = Z exp( − h ) d π < ∞ . Let π exp( − h ) be the probability measure whose density with respect to π is proportional to exp( − h ) so that d π exp( − h ) d π = exp( − h ) Z .
The identity inf η ∈ Z (cid:18) K ( ρ, π ) + Z max { h, η } d ρ (cid:19) = − log Z exp( − h ) d π ! + K ( ρ, π exp( − h ) ) ∈ R ∪ { + ∞} is satisfied for any ρ ∈ M (Ω) and implies that inf ρ ∈ M (Ω) inf η ∈ Z K ( ρ, π ) + Z max { h, η } d ρ ! = − log Z exp( − h ) d π ! , the minimum being reached when ρ = π exp( − h ) . Note that the lemma could also be written as K ( ρ, π ) + Z h d ρ = − log (cid:18)Z exp( − h ) d π (cid:19) + K (cid:0) ρ, π exp( − h ) (cid:1) if we are willing to follow the convention that Z h d ρ = inf η ∈ Z Z max { h, η } d ρ and that + ∞ − ∞ = + ∞ .P ROOF . See [9, page 159]. Note that the role of η ∈ Z in this lemma is only to makesure that the integrals are always well defined in R ∪ { + ∞} in the sense that the negativepart of the integrand is integrable. When ρ is not absolutely continuous with respect to π ,it is also not absolutely continuous with respect to π exp( − h ) since π ( A ) = 0 if and only if π exp( − h ) ( A ) = 0 . In this case K ( ρ, π ) = K ( ρ, π exp( − h ) ) = + ∞ and the identity is true, bothsides being equal to + ∞ . When ρ ≪ π , then ρ ≪ π exp( − max { h,η } ) and d ρ d π exp( − max { h,η } ) = Z η exp(max { h, η } ) d ρ d π , where Z η = Z exp( − max { h, η } ) d π < + ∞ . Therefore K (cid:0) ρ, π exp( − max { h,η } ) (cid:1) = log (cid:0) Z η (cid:1) + Z h max { h, η } + log (cid:16) d ρ d π (cid:17)i d ρ. By the monotone convergence theorem lim η →−∞ Z η = Z and lim η →−∞ Z h max { h, η } + log (cid:16) d ρ d π (cid:17)i d ρ = Z h h + log (cid:16) d ρ d π (cid:17)i d ρ, since we know that Z h log( Z )+ h + log (cid:16) d ρ d π (cid:17)i − d ρ = Z log (cid:16) d ρ d π exp( − h ) (cid:17) − d ρ d π exp( − h ) d π exp( − h ) ≤ exp( − < + ∞ and therefore that Z h h + log (cid:16) d ρ d π (cid:17)i − d ρ < + ∞ . This proves that lim η →−∞ K (cid:0) ρ, π exp( − max { h,η } ) (cid:1) = log( Z ) + Z h h + log (cid:16) d ρ d π (cid:17)i d ρ = K (cid:0) ρ, π exp( − h ) (cid:1) = log( Z ) + inf η ∈ Z Z h max { h, η } + log (cid:16) d ρ d π (cid:17)i d ρ = log( Z ) + inf η ∈ Z (cid:18)Z max { h, η } d ρ + Z log (cid:16) d ρ d π (cid:17) d ρ (cid:19) = log( Z ) + inf η ∈ Z (cid:18)Z max { h, η } d ρ + K ( ρ, π ) (cid:19) , and therefore that K (cid:0) ρ, π exp( − h ) (cid:1) − log( Z ) = inf η ∈ Z (cid:18) K ( ρ, π ) + Z max { h, η } d ρ (cid:19) as stated in the lemma. The second statement of the lemma is a consequence of the fact thatthe Kullback divergence is non negative.Let us now formulate a precise definition of the geometric mean of conditional probabilitymeasures and show that it is their optimal center according to the information projectioncriterion.L EMMA Let P X,Y be a joint distribution defined on the product of two Polish spaces.Assume that P X (cid:0) P Y | X ≪ P Y (cid:1) = 1 . Consider the normalizing constant Z = P Y (cid:18) exp (cid:2) − K (cid:0) P X , P X | Y (cid:1)(cid:3)(cid:19) = P Y exp (cid:26) P X (cid:20) log (cid:18) d P Y | X d P Y (cid:19)(cid:21)(cid:27)! . Obviously, Z ∈ [0 , . If Z = 0 , then inf Q Y ∈ M ( Y ) P X (cid:2) K ( Q Y , P Y | X ) (cid:3) = + ∞ . Otherwise,
Z > and for any Q Y ∈ M ( Y ) , P X (cid:2) K (cid:0) Q Y , P Y | X (cid:1)(cid:3) = K (cid:0) Q Y , Q ⋆Y (cid:1) + P X (cid:2) K (cid:0) Q ⋆Y , P Y | X (cid:1)(cid:3) = K (cid:0) Q Y , Q ⋆Y (cid:1) + log (cid:0) Z − (cid:1) , where Q ⋆Y ≪ P Y is defined by the relation d Q ⋆Y d P Y = Z − exp (cid:2) − K (cid:0) P X , P X | Y (cid:1)(cid:3) = Z − exp ( P X " log d P Y | X d P Y ! . (6) EW BOUNDS FOR K -MEANS Consequently inf Q Y ∈ M ( Y ) P X (cid:2) K ( Q Y , P Y | X ) (cid:3) = P X (cid:2) K ( Q ⋆Y , P Y | X ) (cid:3) = log (cid:0) Z − (cid:1) < ∞ , The probability measure Q ⋆Y represents the geometric mean of P Y | X with respect to P X . P ROOF . By Lemma 1 on page 2,(7) P X (cid:2) K (cid:0) Q Y , P Y | X (cid:1)(cid:3) = K (cid:0) P X ⊗ Q Y , P X, Y (cid:1)(cid:3) = K (cid:0) Q Y , P Y (cid:1) + Q Y (cid:2) K (cid:0) P X , P X | Y (cid:1)(cid:3) . Thus, when (7) is finite, Q Y ≪ P Y and Q Y h K (cid:0) P X , P X | Y (cid:1) < + ∞ (cid:1)i = 1 , so that P Y h K (cid:0) P X , P X | Y (cid:1) < + ∞ (cid:1)i > , implying that Z > . Assuming from now on that (7) is finite, introduce A = n y : K (cid:0) P X , P X | Y = y (cid:1) < + ∞ o . From Lemma 7 on page 10 and (7), for any Q Y ∈ M ( A ) , P X (cid:2) K (cid:0) Q Y , P Y | X (cid:1)(cid:3) = − log (cid:2) P Y ( A ) (cid:3) − log P Y | Y ∈ A n exp h − K (cid:0) P X , P X | Y (cid:1)io| {z } =log (cid:0) Z − (cid:1) + K (cid:0) Q Y , Q ⋆Y (cid:1) = P X (cid:2) K (cid:0) Q ⋆Y , P Y | X (cid:1)(cid:3) + K (cid:0) Q Y , Q ⋆Y (cid:1) . Moreover, when Q Y ( A ) < , Q Y Q ⋆Y , so that both members are equal to + ∞ . The iden-tity (6) is a consequence of Lemma 6 on page 9.We are now ready to express the minimum of the information k -means criterion in differentways involving the underlying classification function and optimal centers.P ROPOSITION The information k -means problem can be expressed as inf q ∈ (cid:0) L , ( ν ) (cid:1) k P X (cid:16) min j ∈ J ,k K K ( q j , p X ) (cid:17) = inf ℓ : X J ,k K inf ( q ,...,q k ) ∈ (cid:0) L , ( ν ) (cid:1) k P X (cid:16) K ( q ℓ ( X ) , p X ) (cid:17) = inf ( q ,...,q k ) ∈ (cid:0) L , ( ν ) (cid:1) k inf ℓ : X J ,k K P X (cid:16) K ( q ℓ ( X ) , p X ) (cid:17) = inf ( q ,...,q k ) ∈ (cid:0) L , ( ν ) (cid:1) k P X (cid:16) K ( q ℓ ⋆q ( X ) , p X ) (cid:17) = inf ℓ : X J ,k K P X (cid:16) K ( q ⋆,ℓℓ ( X ) , p X ) (cid:17) = inf ℓ : X J ,k K P X (cid:16) log (cid:0) Z − ℓ ( X ) (cid:1)(cid:17) , where the infimum in ℓ is taken on measurable classification functions ℓ , where ℓ ⋆q : X J , k K is the best classification function for a fixed q = ( q , . . . , q k ) defined as ℓ ⋆q ( x ) = arg min j ∈ J ,k K K ( q j , p x ) , x ∈ X , whereas q ⋆,ℓ , . . . , q ⋆,ℓk are the best information k -means centers with respect to ℓ ( X ) definedas q ⋆,ℓj = Z − j exp n P X | ℓ ( X )= j (cid:2) log( p X ) (cid:3)o , j ∈ J , k K , where Z j = Z exp n P X | ℓ ( X )= j (cid:2) log( p X ) (cid:3)o d ν, with the convention that q ⋆,ℓj can be given any arbitrary value in the case when Z j = 0 ,the corresponding criterion being in this case infinite. Besides, we have the followingPythagorean identity P X (cid:16) K ( q ℓ ( X ) , p X ) (cid:17) = P X (cid:16) K ( q ⋆,ℓℓ ( X ) , p X ) (cid:17) + P X (cid:16) K (cid:0) q ℓ ( X ) , q ⋆,ℓℓ ( X ) (cid:1)(cid:17) . P ROOF . This proposition is a straightforward consequence of Lemma 8 on page 12 ap-plied to P X, Y | ℓ ( X )= j .It may be of some help to state the empirical counterpart of the previous proposition, whereformulas are somehow more explicit.C OROLLARY
Let X , . . . , X n be an i.i.d sample drawn from P X . Then, the empiricalversion of the information k -means problem tries to partition the observations p X , . . . , p X n into k -clusters, what is expressed here by inf q ∈ (cid:0) L , ( ν ) (cid:1) k n n X i =1 min j ∈ J ,k K K (cid:0) q j , p X i (cid:1) = inf ℓ : J ,n K → J ,k K inf q ∈ (cid:0) L , ( ν ) (cid:1) k n n X i =1 K (cid:0) q ℓ ( i ) , p X i (cid:1) = inf q ∈ (cid:0) L , ( ν ) (cid:1) k inf ℓ : J ,n K → J ,k K n n X i =1 K (cid:0) q ℓ ( i ) , p X i (cid:1) = inf q ∈ (cid:0) L , ( ν ) (cid:1) k n n X i =1 K (cid:0) q ℓ ⋆q ( i ) , p X i (cid:1) = inf ℓ : J ,n K → J ,k K n n X i =1 K (cid:0) q ⋆,ℓℓ ( i ) , p X i (cid:1) = inf ℓ : J ,n K → J ,k K k X j =1 (cid:12)(cid:12) ℓ − ( j ) (cid:12)(cid:12) n log (cid:0) Z − j (cid:1) , where ℓ ⋆q : X J , k K is the best classification function for a fixed q = ( q , . . . , q k ) defined as ℓ ⋆q ( i ) = arg min j ∈ J ,k K K ( q j , p X i ) whereas q ⋆,ℓj , j ∈ J , k K are the information k -means centers defined as q ⋆,ℓj = Z − j Y i ∈ ℓ − ( j ) p X i ! / | ℓ − ( j ) | , where Z j = Z Y i ∈ ℓ − ( j ) p X i ! / | ℓ − ( j ) | d ν. EW BOUNDS FOR K -MEANS P ROOF . Apply the previous proposition to the empirical measure P X = 1 n n X i =1 δ X i of thesample X , . . . , X n .We will now see that when the sample is in L ( ν ) and has a finite second moment, theoptimal centers for a given classification function are also in L ( ν ) , so that the optimizationof the centers can be reduced to this space.L EMMA
11. Let us assume that $P_X\bigl( \int p_X^2 \, \mathrm{d}\nu \bigr) < \infty$. Then the optimal centers $q^{\star,\ell}_j$ defined in the previous lemma verify $q^{\star,\ell}_j \in L^2(\nu)$. Furthermore, in this case
\[ \inf \Bigl\{ P_X \Bigl( \min_{j \in \llbracket 1, k \rrbracket} K(q_j, p_X) \Bigr) : q \in \bigl( L^1_+(\nu) \bigr)^k \Bigr\} = \inf \Bigl\{ P_X \Bigl( \min_{j \in \llbracket 1, k \rrbracket} K(q_j, p_X) \Bigr) : q \in \bigl( L^1_+(\nu) \cap L^2(\nu) \bigr)^k \Bigr\}. \]

PROOF. Apply Jensen's inequality and the Fubini-Tonelli theorem to obtain that $q^{\star,\ell}_j \in L^2(\nu)$. Indeed, for any $j \in \llbracket 1, k \rrbracket$, if $Z_j = 0$, we can pick any value for $q^{\star,\ell}_j$, and in particular a value in $L^2(\nu)$; in the same way, if $P_X(\ell(X) = j) = 0$, we can make an arbitrary choice for $q^{\star,\ell}_j$; otherwise, $Z_j > 0$, and
\[ \int \bigl( q^{\star,\ell}_j \bigr)^2 \, \mathrm{d}\nu = Z_j^{-2} \int \exp \Bigl\{ 2\, P_{X \mid \ell(X) = j} \bigl[ \log( p_X ) \bigr] \Bigr\} \mathrm{d}\nu \le Z_j^{-2} \, P_{X \mid \ell(X) = j} \Bigl( \int p_X^2 \, \mathrm{d}\nu \Bigr) \le Z_j^{-2} \, P_X\bigl( \ell(X) = j \bigr)^{-1} P_X \Bigl( \int p_X^2 \, \mathrm{d}\nu \Bigr) < \infty. \]
Then according to Proposition 9 on page 13,
\[ P_X \Bigl[ \min_{j \in \llbracket 1, k \rrbracket} K\bigl( q_j, p_X \bigr) \Bigr] = \inf_{\ell : \mathcal{X} \to \llbracket 1, k \rrbracket} P_X \Bigl[ K\bigl( q_{\ell(X)}, p_X \bigr) \Bigr] \ge \inf_{\ell : \mathcal{X} \to \llbracket 1, k \rrbracket} P_X \Bigl[ K\bigl( q^{\star,\ell}_{\ell(X)}, p_X \bigr) \Bigr] \ge \inf_{\ell : \mathcal{X} \to \llbracket 1, k \rrbracket} P_X \Bigl[ \min_{j \in \llbracket 1, k \rrbracket} K\bigl( q^{\star,\ell}_j, p_X \bigr) \Bigr], \]
showing that we can restrict the optimization to $q_j \in L^2(\nu)$.
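For discrete densities (histograms over a finite vocabulary, as in the bag-of-words setting discussed in this section), the empirical information $k$-means iteration described in the corollary above can be sketched as follows: observations are assigned by minimizing $K(q_j, p_{X_i})$ in its first argument, and each center is updated as the normalized geometric mean of the densities of its cluster. This is an illustrative sketch under our own naming, with a small smoothing constant added to keep all entries positive; it is not the authors' code.

```python
import numpy as np

def information_k_means(P, k, n_iter=50, eps=1e-12, seed=0):
    """Information k-means for rows of P, each row being a probability vector p_{X_i}."""
    rng = np.random.default_rng(seed)
    P = np.asarray(P, float) + eps
    P /= P.sum(axis=1, keepdims=True)
    Q = P[rng.choice(len(P), size=k, replace=False)].copy()        # initial centers q_1, ..., q_k
    for _ in range(n_iter):
        # Assignment step: ell(i) = argmin_j K(q_j, p_{X_i}) (divergence in the FIRST argument).
        div = np.array([[np.sum(q * np.log(q / p)) for p in P] for q in Q])   # shape (k, n)
        labels = div.argmin(axis=0)
        # Center step: normalized geometric mean of the densities of each cluster.
        for j in range(k):
            cell = labels == j
            if not cell.any():
                continue
            log_q = np.log(P[cell]).mean(axis=0)
            q = np.exp(log_q - log_q.max())                        # stabilize before normalizing
            Q[j] = q / q.sum()
    return Q, labels

rng = np.random.default_rng(2)
P = rng.dirichlet(alpha=np.ones(20), size=300)                     # 300 histograms on 20 words
Q, labels = information_k_means(P, k=4)
print(Q.shape, np.bincount(labels, minlength=4))
```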
5. PAC-Bayesian generalization bounds for the linear $k$-means criterion. In this section, we derive non-asymptotic generalization bounds for the linear $k$-means criterion defined hereafter.

DEFINITION 12.
Given a random vector $W$ in a separable Hilbert space $H$ and a bounded measurable set of parameters $\Theta \subset H^k$, the linear $k$-means criterion is defined as
\[ P_W \Bigl( \min_{j \in \llbracket 1, k \rrbracket} \langle \theta_j, W \rangle \Bigr), \qquad \theta \in \Theta. \]
If $W_1, \dots, W_n$ are $n$ independent copies of $W$, the empirical linear $k$-means criterion is defined by taking the expectation with respect to the empirical measure $\overline{P}_W = \frac{1}{n} \sum_{i=1}^n \delta_{W_i}$ instead of integrating with respect to $P_W$.

Using a change of representation based on the kernel trick, we will show that all the criteria we defined so far can be rewritten as linear $k$-means criteria in suitable spaces of coordinates. Consequently, our approach will be to prove a generalization bound for the linear $k$-means criterion and to study its consequences for the other criteria.

To reach a $\mathcal{O}\bigl(\sqrt{k/n}\bigr)$ speed up to logarithmic factors, we will borrow ideas from the classical chaining method used to upper bound the expected supremum of Gaussian processes (see [7]). However, we will transpose the idea of chaining into the setting of PAC-Bayesian deviation inequalities. To obtain dimension-free bounds, we will use a sequence of perturbations of the parameter by isonormal processes with a variance parameter ranging in a geometric grid. This multiscale perturbation scheme will play the same role as the $\delta$-nets in classical chaining.

Let us begin with an existence result.

PROPOSITION 13.
In the setting of Definition 12 on the preceding page, let us assumethat k W k ∞ = ess sup P W k W k < + ∞ . There is θ ∗ ∈ Θ , the weak closure of Θ , such that P W (cid:16) min j ∈ J ,k K h θ ∗ j , W i (cid:17) = inf θ ∈ Θ h P W (cid:16) min j ∈ J ,k K h θ j , W i (cid:17)i . Moreover k θ ∗ k ≤ k Θ k def = sup θ ∈ Θ k θ k . P ROOF . This is inspired by the proof of Theorem 3.2 in [16]. Let us begin with the secondstatement. Since θ θ k = sup θ ′ ∈ H k , k θ ′ k =1 h θ ′ , θ i is weakly lower semicontiuous, k Θ k ≤ k Θ k , so that in particular k θ ∗ k ≤ k Θ k . Moreover, forany w ∈ H , H k −→ R θ min j ∈ J ,k K h θ j , w i is weakly continuous, since, by definition of the weak topology of H k , θ
7→ h θ j , w i areweakly continuous, and taking a finite minimum is a continuous operation.Let ( θ n ) n ∈ N be a bounded sequence in H k , converging weakly to θ . By the dominatedconvergence theorem lim n →∞ P W (cid:16) min j ∈ J ,k K h θ n, j , W i (cid:17) = P W (cid:16) lim n →∞ min j ∈ J ,k K h θ n,j , W i (cid:17) = P W (cid:16) min j ∈ J ,k K h θ j , W i (cid:17) , since (cid:12)(cid:12) min j ∈ J ,k K h θ n,j , W i (cid:12)(cid:12) ≤k θ n k k W k ∞ . Thus R : θ P W (cid:0) min j ∈ J ,k K h θ j , W i (cid:1) is weakly continuous on Θ . But the unit ball, and therefore any ball of H k , is weakly compact,so that Θ being weakly closed and bounded is also weakly compact. Consequently, R reachesits minimum on Θ at some (non necessarily unique) point θ ∗ ∈ Θ . Therefore R ( θ ∗ ) = inf θ ∈ Θ R ( θ ) = inf θ ∈ Θ R ( θ ) , EW BOUNDS FOR K -MEANS the last equality being due to the fact that R is weakly continuous.Note that we used the weak topology, since the unit ball of H k is not strongly compactwhen the dimension of H is infinite.We will prove generalization bounds based on the following PAC-Bayesian lemma. Wewill use it as a workhorse to produce all the deviation inequalities necessary to achieve ourgoals. Combined with Jensen’s inequality, it will also produce bounds in expectation.L EMMA
Consider two measurable spaces T and W , a prior probability measure π ∈ M ( T ) defined on T , and a measurable function h : T × W → R . Let W ∈ W be arandom variable and let ( W , . . . , W n ) be a sample made of n independent copies of W . Let λ be a positive real parameter. (8) P W , ..., W n ( exp " sup ρ ∈ M ( T ) sup η ∈ N (cid:26) Z min n η, − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)io d ρ ( θ ′ ) − K ( ρ, π ) (cid:27) ≤ . Consequently, for any δ ∈ ]0 , , with probability at least − δ , (9) sup ρ ∈ M ( T ) sup η ∈ N (cid:26) Z min n η, − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)io d ρ ( θ ′ ) − K ( ρ, π ) (cid:27) ≤ log( δ − ) . Note that the role of η in this formula is to give a meaning to the integration with respectto ρ in all circumstances.P ROOF . We follow here the same arguments as in the proof of Proposition 1.7 in [19].Remark that the supremum in ρ can be restricted to the case when K ( ρ, π ) < ∞ , and recallthat in this case ρ ≪ π and K ( ρ, π ) = Z log (cid:18) d ρ d π ( θ ′ ) (cid:19) d ρ ( θ ′ ) . Note also that Z (cid:16) d ρ d π ( θ ′ ) > (cid:17) d ρ ( θ ′ ) = Z (cid:16) d ρ d π ( θ ′ ) > (cid:17) d ρ d π ( θ ′ ) d π ( θ ′ ) = Z d ρ d π ( θ ′ ) d π ( θ ′ ) = 1 . Applying Jensen’s inequality, we get exp ( sup ρ ∈ M ( T ) sup η ∈ N Z min (cid:26) η, − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)i(cid:27) d ρ ( θ ′ ) − K ( ρ, π ) ) ≤ sup η ∈ N sup ρ ∈ M ( T ) K ( ρ,π ) < ∞ Z exp ( min (cid:26) η, − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)i(cid:27)) d ρ d π ( θ ′ ) − d ρ ( θ ′ ) = sup η ∈ N sup ρ ∈ M ( T ) K ( ρ,π ) < ∞ Z exp ( min (cid:26) η, − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)i(cid:27)) (cid:18) d ρ d π ( θ ′ ) > (cid:19) d π ( θ ′ ) ≤ sup η ∈ N Z exp ( min (cid:26) η, − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)i(cid:27) d π ( θ ′ )= monotoneconvergence Z exp ( − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)i) d π ( θ ′ ) . Let us put Y ′ = sup ρ ∈ M ( T ) sup η ∈ N (cid:26) Z min n η, − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)io d ρ ( θ ′ ) − K ( ρ, π ) (cid:27) and Y = log Z exp ( − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)i) d π ( θ ′ ) . We just proved that Y ′ ≤ Y . Moreover, Y is measurable, according to Fubini’s theorem fornon-negative functions. Therefore Y is a random variable. Note that we did not prove that Y ′ itself is measurable. Remark now that P W , ... ,W n (cid:2) exp( Y ) (cid:3) = P W , ... ,W n Z exp ( − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)i) d π ( θ ′ ) , = Fubini Z P W , ... ,W n exp ( − λ n X i =1 h ( θ ′ , W i ) − n log h P W exp (cid:2) − λh ( θ ′ , W ) (cid:3)i) d π ( θ ′ )= Z (cid:18) P W h exp (cid:16) − λh ( θ ′ , W ) (cid:17)i < + ∞ (cid:19) n Y i =1 P W i (cid:2) exp (cid:0) − λh ( θ ′ , W i ) (cid:1)(cid:3) P W h exp (cid:16) − λh ( θ ′ , W ) (cid:1)(cid:3) d π ( θ ′ ) ≤ , proving the first part of the lemma. From Markov’s inequality, P (cid:0) Y ≥ log( δ − ) (cid:1) ≤ δ P W , ... ,W n (cid:2) exp( Y ) (cid:3) ≤ δ. Consequently P (cid:0) Y ≤ log( δ − ) (cid:1) ≥ − δ . 
We have proved that the non necessarily measur-able event Y ′ ≤ log( δ − ) contains the measurable event Y ≤ log( δ − ) whose probability isat least − δ .We are now ready to state and prove our generalization bounds for the linear k -meanscriterion. EW BOUNDS FOR K -MEANS L EMMA
Let W be a random vector in a separable Hilbert space H . Let ( W , . . . , W n ) be a sample made of n independent copies of W . Let Θ ⊂ H k be a bounded measurable setof parameters. Define k Θ k = sup (cid:26)(cid:18) k X j =1 k θ j k (cid:19) / : θ ∈ Θ (cid:27) < ∞ and assume that, for some real valued parameters a and b , P W (cid:16) min j ∈ J ,k K h θ j , W i ∈ [ a, b ] for any θ ∈ Θ (cid:17) = 1 . Assume also that k W k ∞ def = ess sup P W k W k < ∞ .Our first result gives an observable upper bound for the k -means criterion, provided thatthe above parameters are known or upper bounded by known quantities.For any k ≥ , any n ≥ k and any δ ∈ ]0 , , with probability at least − δ , for any θ ∈ Θ , P W (cid:16) min j ∈ J ,k K h θ j , W i (cid:17) ≤ P W (cid:16) min j ∈ J ,k K h θ j , W i (cid:17) + log( n/k )log(2) r k ) n + 2 r log( k ) n ! k Θ kk W k ∞ + vuut ( √ (cid:16) k ( b − a ) + 2 log( ek ) k W k ∞ k Θ k (cid:17) n + r log( δ − )2 n ( b − a ) , where P W = 1 n n X i =1 δ W i is the empirical measure.Our second result deals with the excess risk with respect to a non random reference pa-rameter θ ∗ ∈ Θ .If θ ∗ ∈ Θ is a non random value of the parameter, with probability at least − δ , for any θ ∈ Θ , (cid:16) P W − P W (cid:17)(cid:16) min j ∈ J ,k K h θ j , W i − min j ∈ J ,k K h θ ∗ j , W i (cid:17) ≤ log( n/k )log(2) r k ) n + 2 r log( k ) n ! k Θ kk W k ∞ + vuut ( √ (cid:16) k ( b − a ) + 2 log( ek ) k W k ∞ k Θ k (cid:17) n + r δ − ) n ( b − a ) . Our third result draws the consequences of this excess risk bound for an ε -minimizer b θ .In the case when the estimator b θ ( W , . . . , W n ) ∈ Θ is such that P W ,..., W n almost surely P (cid:16) min j ∈ J ,k K h b θ j , W i (cid:17) ≤ inf θ ∈ Θ P (cid:16) min j ∈ J ,k K h θ j , W i (cid:17) + ε, P W (cid:16) min j ∈ J ,k K h b θ j , W i (cid:17) − inf θ ∈ Θ P W (cid:16) min j ∈ J ,k K h θ j , W i (cid:17) − ε satisfies the same bound with at leastthe same probability. Moreover, the expected excess risk satisfies P W ,...,W n h P W (cid:16) min j ∈ J ,k K h b θ j , W i (cid:17) − inf θ ∈ Θ P W (cid:16) min j ∈ J ,k K h θ j , W i (cid:17)i ≤ log( n/k )log(2) r k ) n + 2 r log( k ) n ! k Θ k k W k ∞ + vuut (cid:0) √ (cid:1)(cid:16) k ( b − a ) + 2 log( ek ) k W k ∞ k Θ k (cid:17) n + ε. P ROOF . Assume without loss of generality that H = ℓ ⊂ R N . Let ρ θ ′ | θ = P θ i + β − / ε i ,i ∈ N , θ ∈ R N , be a Gaussian conditional probability distribution with values in M (cid:0) R N (cid:1) , where ε i , i ∈ N is an infinite sequence of independent standard normal random variables. When θ and θ ′ ∈ R k × N are made of k infinite sequences of real numbers, let ρ θ ′ | θ = k O j =1 ρ θ ′ j | θ j be the tensor product of the previously defined conditional probability distributions. Let W bea random vector in the separable Hilbert space ℓ ⊂ R N . Consider the measurable functions f ( θ, w ) = min j ∈ J ,k K h θ j , w i , θ ∈ R k × N , w ∈ ℓ , where the scalar product is extended beyond ℓ as follows. For any u, v ∈ R N , let us define h u, v i as h u, v i = lim s →∞ s X t =0 u t v t , when lim sup s →∞ s X t =0 u t v t = lim inf s →∞ s X t =0 u t v t ∈ R . 
, otherwise.Remark that this extension is measurable, but not bilinear.Our strategy will be to decompose the opposite of the centered empirical risk (cid:0) P W − P W (cid:1)(cid:2) f ( θ, W ) (cid:3) into(10) (cid:0) P W − P W (cid:1)(cid:2) f ( θ, W ) (cid:3) = (cid:0) P W − P W (cid:1)(cid:0) δ θ ′ | θ − ρ θ ′ | θ | {z } small perturbation (cid:1)(cid:2) f ( θ ′ , W ) (cid:3) + p X q =1 (cid:0) P W − P W (cid:1)(cid:0) ρ q − θ ′ | θ − ρ q θ ′ | θ | {z } chain of intermediate scales (cid:1)(cid:2) f ( θ ′ , W ) (cid:3) + (cid:0) P W − P W (cid:1) ρ p θ ′ | θ | {z } big perturbation (cid:2) f ( θ ′ , W ) (cid:3) , where δ θ ′ | θ is the Dirac (or identity) transition kernel and ρ q θ ′ | θ is the transition kernel ρ θ ′ | θ iterated q times.Let f ( θ, w ) = f ( θ, w ) − P W (cid:0) f ( θ, W ) (cid:1) , θ ∈ R k × N , w ∈ ℓ , EW BOUNDS FOR K -MEANS be the centered loss function.We will first apply the PAC-Bayesian inequalities of Lemma 14 on page 17 to the function h ( θ ′ , w ) = (cid:0) δ θ ′′ | θ ′ − ρ θ ′′ | θ ′ (cid:1)(cid:2) f ( θ ′′ , w ) (cid:3) = f ( θ ′ , w ) − ρ θ ′′ | θ ′ (cid:2) f ( θ ′′ , w ) (cid:3) , θ ′ ∈ R k × N , w ∈ ℓ and to the reference measure π θ ′ = ρ θ ′ | θ =0 .L EMMA
The function h satisfies (cid:0) π θ ′ ⊗ P W (cid:1)(cid:16) | h ( θ ′ , W ) | ≤ p k ) /β k W k ∞ (cid:17) = 1 , where k W k ∞ = ess sup P W k W k . P ROOF . Remark that for any w ∈ ℓ , π θ ′ almost surely, (cid:0) δ θ ′′ | θ ′ − ρ θ ′′ | θ ′ (cid:1)(cid:2) f ( θ ′′ , w ) (cid:3) = ρ θ ′′ | θ ′ (cid:16) min j h θ ′ j , w i − min j h θ ′′ j , w i (cid:17) ≤ ρ θ ′′ | θ ′ (cid:16) max j h θ ′′ j − θ ′ j , w i (cid:17) , since in this situation, the first case in the extended definition of the scalar product applieswith probability one (according to Kolmogorov’s three series theorem). Considering that un-der ρ θ ′′ | θ ′ , h θ ′′ j − θ ′ j , w i , j ∈ J , k K are k independent centered real normal random variableswith variance k w k /β and applying a classical maximal inequality for the expectation of themaximum of k standard normal variables (see section 2.5 in [7]), we get that (cid:0) δ θ ′′ | θ ′ − ρ θ ′′ | θ ′ (cid:1)(cid:2) f ( θ ′ , w ) (cid:3) ≤ p k ) /β k w k . Reasoning in a similar way for the opposite, we get − (cid:0) δ θ ′′ | θ ′ − ρ θ ′′ | θ ′ (cid:1)(cid:2) f ( θ ′ , w ) (cid:3) ≤ ρ θ ′′ | θ ′ (cid:16) max j h θ ′ j − θ ′′ j , w i (cid:17) ≤ p k ) /β k w k . The lemma follows from the definition of h .Applying Lemma 14 on page 17 to h : R k × N × ℓ → R , π = ρ θ ′ | θ =0 and restricting thesupremum in ρ ∈ M (cid:0) R k × N (cid:1) to ρ ∈ (cid:8) ρ θ ′ | θ : θ ∈ ( ℓ ) k (cid:9) , we get P W , ... ,W n ( exp sup θ ∈ ( ℓ ) k " nλ (cid:0) P W − P W (cid:1)(cid:0) ρ θ ′ | θ − ρ θ ′ | θ (cid:1) f ( θ ′ , W ) − nρ θ ′ | θ (cid:20) log (cid:18) P W h exp (cid:16) − λh (cid:0) θ ′ , W (cid:1)i(cid:19)(cid:21) − β k θ k ≤ , where we have let η go to + ∞ , using monotone convergence (since h is bounded from theprevious lemma) and where we have computed K (cid:0) ρ θ ′ | θ , π (cid:1) = k X j =1 K (cid:0) ρ θ ′ j | θ j , ρ θ ′ j | (cid:1) = k X j =1 X i ∈ N K (cid:0) N ( θ j,i , β − ) , N (0 , β − ) (cid:1) = β k X j =1 X i ∈ N θ j,i = β k θ k . Apply now Jensen’s inequality and devide by nλ to get P W , ... ,W n ( sup θ ∈ ℓ k "(cid:0) P W − P W (cid:1)(cid:0) ρ θ ′ | θ − ρ θ ′ | θ (cid:1) f ( θ ′ , W ) − λ − ρ θ ′ | θ (cid:20) log (cid:18) P W h exp (cid:16) − λh (cid:0) θ ′ , W (cid:1)i(cid:19)(cid:21) − β k θ k nλ ≤ . From Hoeffding’s inequality, since P W (cid:0) h ( θ ′ , W ) (cid:1) = 0 , π θ ′ almost surely, P W h exp (cid:16) − λh ( θ ′ , W ) (cid:17)i ≤ exp (cid:16) λ P W h ( θ ′ , W ) (cid:17) ≤ exp (cid:16) λ β log( k ) k W k ∞ (cid:17) . Considering a measurable bounded subset Θ ⊂ ( ℓ ) k , we deduce that P W , ... ,W n (cid:20) sup θ ∈ Θ (cid:0) P W − P W (cid:1)(cid:0) ρ θ ′ | θ − ρ θ ′ | θ (cid:1) f ( θ ′ , W ) (cid:21) ≤ λβ log( k ) k W k ∞ + β k Θ k nλ . In order to minimize the right-hand side, choose λ = β k Θ k p n log( k ) k W k ∞ and define(11) F = k W k ∞ k Θ k r k ) n . We get P W , ... ,W n (cid:20) sup θ ∈ Θ (cid:0) P W − P W (cid:1)(cid:0) ρ θ ′ | θ − ρ θ ′ | θ (cid:1) f ( θ ′ , W ) (cid:21) ≤ F. For any integer q , the iterated transition kernel ρ q θ ′ | θ is equal to ρ θ ′ | θ with β replaced by − q β . As F is independent of β , we therefore deduce that P W , ... ,W n (cid:26) sup θ ∈ Θ h(cid:16) P W − P W (cid:17)(cid:16) ρ q − θ ′ | θ − ρ q θ ′ | θ (cid:17) f (cid:0) θ ′ , W (cid:1)i(cid:27) ≤ F. Summing up for q = 1 to p , where p is to be chosen later, and exchanging P q and sup θ , wededuce that P W , ... ,W n (cid:26) sup θ ∈ Θ h(cid:16) P W − P W (cid:17)(cid:16) ρ θ ′ | θ − ρ p θ ′ | θ (cid:17) f (cid:0) θ ′ , W (cid:1)i(cid:27) ≤ P W , ... 
,W n (cid:26) p X q =1 sup θ ∈ Θ h(cid:16) P W − P W (cid:17)(cid:16) ρ q − θ ′ | θ − ρ q θ ′ | θ (cid:17) f (cid:0) θ ′ , W (cid:1)i(cid:27) ≤ pF. As we are interested in bounding from above (cid:0) P W − P W (cid:1) f ( θ, W ) , according to the decom-position formula (10) on page 20, there remains to upper bound (cid:0) P W − P W (cid:1)(cid:0) δ θ ′ | θ − ρ θ ′ | θ (cid:1)(cid:2) f ( θ ′ , W ) (cid:3) (12) and (cid:0) P W − P W (cid:1) ρ p θ ′ | θ (cid:2) f ( θ ′ , W ) (cid:3) , (13) or with a change of notation (cid:0) P W − P W (cid:1) ρ θ ′ | θ (cid:2) f ( θ ′ , W ) (cid:3) . (14) EW BOUNDS FOR K -MEANS An almost sure bound for (12) is provided by Lemma 16 on page 21, since (12) is equal to P W (cid:2) h ( θ, W ) (cid:3) . To bound (14), introduce the influence function(15) ψ ( x ) = ( log (cid:0) x + x / (cid:1) , x ≥ , − log(1 − x + x / (cid:1) , x ≤ and put e f ( θ, W ) = f ( θ, W ) − a + b . The function ψ is chosen to be symmetric and to satisfy(16) ψ ( x ) ≤ log (cid:0) x + x / (cid:1) , x ∈ R , since we can check that log (cid:0) x + x / (cid:1) + log (cid:0) − x + x / (cid:1) = log (cid:2)(cid:0) x / (cid:1) − x (cid:3) = log (cid:0) x / (cid:1) ≥ . Decompose (14) into (cid:16) P W − P W (cid:17) ρ θ ′ | θ f ( θ ′ , W ) = ρ θ ′ | θ (cid:16) P W − P W (cid:17) e f ( θ ′ , W )= ρ θ ′ | θ h P W e f ( θ ′ , W ) − P W (cid:16) λ − ψ (cid:2) λ e f ( θ ′ , W ) (cid:3)(cid:17)i (17) + ρ θ ′ | θ P W h λ − ψ (cid:2) λ e f ( θ ′ , W ) (cid:3) − e f ( θ ′ , W ) i . (18)In order to bound (18), note that from lemma 7.2 in [10] (cid:12)(cid:12) x − ψ ( x ) (cid:12)(cid:12) ≤ x √ , x ∈ R . (19)Therefore, from the inequalities ( a + b ) ≤ a + 2 b and min j a j − min j b j ≤ max j ( a j − b j ) ,so that (min j a j − min j b j ) ≤ max j ( a j − b j ) , for any θ ∈ (cid:0) ℓ (cid:1) k , P W almost surely, ρ θ ′ | θ h λ − ψ (cid:2) λ e f ( θ ′ , W ) (cid:3) − e f ( θ ′ , W ) i ≤ λ √ ρ θ ′ | θ (cid:2) e f ( θ ′ , W ) (cid:3) ≤ λ √ h(cid:0) min j h θ j , W i − ( a + b ) / (cid:1) + ρ θ ′ | θ (cid:16) max j h θ ′ j − θ j , W i (cid:17)i . At this point, it remains to bound the variance term ρ θ ′ | θ (cid:16) max j h θ ′ j − θ j , W i (cid:17) . Let us remarkthat ρ θ ′ | θ ◦ (cid:0) θ ′
7→ h θ ′ j − θ j , W i kj =1 (cid:1) − = N (cid:16) , k W k /β (cid:17) ⊗ k . In other words, under ρ θ ′ | θ , the sequence (cid:0) h θ ′ j − θ j , W i , ≤ j ≤ k (cid:1) is made of k independentcentered normal random variables with variance k W k /β . Therefore, we need the followingmaximal inequality.L EMMA
LEMMA 17. Let \((\varepsilon_1, \dots, \varepsilon_k)\) be a sequence of Gaussian random variables such that \(\varepsilon_j \sim \mathcal{N}(0, \sigma^2)\). We have
\[
\mathbb{E}\Big(\max_{1 \le j \le k} \varepsilon_j^2\Big) \le 2\sigma^2 \log(ek).
\]

PROOF.
\[
\mathbb{E}\Big(\max_{1\le j\le k}\varepsilon_j^2\Big)
= \int_{\mathbb{R}_+} \mathbb{P}\Big(\max_{1\le j\le k}\varepsilon_j^2 > t\Big)\,\mathrm{d}t
\le \int_{\mathbb{R}_+} \min\Big\{\sum_{j=1}^k \mathbb{P}\big(\varepsilon_j^2 > t\big),\, 1\Big\}\,\mathrm{d}t
\le \int_{\mathbb{R}_+} \min\big\{2k\,\mathbb{P}\big(\varepsilon_1 > \sqrt{t}\big),\, 1\big\}\,\mathrm{d}t
\]
\[
\le \int_{\mathbb{R}_+} \min\Big\{k\exp\Big(-\frac{t}{2\sigma^2}\Big),\, 1\Big\}\,\mathrm{d}t
\le 2\sigma^2\log(k) + \int_{2\sigma^2\log(k)}^{+\infty} k\exp\Big(-\frac{t}{2\sigma^2}\Big)\,\mathrm{d}t
\le 2\sigma^2\log(k) + 2\sigma^2 = 2\sigma^2\log(ek).
\]

Accordingly, we obtain, \(P_W\) almost surely,
\[
(20)\qquad
\rho_{\theta'\mid\theta}\Big[\lambda^{-1}\psi\big(\lambda\tilde f(\theta',W)\big) - \tilde f(\theta',W)\Big]
\le \frac{\lambda}{\sqrt2}\Big[\Big(\min_j\langle\theta_j,W\rangle - (a+b)/2\Big)^2 + \rho_{\theta'\mid\theta}\Big(\max_j\langle\theta'_j-\theta_j,W\rangle^2\Big)\Big]
\le \frac{\lambda}{\sqrt2}\Big[(b-a)^2/4 + 2\log(ek)\,\lVert W\rVert_\infty^2/\beta\Big].
\]
The right-hand side of this inequality provides an almost sure upper bound for (18). To bound (17), or rather the expectation of an exponential moment of (17), we can write a PAC-Bayesian bound using the influence function \(\psi\). According to Lemma 14 on page 17,
\[
P_{W_1,\dots,W_n}\bigg(\exp\,\sup_{\theta\in\Theta}\Big[-n\lambda\,\rho_{\theta'\mid\theta}\,\overline P_W\Big(\lambda^{-1}\psi\big[\lambda\tilde f(\theta',W)\big]\Big)
- n\,\rho_{\theta'\mid\theta}\Big[\log\Big(P_W\Big[\exp\Big(\psi\big[-\lambda\tilde f(\theta',W)\big]\Big)\Big]\Big)\Big]
- \beta\lVert\theta\rVert^2\Big]\bigg)\le 1.
\]
Indeed, it is easy to check that the integrand of \(\rho_{\theta'\mid\theta}\) is integrable, so that we can apply the monotone convergence theorem to remove \(\eta\) from the equation produced by Lemma 14. Using the bound (16) on page 23 and removing the exponential according to Jensen's inequality, we obtain
\[
P_{W_1,\dots,W_n}\bigg(\sup_{\theta\in\Theta}\Big[\rho_{\theta'\mid\theta}\Big[P_W\big(\tilde f(\theta',W)\big) - \overline P_W\Big(\lambda^{-1}\psi\big[\lambda\tilde f(\theta',W)\big]\Big)\Big]
- \frac{\lambda}{\sqrt2}\,\rho_{\theta'\mid\theta}\Big[P_W\big(\tilde f(\theta',W)^2\big)\Big]\Big]\bigg)
\le \frac{\beta\lVert\Theta\rVert^2}{n\lambda}.
\]
Using the maximal inequality stated in Lemma 17 on the previous page to bound the variance term, we get
\[
P_{W_1,\dots,W_n}\bigg(\sup_{\theta\in\Theta}\rho_{\theta'\mid\theta}\Big[P_W\big(\tilde f(\theta',W)\big) - \overline P_W\Big(\lambda^{-1}\psi\big[\lambda\tilde f(\theta',W)\big]\Big)\Big]\bigg)
\le \frac{\lambda}{\sqrt2}\Big[\frac{(b-a)^2}{4} + \frac{2\log(ek)\lVert W\rVert_\infty^2}{\beta}\Big] + \frac{\beta\lVert\Theta\rVert^2}{n\lambda}.
\]
This provides an upper bound for (17). Combining it with the upper bound for (18) gives an upper bound for (14) that reads
\[
P_{W_1,\dots,W_n}\bigg(\sup_{\theta\in\Theta}\big(P_W - \overline P_W\big)\,\rho_{\theta'\mid\theta}\big[f(\theta',W)\big]\bigg)
\le \frac{\sqrt2\,\lambda}{4}\Big[(b-a)^2 + \frac{8\log(ek)\lVert W\rVert_\infty^2}{\beta}\Big] + \frac{\beta\lVert\Theta\rVert^2}{n\lambda}.
\]
Choosing
\[
\lambda = \sqrt{\frac{4\,\beta\lVert\Theta\rVert^2}{\sqrt2\,\big[(b-a)^2 + 8\log(ek)\lVert W\rVert_\infty^2/\beta\big]\,n}}
\]
gives
\[
P_{W_1,\dots,W_n}\bigg(\sup_{\theta\in\Theta}\big(P_W - \overline P_W\big)\,\rho_{\theta'\mid\theta}\big[f(\theta',W)\big]\bigg)
\le \widetilde F(\beta) \overset{\mathrm{def}}{=} \sqrt{\frac{\sqrt2\,\big(\beta(b-a)^2 + 8\log(ek)\lVert W\rVert_\infty^2\big)\,\lVert\Theta\rVert^2}{n}}.
\]
Putting everything together, we obtain
\[
P_{W_1,\dots,W_n}\Big(\sup_{\theta\in\Theta}\big(P_W - \overline P_W\big) f(\theta,W)\Big)
\le \sqrt{\frac{2\log(2k)}{\beta}}\,\lVert W\rVert_\infty + \widetilde F\big(2^{-p}\beta\big) + pF,
\]
where \(F\) is defined by equation (11) on page 22. Let us choose \(\beta = 2n\lVert\Theta\rVert^{-2}\) and \(p = \big\lfloor\log(n/k)/\log(2)\big\rfloor\), so that \(2^{-p}\beta \le 4k\lVert\Theta\rVert^{-2}\). We get
\[
P_{W_1,\dots,W_n}\Big(\sup_{\theta\in\Theta}\big(P_W - \overline P_W\big) f(\theta,W)\Big)
\le \bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2\log(2k)}{n}} + 2\sqrt{\frac{\log(k)}{n}}\bigg)\lVert\Theta\rVert\,\lVert W\rVert_\infty
+ \sqrt{\frac{4\sqrt2\,\big(k(b-a)^2 + 2\log(ek)\lVert W\rVert_\infty^2\lVert\Theta\rVert^2\big)}{n}}.
\]
The upper deviations from this mean are controlled by the extension of Hoeffding's bound called the bounded difference inequality (see Section 6.1 and Theorem 6.2 in [7]). It gives, with probability at least \(1-\delta\),
\[
\sup_{\theta\in\Theta}\big(P_W - \overline P_W\big) f(\theta,W)
\le P_{W_1,\dots,W_n}\Big(\sup_{\theta\in\Theta}\big(P_W - \overline P_W\big) f(\theta,W)\Big) + (b-a)\sqrt{\frac{\log(\delta^{-1})}{2n}}.
\]
This proves the first statement of the lemma. To get the second one, add to the previous inequality
\[
P_{W_1,\dots,W_n}\Big(\big(P_W - \overline P_W\big) f(\theta^*,W)\Big) = 0
\]
to get
\[
P_{W_1,\dots,W_n}\Big(\sup_{\theta\in\Theta}\big(P_W - \overline P_W\big)\big(f(\theta,W) - f(\theta^*,W)\big)\Big)
\le \bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2\log(2k)}{n}} + 2\sqrt{\frac{\log(k)}{n}}\bigg)\lVert\Theta\rVert\,\lVert W\rVert_\infty
+ \sqrt{\frac{4\sqrt2\,\big(k(b-a)^2 + 2\log(ek)\lVert W\rVert_\infty^2\lVert\Theta\rVert^2\big)}{n}},
\]
and apply the bounded difference inequality to get the deviations. To prove the end of the proposition concerning an estimator \(\widehat\theta\), apply what has already been proved to the weak closure \(\overline\Theta\) of \(\Theta\) and to
\[
\theta^* \in \arg\min_{\theta\in\overline\Theta} P_W\Big(\min_{j\in\llbracket 1,k\rrbracket}\langle\theta_j,W\rangle\Big),
\]
which exists thanks to Proposition 13 on page 16.
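The maximal inequality of Lemma 17 is easy to probe empirically. The following Python snippet is not part of the original argument; it is a minimal Monte Carlo check of the reconstructed bound \(\mathbb{E}(\max_j \varepsilon_j^2) \le 2\sigma^2\log(ek)\), run here for independent Gaussian variables (the lemma itself does not require independence, so this only illustrates one case).

\begin{verbatim}
import numpy as np

# Monte Carlo illustration (not from the paper) of the maximal inequality
# E[max_{1<=j<=k} eps_j^2] <= 2 sigma^2 log(e k) for eps_j ~ N(0, sigma^2).
rng = np.random.default_rng(0)
sigma, n_rep = 1.0, 200_000

for k in (2, 8, 32, 128):
    eps = rng.normal(0.0, sigma, size=(n_rep, k))
    empirical = np.max(eps**2, axis=1).mean()    # estimate of E[max_j eps_j^2]
    bound = 2 * sigma**2 * np.log(np.e * k)      # right-hand side of the lemma
    print(f"k={k:4d}  E[max eps^2] ~= {empirical:6.3f}"
          f"  <=  2*sigma^2*log(e*k) = {bound:6.3f}")
\end{verbatim}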
6. Generalization bounds for the quadratic k-means criterion. The most obvious application of the previous lemma is to obtain a dimension-free bound for the usual quadratic k-means criterion.
PROPOSITION 18. Consider a random vector \(X\) in a separable Hilbert space \(H\). Let \((X_1, \dots, X_n)\) be a sample made of \(n\) independent copies of \(X\). Consider the ball of radius \(B\)
\[
\mathcal B = \big\{x\in H : \lVert x\rVert \le B\big\}
\]
and assume that \(P(X\in\mathcal B) = 1\), that \(n \ge 2k\) and that \(k \ge 2\). For any \(\delta\in\,]0,1[\), with probability at least \(1-\delta\),
\[
\sup_{c\in\mathcal B^k}\big(P_X - \overline P_X\big)\Big(\min_{j\in\llbracket1,k\rrbracket}\big(\lVert c_j\rVert^2 - 2\langle c_j,X\rangle\big)\Big)
\le B^2\log\Big(\frac nk\Big)\sqrt{\frac{k\log(k)}{n}}\,\bigg(6\sqrt2 + \frac{6}{\log(n/k)} + \frac{1}{\log(n/k)}\sqrt{\frac{8\sqrt2\,\big(17+9\log(k)\big)}{\log(k)}}\bigg)
+ 2B^2\sqrt{\frac{2\log(\delta^{-1})}{n}}
\]
\[
\le 16\,B^2\log\Big(\frac nk\Big)\sqrt{\frac{k\log(k)}{n}} + 2B^2\sqrt{\frac{2\log(\delta^{-1})}{n}}.
\]
Concerning the excess risk, for any \(c^*\in\mathcal B^k\), with probability at least \(1-\delta\),
\[
\sup_{c\in\mathcal B^k}\big(P_X - \overline P_X\big)\Big[\Big(\min_{j\in\llbracket1,k\rrbracket}\lVert X-c_j\rVert^2\Big) - \Big(\min_{j\in\llbracket1,k\rrbracket}\lVert X-c^*_j\rVert^2\Big)\Big]
\le 16\,B^2\log\Big(\frac nk\Big)\sqrt{\frac{k\log(k)}{n}} + 4B^2\sqrt{\frac{2\log(\delta^{-1})}{n}}.
\]
Consequently, for any \(\varepsilon \ge 0\), for any \(\varepsilon\)-minimizer \(\widehat c\), that is for any \(\widehat c\in\mathcal B^k\) depending on the observed sample and satisfying
\[
\overline P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\lVert X-\widehat c_j\rVert^2\Big) \le \inf_{c\in H^k}\overline P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\lVert X-c_j\rVert^2\Big) + \varepsilon,
\]
for any \(\delta\in\,]0,1[\), with probability at least \(1-\delta\),
\[
P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\lVert X-\widehat c_j\rVert^2\Big)
\le \inf_{c\in H^k} P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\lVert X-c_j\rVert^2\Big)
+ 16\,B^2\log\Big(\frac nk\Big)\sqrt{\frac{k\log(k)}{n}} + 4B^2\sqrt{\frac{2\log(\delta^{-1})}{n}} + \varepsilon.
\]
Moreover, we also have a bound in expectation with respect to the statistical sample distribution:
\[
P_{X_1,\dots,X_n}\Big[P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\lVert X-\widehat c_j\rVert^2\Big)\Big]
\le \inf_{c\in H^k} P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\lVert X-c_j\rVert^2\Big)
+ 16\,B^2\log\Big(\frac nk\Big)\sqrt{\frac{k\log(k)}{n}} + \varepsilon.
\]
The general meaning of this proposition is that a chaining argument yields a dimension-free non-asymptotic generalization bound that decreases as \(\sqrt{k/n}\), up to logarithmic factors.

PROOF. We choose to work with the risk function
\[
\min_{j\in\llbracket1,k\rrbracket}\big(\lVert c_j\rVert^2 - 2\langle c_j,X\rangle\big) = \min_{j\in\llbracket1,k\rrbracket}\lVert X - c_j\rVert^2 - \lVert X\rVert^2,
\]
because this provides slightly better constants. Introduce
\[
W = (-2X,\, \gamma B)\in H\times\mathbb R
\quad\text{and}\quad
\theta_j = \big(c_j,\, \gamma^{-1}\lVert c_j\rVert^2 B^{-1}\big),
\]
where the parameter \(\gamma > 0\) will be optimized later on. Remark that
\[
\lVert c_j\rVert^2 - 2\langle c_j,X\rangle = \langle\theta_j, W\rangle \in [-B^2,\, 3B^2].
\]
Note also that
\[
\lVert W\rVert^2\,\lVert\theta_j\rVert^2 \le B^4\big(4+\gamma^2\big)\big(1+\gamma^{-2}\big) = B^4\big(5 + \gamma^2 + 4\gamma^{-2}\big)
\]
and optimize the right-hand side, choosing \(\gamma = \sqrt2\), to get \(\lVert W\rVert^2\lVert\Theta\rVert^2 \le 9kB^4\), where
\[
\Theta = \Big\{\big(c_j,\, 2^{-1/2}B^{-1}\lVert c_j\rVert^2\big)_{j=1}^k \in (H\times\mathbb R)^k : c\in\mathcal B^k\Big\}.
\]
The proposition is then a transcription of Lemma 15 on page 19, together with the simplification
\[
(21)\qquad \min\Bigg\{2,\ \log\Big(\frac nk\Big)\sqrt{\frac{k\log(k)}{n}}\bigg(6\sqrt2 + \frac{6}{\log(n/k)} + \frac{1}{\log(n/k)}\sqrt{\frac{8\sqrt2\big(17+9\log(k)\big)}{\log(k)}}\bigg)\Bigg\}
\le 16\log\Big(\frac nk\Big)\sqrt{\frac{k\log(k)}{n}},
\]
that holds for any \(k\ge2\) and any \(n\ge 2k\), and can be used since \(2B^2\) is a trivial bound for the left-hand side. Remark that in the three last inequalities of the proposition we can take the infimum on \(c\in H^k\) instead of \(c\in\mathcal B^k\), since it is in fact reached on \(\mathcal B^k\).

Thus, all that remains to prove is (21). Putting
\[
a = 6\sqrt2,\quad b = 16,\quad \rho = n/k,\quad
\eta(k) = 6 + \sqrt{\frac{8\sqrt2\big(17+9\log(k)\big)}{\log(k)}},
\]
\[
f(\rho,k) = \sqrt{\log(k)/\rho}\,\big(a\log(\rho) + \eta(k)\big)
\quad\text{and}\quad
g(\rho,k) = b\,\sqrt{\log(k)/\rho}\,\log(\rho),
\]
we have to prove that
\[
\min\big\{2,\, f(\rho,k)\big\} \le g(\rho,k),\qquad \rho\ge2,\ k\ge2.
\]
In other words, we have to prove that, when \(g(\rho,k) < f(\rho,k)\), then \(g(\rho,k)\ge2\). This can also be written as
\[
g(\rho,k)\ge2,\qquad \text{when } \min\{\rho,k\}\ge2 \text{ and } g(\rho,k) < f(\rho,k).
\]
According to the definitions, this is also equivalent to
\[
\log(\rho) - 2\log\big(\log(\rho)\big) \le \log(b^2/4) + \log\big(\log(k)\big),\qquad
\text{when } \min\{\rho,k\}\ge2 \text{ and } (b-a)\log(\rho) \le \eta(k).
\]
Since \(\eta\) is decreasing and since \(k\mapsto\log\big(\log(k)\big)\) is increasing, if the statement is true for \(k=2\), it is true for any \(k\ge2\). Thus we have to prove that
\[
\log(\rho) - 2\log\big(\log(\rho)\big) \le \log(b^2/4) + \log\big(\log(2)\big),\qquad
\log(2) \le \log(\rho) \le \eta(2)/(b-a).
\]
Putting \(\xi = \log(\rho)\), we have to prove that
\[
\xi - 2\log(\xi) \le \log(b^2/4) + \log\big(\log(2)\big),\qquad \log(2)\le\xi\le\eta(2)/(b-a).
\]
Since \(\xi\mapsto\xi - 2\log(\xi)\) is convex, it is enough to check the inequality at the two ends of the interval, that is when \(\xi\in\{\log(2),\,\eta(2)/(b-a)\}\), which can be done numerically. More precisely, we have to check that
\[
\log(b^2/4) + \log\big(\log(2)\big) - \max\Big\{\log(2) - 2\log\big(\log(2)\big),\ \eta(2)/(b-a) - 2\log\big(\eta(2)/(b-a)\big)\Big\} \ge 0,
\]
and we get numerically that the left-hand side is indeed positive.
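To make the order of magnitude of Proposition 18 concrete, here is a small helper, ours and not the authors', that evaluates the excess-risk bound in the form reconstructed above, namely \(16B^2\log(n/k)\sqrt{k\log(k)/n} + 4B^2\sqrt{2\log(\delta^{-1})/n} + \varepsilon\); the numerical constants are those of that reconstruction, not an independent derivation.

\begin{verbatim}
import math

def quadratic_kmeans_excess_risk_bound(n, k, B, delta, eps=0.0):
    # Excess-risk bound of Proposition 18 (as reconstructed above):
    # 16 B^2 log(n/k) sqrt(k log(k)/n) + 4 B^2 sqrt(2 log(1/delta)/n) + eps,
    # stated for k >= 2, n >= 2k and P(||X|| <= B) = 1.
    assert k >= 2 and n >= 2 * k and 0.0 < delta < 1.0
    main = 16.0 * B ** 2 * math.log(n / k) * math.sqrt(k * math.log(k) / n)
    deviation = 4.0 * B ** 2 * math.sqrt(2.0 * math.log(1.0 / delta) / n)
    return main + deviation + eps

# The bound decreases like sqrt(k/n) up to logarithmic factors:
for n in (10_000, 100_000, 1_000_000):
    print(n, round(quadratic_kmeans_excess_risk_bound(n, k=10, B=1.0, delta=0.05), 4))
\end{verbatim}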
7. Generalization bounds for the robust k-means criterion.
PROPOSITION 19. Let \(X\) be a random vector in a separable Hilbert space \(H\) and let \((X_1,\dots,X_n)\) be a statistical sample made of \(n\) independent copies of \(X\). Consider, for some scale parameter \(\sigma>0\), the criterion \(R\) of equation (5) on page 7 and its empirical counterpart
\[
\overline R(c) = 2\sigma^2\,\overline P_X\Big[1 - \exp\Big(-\frac{1}{2\sigma^2}\min_{j\in\llbracket1,k\rrbracket}\lVert X - c_j\rVert^2\Big)\Big],\qquad c\in H^k.
\]
Consider any \(k\ge2\) and any \(n\ge 2k\). For any \(\delta\in\,]0,1[\), with probability at least \(1-\delta\), for any \(c\in H^k\),
\[
R(c) \le \overline R(c) + 2\sigma^2\Bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2k\log(2k)}{n}} + 2\sqrt{\frac{k\log(k)}{n}}
+ \sqrt{\frac{4\sqrt2\,k\big(1+2\log(ek)\big)}{n}} + \sqrt{\frac{\log(\delta^{-1})}{2n}}\Bigg).
\]
For any non-random family of centers \(c^*\in H^k\), with probability at least \(1-\delta\), for any \(c\in H^k\),
\[
R(c) - R(c^*) \le \overline R(c) - \overline R(c^*) + 2\sigma^2\Bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2k\log(2k)}{n}} + 2\sqrt{\frac{k\log(k)}{n}}
+ \sqrt{\frac{4\sqrt2\,k\big(1+2\log(ek)\big)}{n}} + \sqrt{\frac{2\log(\delta^{-1})}{n}}\Bigg).
\]
Consequently, for any \(\varepsilon\ge0\), if \(\widehat c\) is an \(\varepsilon\)-minimizer satisfying
\[
\overline R(\widehat c) \le \inf_{c\in H^k}\overline R(c) + \varepsilon,
\]
with probability at least \(1-\delta\),
\[
R(\widehat c) \le \inf_{c\in H^k} R(c) + 2\sigma^2\Bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2k\log(2k)}{n}} + 2\sqrt{\frac{k\log(k)}{n}}
+ \sqrt{\frac{4\sqrt2\,k\big(1+2\log(ek)\big)}{n}} + \sqrt{\frac{2\log(\delta^{-1})}{n}}\Bigg) + \varepsilon.
\]
In the same way, in expectation,
\[
P_{X_1,\dots,X_n}\big(R(\widehat c)\big) \le \inf_{c\in H^k} R(c) + 2\sigma^2\Bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2k\log(2k)}{n}} + 2\sqrt{\frac{k\log(k)}{n}}
+ \sqrt{\frac{4\sqrt2\,k\big(1+2\log(ek)\big)}{n}}\Bigg) + \varepsilon.
\]
As we can see, the robust criterion has a scale parameter \(\sigma\) that makes it possible to remove all integrability conditions on the sample distribution and all boundedness assumptions on the centers.

PROOF. According to the Aronszajn theorem [2], there exists a mapping \(\Psi : H \to \mathcal H\) into a reproducing kernel Hilbert space \(\mathcal H\) such that
\[
\exp\Big(-\frac{1}{2\sigma^2}\lVert x-y\rVert^2\Big) = \langle\Psi(x),\Psi(y)\rangle_{\mathcal H},\qquad x,y\in H.
\]
Moreover, the reproducing kernel Hilbert space \(\mathcal H\), being based on a continuous kernel defined on a separable topological space, is separable according to [13, Lemma 4.33, page 130]. We can express the risk as
\[
R(c) = 2\sigma^2\Big[1 + P_W\Big(\min_{j\in\llbracket1,k\rrbracket}\langle-\theta_j,\, W\rangle_{\mathcal H}\Big)\Big],
\]
where \(\theta_j = \Psi(c_j)\) and \(W = \Psi(X)\). The proof then follows from Lemma 15 on page 19, taking into account that \(\theta_j\) and \(W\) belong to the unit ball of \(\mathcal H\), so that \(\lVert W\rVert_\infty = 1\) and \(\lVert\Theta\rVert = \sqrt k\).
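To illustrate why the scale parameter removes integrability requirements, the following snippet, which is not taken from the paper, evaluates the empirical robust criterion \(\overline R(c)\) displayed above on a heavy-tailed sample in \(\mathbb R^d\); every observation contributes at most \(2\sigma^2\), so the criterion stays finite without any moment assumption.

\begin{verbatim}
import numpy as np

def robust_kmeans_criterion(X, centers, sigma):
    # Empirical robust k-means criterion
    #   (2 sigma^2 / n) * sum_i [ 1 - exp( - min_j ||X_i - c_j||^2 / (2 sigma^2) ) ],
    # written here for X of shape (n, d) and centers of shape (k, d).
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    closest = sq_dist.min(axis=1)                                       # min_j ||X_i - c_j||^2
    return float(2 * sigma**2 * np.mean(1.0 - np.exp(-closest / (2 * sigma**2))))

# Each data point contributes at most 2 sigma^2, so a heavy-tailed sample
# (here Student t with 1.5 degrees of freedom, no second moment) cannot blow it up.
rng = np.random.default_rng(0)
X = rng.standard_t(df=1.5, size=(1000, 2))
centers = rng.normal(size=(3, 2))
print(robust_kmeans_criterion(X, centers, sigma=1.0))
\end{verbatim}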
8. Generalization bounds for the information k-means criterion. In order to apply Lemma 15 on page 19 and obtain a generalization bound, we are going to linearize the information k-means algorithm presented in Section 4 on page 8, using the kernel trick.

Let us introduce the separable Hilbert space \(H = L^2(\nu)\times\mathbb R\) equipped with the inner product
\[
\langle h, h'\rangle = \langle h_1, h'_1\rangle_{L^2(\nu)} + \mu\, h_2 h'_2,\qquad h = (h_1,h_2),\ h' = (h'_1,h'_2)\in H,
\]
where \(\mu>0\) is a positive real parameter to be chosen afterwards. The associated norm is
\[
\lVert(h_1,h_2)\rVert = \sqrt{\langle h_1,h_1\rangle_{L^2(\nu)} + \mu h_2^2} = \sqrt{\int h_1^2\,\mathrm d\nu + \mu h_2^2},\qquad (h_1,h_2)\in H.
\]
Define for any constant \(B\in\mathbb R_+\)
\[
\Theta_B = \Big\{\big(q,\,\mathcal K(q,1)\big) : q\in L^1_+(\nu)\cap L^2(\nu),\ \int q^2\,\mathrm d\nu \le B^2\Big\}\subset H,
\]
this definition being justified by the fact that
\[
(22)\qquad \mathcal K(q,1) = \int q\log(q)\,\mathrm d\nu \le \log\Big(\int q^2\,\mathrm d\nu\Big) < +\infty
\]
whenever \(\int q^2\,\mathrm d\nu < +\infty\).
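For completeness — this short step is not spelled out in the text — inequality (22) follows from Jensen's inequality applied to the probability measure \(q\,\mathrm d\nu\), assuming as above that \(q\) is a probability density with respect to \(\nu\):
\[
\int q\log(q)\,\mathrm d\nu = \int \log(q)\, q\,\mathrm d\nu \le \log\Big(\int q\cdot q\,\mathrm d\nu\Big) = \log\Big(\int q^2\,\mathrm d\nu\Big),
\]
since \(\log\) is concave and \(q\,\mathrm d\nu\) has total mass one.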
LEMMA 20. Assume that \(\operatorname{ess\,sup}_X \int \log(p_X)^2\,\mathrm d\nu < \infty\) and \(\operatorname{ess\,sup}_X \int p_X^2\,\mathrm d\nu < \infty\). Remark first that the smallest information ball containing the support of \(P_{p_X}\) has an information radius
\[
\inf_{q\in L^1_+(\nu)}\operatorname{ess\,sup}_X \mathcal K(q, p_X) \le \operatorname{ess\,sup}_X \mathcal K(1, p_X)
= \operatorname{ess\,sup}_X \int\log\big(p_X^{-1}\big)\,\mathrm d\nu
\le \operatorname{ess\,sup}_X\Big(\int\log(p_X)^2\,\mathrm d\nu\Big)^{1/2} < \infty.
\]
Define
\[
B = \operatorname{ess\,sup}_X\Big(\int p_X^2\,\mathrm d\nu\Big)^{1/2}\exp\Big[\inf_{q\in L^1_+(\nu)}\operatorname{ess\,sup}_X \mathcal K(q, p_X)\Big] < \infty
\]
and consider the random variable
\[
W = \big(-\log(p_X),\,\mu^{-1}\big)\in H.
\]
The following two minimization problems are equivalent:
\[
\inf_{q\in(L^1_+(\nu))^k} P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K(q_j, p_X)\Big)
= \inf_{\theta\in\Theta_B^k} P_W\Big(\min_{j\in\llbracket1,k\rrbracket}\langle\theta_j, W\rangle_H\Big).
\]

PROOF. Let \(B' = \operatorname{ess\,sup}_X\big(\int p_X^2\,\mathrm d\nu\big)^{1/2}\) and \(C = \operatorname{ess\,sup}_X\big(\int\log(p_X)^2\,\mathrm d\nu\big)^{1/2}\). First let us remark that, under the hypotheses of the lemma, the information k-means criterion is finite. Indeed,
\[
\inf_{q\in(L^1_+(\nu))^k} P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K(q_j, p_X)\Big)
\le P_X\big(\mathcal K(1, p_X)\big) = P_X\Big(\int\log\big(p_X^{-1}\big)\,\mathrm d\nu\Big) \le C < +\infty.
\]
Now, for any measurable classification function \(\ell : \mathcal X\to\llbracket1,k\rrbracket\) for which the criterion is finite, we know from Lemma 11 on page 15 that \(q^{\star,\ell}_j\in L^2(\nu)\), and we can remark that
\[
\mathcal K\big(q^{\star,\ell}_j, p_X\big) = \big\langle\theta^{\star,\ell}_j, W\big\rangle_H,
\quad\text{where}\quad
\theta^{\star,\ell}_j = \big(q^{\star,\ell}_j,\, \mathcal K(q^{\star,\ell}_j, 1)\big).
\]
Remark that this definition is justified by the fact that
\[
\mathcal K(q,1) = \int q\log(q)\,\mathrm d\nu \le \log\Big(\int q^2\,\mathrm d\nu\Big),
\]
so that \(q^{\star,\ell}_j\in L^2(\nu)\) implies that \(\mathcal K\big(q^{\star,\ell}_j,1\big) < +\infty\). So, to conclude the proof, it is sufficient to show that \(\theta^{\star,\ell}_j\in\Theta_B\). As in the proof of Lemma 11 on page 15,
\[
\int\big(q^{\star,\ell}_j\big)^2\,\mathrm d\nu \le Z_j^{-2}\,P_{X\mid\ell(X)=j}\Big(\underbrace{\int p_X^2\,\mathrm d\nu}_{\le B'^2}\Big) \le Z_j^{-2} B'^2,\qquad j\in\llbracket1,k\rrbracket.
\]
By Jensen's inequality, for any \(j\in\llbracket1,k\rrbracket\),
\[
Z_j = \sup_{q\in L^1_+(\nu)}\int q\exp\Big\{P_{X\mid\ell(X)=j}\big[\log(p_X/q)\big]\Big\}\,\mathrm d\nu
\ge \sup_{q\in L^1_+(\nu)}\exp\Big\{P_{X\mid\ell(X)=j}\Big[\int q\log(p_X/q)\,\mathrm d\nu\Big]\Big\}
= \exp\Big\{-\inf_{q\in L^1_+(\nu)} P_{X\mid\ell(X)=j}\big[\mathcal K(q, p_X)\big]\Big\}.
\]
Hence
\[
Z_j^{-1} \le \exp\Big\{\inf_{q\in L^1_+(\nu)}\operatorname{ess\,sup}_X \mathcal K(q, p_X)\Big\} \le \exp(C).
\]
Therefore
\[
\Big(\int\big(q^{\star,\ell}_j\big)^2\,\mathrm d\nu\Big)^{1/2} \le B'\exp\Big[\inf_{q\in L^1_+(\nu)}\operatorname{ess\,sup}_X \mathcal K(q, p_X)\Big] = B \le B'\exp(C) < \infty,
\]
proving that \(B<\infty\) and that \(\theta^{\star,\ell}_j = \big(q^{\star,\ell}_j,\, \mathcal K(q^{\star,\ell}_j,1)\big)\in\Theta_B\), which concludes the proof.
PROPOSITION 21. Under the hypotheses of the previous lemma, there exists an optimal quantizer \(\theta^\star\in\Theta_B^k\) minimizing the k-means risk, that is, such that
\[
\mathbb E\Big(\min_{j\in\llbracket1,k\rrbracket}\langle\theta^\star_j, W\rangle\Big) = \inf_{\theta\in\Theta_B^k}\mathbb E\Big(\min_{j\in\llbracket1,k\rrbracket}\langle\theta_j, W\rangle\Big).
\]

PROOF. Note that
\[
\lVert\Theta_B\rVert = \sup_{\theta\in\Theta_B}\lVert\theta\rVert \le \sqrt{B^2 + 4\mu\log(B)^2} < +\infty,
\]
according to equation (22) on page 30. Therefore \(\Theta_B\) is bounded. Applying Proposition 13 on page 16 to \(\Theta_B^k\), we find \(\widetilde\theta\in\overline{\Theta_B}^{\,k}\), the weak closure of \(\Theta_B^k\), such that
\[
R(\widetilde\theta) \overset{\mathrm{def}}{=} P_W\Big(\min_{j\in\llbracket1,k\rrbracket}\langle\widetilde\theta_j, W\rangle\Big) = \inf_{\theta\in\Theta_B^k} R(\theta).
\]
Remark now that, since, according to the Donsker–Varadhan representation,
\[
\mathcal K(q,1) = \sup_{h\in L^\infty(\nu)}\int h\,q\,\mathrm d\nu - \log\Big(\int\exp(h)\,\mathrm d\nu\Big),
\]
the function \(q\mapsto\mathcal K(q,1)\), defined on \(L^2(\nu)\cap L^1_+(\nu)\), is weakly lower semi-continuous. Indeed, it is a supremum of weakly continuous functions. Accordingly, its epigraph is weakly closed. As \(\Theta_B\) belongs to this epigraph, its weak closure also belongs to it. This implies that for each \(j\in\llbracket1,k\rrbracket\), \(\widetilde\theta_j\) belongs to it, so that \(\widetilde\theta = \big((q_j, y_j),\ j\in\llbracket1,k\rrbracket\big)\), where \(y_j \ge \mathcal K(q_j,1)\). Indeed, the weak closure of \(\Theta_B^k\) is the product \(\overline{\Theta_B}^{\,k}\) of \(k\) copies of the weak closure of \(\Theta_B\). Let us put \(\theta^* = \big((q_j, \mathcal K(q_j,1)),\ j\in\llbracket1,k\rrbracket\big)\). By monotonicity of \(R\) with respect to \(y_j\), the corresponding coefficient of \(W\) being positive,
\[
\inf_{\theta\in\Theta_B^k} R(\theta) = R(\widetilde\theta) \ge R(\theta^*).
\]
Since \(\theta^*\in\Theta_B^k\), the reverse inequality also holds and \(R(\theta^*) = \inf_{\theta\in\Theta_B^k} R(\theta)\).

The link we just made between the information k-means criterion and the linear k-means criterion allows us to apply Lemma 15 on page 19, proving the next proposition.
PROPOSITION 22. Assume that \(\operatorname{ess\,sup}_X\int p_X^2\,\mathrm d\nu < +\infty\) and \(\operatorname{ess\,sup}_X\int\log(p_X)^2\,\mathrm d\nu < +\infty\). Consider the information radius
\[
\mathcal R = \inf_{q\in L^1_+(\nu)}\operatorname{ess\,sup}_X \mathcal K\big(q, p_X\big)
\]
and the bounds
\[
B = \operatorname{ess\,sup}_X\Big(\int p_X^2\,\mathrm d\nu\Big)^{1/2}\exp(\mathcal R)
\quad\text{and}\quad
C = \operatorname{ess\,sup}_X\Big(\int\log(p_X)^2\,\mathrm d\nu\Big)^{1/2}.
\]
Introduce the parameter space
\[
\mathcal Q_B = \Big\{q\in L^1_+(\nu)\cap L^2(\nu) : \int q^2\,\mathrm d\nu\le B^2\Big\}.
\]
Given \((X_1,\dots,X_n)\), a sample made of \(n\) independent copies of \(X\), with probability at least \(1-\delta\), for any \(q\in\mathcal Q_B^k\),
\[
\big(P_X - \overline P_X\big)\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K(q_j, p_X)\Big)
\le \Bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2k\log(2k)}{n}} + 2\sqrt{\frac{k\log(k)}{n}}
+ \sqrt{\frac{4\sqrt2\,k\big(1+2\log(ek)\big)}{n}} + \sqrt{\frac{\log(\delta^{-1})}{2n}}\Bigg)\big(BC + 2\log(B)\big).
\]
For some \(\varepsilon\ge0\), consider an empirical \(\varepsilon\)-minimizer \(\widehat q(X_1,\dots,X_n)\in\mathcal Q_B^k\) satisfying
\[
\overline P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K\big(\widehat q_j, p_X\big)\Big)
\le \inf_{q\in\mathcal Q_B^k}\overline P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K(q_j, p_X)\Big) + \varepsilon.
\]
For any \(\delta\in\,]0,1[\), with probability at least \(1-\delta\),
\[
P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K\big(\widehat q_j, p_X\big)\Big)
\le \inf_{q\in(L^1_+(\nu))^k} P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K(q_j, p_X)\Big)
+ \Bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2k\log(2k)}{n}} + 2\sqrt{\frac{k\log(k)}{n}}
+ \sqrt{\frac{4\sqrt2\,k\big(1+2\log(ek)\big)}{n}} + \sqrt{\frac{2\log(\delta^{-1})}{n}}\Bigg)\big(BC + 2\log(B)\big) + \varepsilon.
\]
Moreover, in expectation,
\[
P_{X_1,\dots,X_n}\Big[P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K\big(\widehat q_j, p_X\big)\Big)\Big]
\le \inf_{q\in(L^1_+(\nu))^k} P_X\Big(\min_{j\in\llbracket1,k\rrbracket}\mathcal K(q_j, p_X)\Big)
+ \Bigg(\frac{\log(n/k)}{\log(2)}\sqrt{\frac{2k\log(2k)}{n}} + 2\sqrt{\frac{k\log(k)}{n}}
+ \sqrt{\frac{4\sqrt2\,k\big(1+2\log(ek)\big)}{n}}\Bigg)\big(BC + 2\log(B)\big) + \varepsilon.
\]

PROOF. Apply Lemma 20 on page 30. Note that, choosing \(\mu = \tfrac12\,B\,C^{-1}\log(B)^{-1}\), we get
\[
\lVert\Theta_B^k\rVert^2\,\lVert W\rVert_\infty^2 \le k\big(B^2 + 4\mu\log(B)^2\big)\big(C^2 + \mu^{-1}\big) = k\big(BC + 2\log(B)\big)^2.
\]
Remark also that for any \(\theta\in\Theta_B^k\), with probability one,
\[
\min_{j\in\llbracket1,k\rrbracket}\langle\theta_j, W\rangle \in \big[0,\ \lVert\Theta_B\rVert\,\lVert W\rVert_\infty\big] \subset \big[0,\ BC + 2\log(B)\big].
\]
Use these bounds in Lemma 15 on page 19 to conclude the proof.
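To fix ideas on how this criterion behaves on bags of words, here is a small Lloyd-type sketch, ours and not the paper's algorithm verbatim, for discrete smoothed histograms \(p_X\): documents are assigned to the center \(q_j\) minimizing \(\mathcal K(q_j, p_X)\), and each center is then replaced by the normalized geometric mean \(Z_j^{-1}\exp\big(P_{X\mid\ell(X)=j}[\log p_X]\big)\) that appears in the proof of Lemma 20. The additive smoothing constant that keeps \(\log p_X\) finite is our assumption, not part of the paper.

\begin{verbatim}
import numpy as np

def kl(q, p):
    # K(q, p) = sum_w q_w log(q_w / p_w) for discrete densities on a common vocabulary.
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def information_kmeans(P, k, n_iter=50, seed=0):
    # Lloyd-type alternation for the criterion P_X( min_j K(q_j, p_X) ) on rows P[i]
    # (smoothed histograms, each summing to 1): assign each p_X to the closest center
    # in KL divergence, then replace each center by the normalized geometric mean
    # exp(mean of log p_X) over its cluster, as in the proof of Lemma 20.
    rng = np.random.default_rng(seed)
    Q = P[rng.choice(len(P), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.array([np.argmin([kl(q, p) for q in Q]) for p in P])
        for j in range(k):
            members = P[labels == j]
            if len(members) == 0:
                continue
            log_mean = np.log(members).mean(axis=0)   # P_{X | l(X)=j}[ log p_X ]
            q = np.exp(log_mean)
            Q[j] = q / q.sum()                        # normalized geometric mean
    return Q, labels

# Toy bags of words: smoothed word histograms over a vocabulary of size 20.
rng = np.random.default_rng(1)
counts = rng.poisson(lam=rng.uniform(0.5, 5.0, size=(100, 20)))
P = (counts + 0.5) / (counts + 0.5).sum(axis=1, keepdims=True)  # smoothing keeps log finite
centers, labels = information_kmeans(P, k=3)
print(np.bincount(labels, minlength=3))
\end{verbatim}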
Acknowledgements.
We are grateful to Nikita Zhivotovskiy for useful comments and references.

REFERENCES
[1] Antos, A. (2005). Improved minimax bounds on the test and training distortion of empirically designed vector quantizers. IEEE Transactions on Information Theory.
[2] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society.
[3] Banerjee, A., Dhillon, I. and Ghosh, J. (2004). Clustering with Bregman Divergences. Journal of Machine Learning Research.
[4] Bartlett, P. L., Linder, T. and Lugosi, G. (1998). The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory.
[5] Ben-Tal, A., Charnes, A. and Teboulle, M. (1989). Entropic means. Journal of Mathematical Analysis and Applications.
[6] Biau, G., Devroye, L. and Lugosi, G. (2008). On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory.
[7] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
[8] Cao, J., Wu, Z., Wu, J. and Liu, W. (2013). Towards information-theoretic K-means clustering for image indexing. Signal Processing.
[9] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. École d'été de probabilités de Saint-Flour XXXI-2001. Lecture Notes in Mathematics no. 1851. Springer.
[10] Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. H. Poincaré Probab. Statist.
[11] Catoni, O. and Giulini, I. (2017). Dimension free PAC-Bayesian bounds for the estimation of the mean of a random vector. In the NIPS 2017 Workshop: (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian Trends and Insights.
[12] Catoni, O. and Giulini, I. (2017). Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747.
[13] Christmann, A. and Steinwart, I. (2008). Support Vector Machines.
[14] Dhillon, I. S., Mallela, S. and Kumar, R. (2003). A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research.
[15] Fefferman, C., Mitter, S. and Narayanan, H. (2016). Testing the manifold hypothesis. Journal of the American Mathematical Society.
[16] Fischer, A. (2010). Quantization and clustering with Bregman divergences. Journal of Multivariate Analysis.
[17] Fischer, A., Levrard, C. and Brécheteau, C. (2020). Robust Bregman Clustering. Annals of Statistics.
[18] Foster, D. J. and Rakhlin, A. (2019). ℓ∞ Vector Contraction for Rademacher Complexity. arXiv preprint arXiv:1911.06468.
[19] Giulini, I. (2015). Generalization bounds for random samples in Hilbert spaces. Thesis, École normale supérieure - ENS Paris.
[20] Jiang, B., Pei, J., Tao, Y. and Lin, X. (2011). Clustering uncertain data based on probability distribution similarity. IEEE Transactions on Knowledge and Data Engineering.
[21] Klochkov, Y., Kroshnin, A. and Zhivotovskiy, N. (2020). Robust k-means Clustering for Distributions with Two Moments. Annals of Statistics (forthcoming).
[22] Levrard, C. (2013). Fast rates for empirical vector quantization. Electron. J. Statist.
[23] Levrard, C. (2014). High-dimensional vector quantization: convergence rates and variable selection. Thesis, Université Paris Sud - Paris XI.
[24] Levrard, C. (2015). Nonasymptotic bounds for vector quantization in Hilbert spaces. Ann. Statist.
[25] Levrard, C. (2018). Quantization/Clustering: when and why does k-means work? Journal de la Société Française de Statistique.
[26] Massart, P. (2007). Concentration inequalities and model selection. In École d'été de probabilités de Saint-Flour (J. Picard, ed.).
[27] Nielsen, F. (2013). Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters.
[28] Nielsen, F. (2019). On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy.
[29] Nielsen, F. and Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory.
[30] Nielsen, F., Nock, R. and Amari, S.-I. (2014). On clustering histograms with k-means by using mixed α-divergences. Entropy.
[31] Pereira, F., Tishby, N. and Lee, L. (2002). Distributional Clustering of English Words. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics.
[32] Slonim, N. and Tishby, N. (1999). Agglomerative information bottleneck. In Proceedings of the 12th International Conference on Neural Information Processing Systems.
[33] Tishby, N., Pereira, F. and Bialek, W. (2001). The Information Bottleneck Method. Proceedings of the 37th Allerton Conference on Communication, Control and Computation.
[34] Veldhuis, R. (2002). The centroid of the symmetrical Kullback-Leibler distance.