On the symmetrical Kullback-Leibler Jeffreys centroids
Frank Nielsen∗
Sony Computer Science Laboratories, Inc., 3-14-13 Higashi Gotanda, 141-0022 Shinagawa-ku, Tokyo, Japan

∗ A first version of this paper appeared in IEEE Signal Processing Letters [1]. This revision includes a faster fixed-point technique in Section 3.3 and the R source code.
Abstract
Due to the success of the bag-of-words modeling paradigm, clustering histograms has become an important ingredient of modern information processing. Clustering histograms can be performed using the celebrated $k$-means centroid-based algorithm. From the viewpoint of applications, it is usually required to deal with symmetric distances. In this letter, we consider the Jeffreys divergence that symmetrizes the Kullback-Leibler divergence, and investigate the computation of Jeffreys centroids. We first prove that the Jeffreys centroid can be expressed analytically using the Lambert $W$ function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms. Finally, we conclude with some remarks on the $k$-means histogram clustering.

1 Introduction

Classifying documents into categories is a common task of information retrieval systems: Given a training set of documents labeled with categories, one asks to classify incoming new documents. Text categorization [2] proceeds by first defining a dictionary of words (the corpus). It then models each document by a word count, yielding a word histogram per document. Defining a proper distance $d(\cdot,\cdot)$ between histograms allows one to:

• Classify a new on-line document: we first calculate its histogram signature and then seek the labeled document which has the most similar histogram to deduce its tag (e.g., using a nearest-neighbor classifier).

• Find the initial set of categories: we cluster all document histograms and assign a category per cluster.

It has been shown experimentally that the Jeffreys divergence (symmetrizing the Kullback-Leibler divergence) achieves better performance than the traditional tf-idf method [2]. This text classification method based on the Bag of Words (BoWs) representation has also been instrumental in computer vision for efficient object categorization [3] and recognition in natural images. It first requires to create a dictionary of "visual words" by quantizing keypoints of the training database. Quantization is then performed using the $k$-means algorithm [9] that partitions $n$ data points $\mathcal{X}=\{x_1,\ldots,x_n\}$ into $k$ clusters $\mathcal{C}_1,\ldots,\mathcal{C}_k$, where each data element belongs to the closest cluster center. From a given initialization, Lloyd's batched $k$-means first assigns points to their closest cluster centers, then updates the cluster centers, and reiterates this process until convergence is met after a finite number of steps. When the distance function $d(x,y)$ is chosen as the squared Euclidean distance $d(x,y)=\|x-y\|^2$, the cluster centroids update to their centers of mass. Csurka et al. [3] used the squared Euclidean distance for building the visual vocabulary. Depending on the considered features, other distances have proven useful: For example, the Jeffreys divergence was shown to perform experimentally better than the Euclidean or squared Euclidean distances for Compressed Histogram of Gradient descriptors [4]. To summarize, $k$-means histogram clustering with respect to the Jeffreys divergence can be used both to quantize visual words to create a dictionary and to cluster document words for assigning initial categories.
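For concreteness, here is a minimal R sketch of Lloyd's batched $k$-means on histograms (stored as the rows of a matrix), using the squared Euclidean distance; the function and variable names are ours and not part of the original letter.

# Minimal sketch of Lloyd's batched k-means on histograms (rows of `set`),
# with the squared Euclidean distance; illustrative only.
lloydKMeans <- function(set, k, niter = 50) {
  n <- nrow(set)
  centers <- set[sample(n, k), , drop = FALSE]   # random initialization
  memb <- rep(1, n)
  for (it in 1:niter) {
    # Assignment step: each histogram goes to its closest cluster center.
    memb <- apply(set, 1, function(h) which.min(colSums((t(centers) - h)^2)))
    # Update step: each center moves to the center of mass of its cluster.
    for (j in 1:k) {
      if (any(memb == j)) centers[j, ] <- colMeans(set[memb == j, , drop = FALSE])
    }
  }
  list(centers = centers, membership = memb)
}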
Let $w_h=\sum_{i=1}^d h^i$ denote the cumulative sum of the bin values of histogram $h$. We distinguish between positive histograms and frequency histograms. A frequency histogram $\tilde h$ is a unit histogram (i.e., the cumulative sum $w_{\tilde h}$ of its bins adds up to one). In statistics, those positive and frequency histograms correspond respectively to positive discrete and multinomial distributions when all bins are non-empty. Let $\mathcal{H}=\{h_1,\ldots,h_n\}$ be a collection of $n$ histograms with $d$ positive-valued bins. By notational convention, we use the superscript and the subscript to indicate the bin number and the histogram number, respectively. Without loss of generality, we assume that all bins are non-empty: $h_j^i>0$, $1\le j\le n$, $1\le i\le d$. (Otherwise, we may add an arbitrarily small quantity $\epsilon>0$ to the bins.) To measure the distance between two such histograms $p$ and $q$, we rely on the relative entropy. The extended KL divergence [9] between two positive (but not necessarily normalized) histograms $p$ and $q$ is defined by $\mathrm{KL}(p:q)=\sum_{i=1}^d\left(p^i\log\frac{p^i}{q^i}+q^i-p^i\right)$. Observe that this information-theoretic dissimilarity measure is not symmetric, nor does it satisfy the triangular inequality property of metrics. Let $\tilde p=\frac{p}{\sum_{i=1}^d p^i}$ and $\tilde q=\frac{q}{\sum_{i=1}^d q^i}$ denote the corresponding normalized frequency histograms. In the remainder, the $\tilde{\ }$ notation denotes this normalization operator. The extended KL divergence formula applied to normalized histograms yields the traditional KL divergence [9]: $\mathrm{KL}(\tilde p:\tilde q)=\sum_{i=1}^d\tilde p^i\log\frac{\tilde p^i}{\tilde q^i}$, since $\sum_{i=1}^d(\tilde q^i-\tilde p^i)=\sum_{i=1}^d\tilde q^i-\sum_{i=1}^d\tilde p^i=1-1=0$. The KL divergence is the relative entropy between $\tilde p$ and $\tilde q$: $\mathrm{KL}(\tilde p:\tilde q)=H^\times(\tilde p:\tilde q)-H(\tilde p)$, where $H^\times(\tilde p:\tilde q)=-\sum_{i=1}^d\tilde p^i\log\tilde q^i$ denotes the cross-entropy and $H(\tilde p)=H^\times(\tilde p:\tilde p)=-\sum_{i=1}^d\tilde p^i\log\tilde p^i$ is the Shannon entropy. This distance is interpreted as the expected extra number of bits per datum that must be transmitted when using the "wrong" distribution $\tilde q$ instead of the true distribution $\tilde p$. Often $\tilde p$ is hidden by nature and needs to be hypothesized while $\tilde q$ is estimated. When clustering histograms, all histograms play the same role, and it is therefore better to consider the Jeffreys [5] divergence $J(p,q)=\mathrm{KL}(p:q)+\mathrm{KL}(q:p)$ that symmetrizes the KL divergence:

$$J(p,q)=\sum_{i=1}^d (p^i-q^i)\log\frac{p^i}{q^i}=J(q,p). \qquad (1)$$

Observe that the formula for the Jeffreys divergence holds for arbitrary positive histograms (including frequency histograms). This letter is devoted to computing efficiently the Jeffreys centroid $c$ of a set $\mathcal{H}=\{h_1,\ldots,h_n\}$ of weighted histograms, defined as:

$$c=\arg\min_x \sum_{j=1}^n \pi_j J(h_j,x), \qquad (2)$$

where the $\pi_j$'s denote the histogram positive weights (with $\sum_{j=1}^n\pi_j=1$). When all histograms $h_j\in\mathcal{H}$ are normalized, we require the minimization of $x$ to be carried out over $\Delta_d$, the $(d-1)$-dimensional probability simplex, yielding the Jeffreys frequency centroid $\tilde c$. Otherwise, for positive histograms $h_j\in\mathcal{H}$, the minimization of $x$ is done over the positive orthant $\mathbb{R}_+^d$, to get the Jeffreys positive centroid $c$. Since the $J$-divergence is convex in both arguments, both the Jeffreys positive and frequency centroids are unique.
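As an illustration (our own code, not part of the original letter), the extended KL divergence, the Jeffreys divergence of Eq. 1 and the weighted objective of Eq. 2 can be evaluated in a few lines of R on small positive histograms:

# Extended KL divergence between positive histograms p and q.
extKL <- function(p, q) sum(p * log(p / q) + q - p)

# Jeffreys divergence of Eq. (1); symmetric in its two arguments.
jeffreys <- function(p, q) sum((p - q) * log(p / q))

# Weighted objective of Eq. (2): sum_j pi_j J(h_j, x), histograms as rows of `set`.
jeffreysObjective <- function(set, weight, x) {
  sum(sapply(1:nrow(set), function(j) weight[j] * jeffreys(set[j, ], x)))
}

# Tiny usage example on two positive histograms with equal weights.
p <- c(1, 2, 3); q <- c(2, 2, 2)
jeffreys(p, q)                       # equals extKL(p, q) + extKL(q, p)
jeffreysObjective(rbind(p, q), c(0.5, 0.5), c(1.5, 2, 2.5))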
On the one hand, clustering histograms has been studied using various distances and clustering algorithms. Besides the classical Minkowski $\ell_p$-norm distances, hierarchical clustering with respect to the $\chi^2$ distance has been investigated in [8]. Banerjee et al. [9] generalized $k$-means to Bregman $k$-means, thus allowing one to cluster distributions of the same exponential family with respect to the KL divergence. Mignotte [10] used $k$-means with respect to the Bhattacharyya distance [11] on histograms of various color spaces to perform image segmentation. On the other hand, Jeffreys $k$-means has not yet been extensively studied as the centroid computations are non-trivial: In 2002, Veldhuis [12] reported an iterative Newton-like algorithm to approximate arbitrarily finely the Jeffreys frequency centroid $\tilde c$ of a set of frequency histograms that requires two nested loops. Nielsen and Nock [13] considered the information-geometric structure of the manifold of multinomials (frequency histograms) to report a simple geodesic bisection search algorithm (i.e., replacing the two nested loops of [12] by one single loop). Indeed, the family of frequency histograms belongs to the exponential families [9], and computing the Jeffreys frequency centroid amounts equivalently to computing a symmetrized Bregman centroid [13]. To bypass the explicit computation of the Jeffreys centroid, Nock et al. [14] generalized the Bregman $k$-means [9] and $k$-means++ seeding using mixed Bregman divergences: They consider two dual centroids $\tilde c_m$ and $\tilde c_m^*$ attached per cluster, and use the following divergence depending on these two centers: $\Delta\mathrm{KL}(\tilde c_m:x:\tilde c_m^*)=\mathrm{KL}(\tilde c_m:x)+\mathrm{KL}(x:\tilde c_m^*)$. Note, however, that this mixed Bregman 2-centroid-per-cluster clustering is different from the Jeffreys $k$-means clustering that relies on one centroid per cluster.

This letter is organized as follows: Section 2 reports a closed-form expression of the positive Jeffreys centroid for a set of positive histograms. Section 3 studies the guaranteed tight approximation factor obtained when normalizing the positive Jeffreys centroid, and further describes a simple bisection algorithm to arbitrarily finely approximate the optimal Jeffreys frequency centroid. Section 4 reports on our experimental results, which show that our normalized approximation is in practice tight enough to avoid the bisection process. Finally, Section 5 concludes this work.

2 Jeffreys positive centroid

We consider a set $\mathcal{H}=\{h_1,\ldots,h_n\}$ of $n$ positive weighted histograms with $d$ non-empty bins ($h_j\in\mathbb{R}_+^d$, $\pi_j>0$, $\sum_j\pi_j=1$). The Jeffreys positive centroid $c$ is defined by:

$$c=\arg\min_{x\in\mathbb{R}_+^d} J(\mathcal{H},x)=\arg\min_{x\in\mathbb{R}_+^d}\sum_{j=1}^n\pi_j J(h_j,x). \qquad (3)$$

We state the first result:

Theorem 1
The Jeffreys positive centroid $c=(c^1,\ldots,c^d)$ of a set $\{h_1,\ldots,h_n\}$ of $n$ weighted positive histograms with $d$ bins can be calculated component-wise exactly using the Lambert $W$ analytic function: $c^i=\frac{a^i}{W\left(\frac{a^i}{g^i}e\right)}$, where $a^i=\sum_{j=1}^n\pi_j h_j^i$ denotes the coordinate-wise arithmetic weighted mean and $g^i=\prod_{j=1}^n (h_j^i)^{\pi_j}$ the coordinate-wise geometric weighted mean.

Proof
We seek for $x\in\mathbb{R}_+^d$ that minimizes Eq. 3. After expanding the Jeffreys divergence formula of Eq. 1 in Eq. 3 and removing all additive terms independent of $x$, we find the following equivalent minimization problem:

$$\min_{x\in\mathbb{R}_+^d}\ \sum_{i=1}^d \left(x^i\log\frac{x^i}{g^i}-a^i\log x^i\right).$$

This optimization can be performed coordinate-wise, independently. For each coordinate, dropping the superscript notation and setting the derivative to zero, we have to solve $\log\frac{x}{g}+1-\frac{a}{x}=0$, which yields $x=\frac{a}{W\left(\frac{a}{g}e\right)}$, where $W(\cdot)$ denotes the Lambert $W$ function [15]. The Lambert function $W$ is defined by $W(x)e^{W(x)}=x$ for $x\ge 0$.
That is, the Lambert function is the functional inverse of $f(x)=xe^x=y$: $x=W(y)$. Although function $W$ may seem non-trivial at first sight, it is a popular elementary analytic function, similar to the logarithm or exponential functions. In practice, we get a fourth-order convergence algorithm to estimate it by implementing Halley's numerical root-finding method. It requires fewer than 5 iterations to reach machine accuracy using the IEEE 754 floating-point standard [15]. Notice that the Lambert $W$ function plays a particular role in information theory [16]. We consider only the principal branch $W_0$ [15] since the arguments of the function are always positive.
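As a sanity check, here is a small illustrative R snippet (ours; the Appendix provides the author's LambertW0 routine with an explicit convergence test) that evaluates $W_0$ by Halley iterations, forms the closed-form coordinate of Theorem 1, and verifies the optimality condition $\log\frac{x}{g}+1-\frac{a}{x}=0$ numerically:

# Halley iterations for the principal branch W0 of the Lambert W function.
lambertW0 <- function(z, iters = 20) {
  w <- ifelse(z < 1, z, log(z))             # rough starting point
  for (k in 1:iters) {
    e <- exp(w); f <- w * e - z
    w <- w - f / (e * (w + 1) - (w + 2) * f / (2 * w + 2))   # Halley update
  }
  w
}

# Closed-form Jeffreys positive centroid coordinate (Theorem 1): c = a / W0(a*e/g).
a <- 0.7; g <- 0.4                          # arbitrary weighted arithmetic/geometric means (a >= g)
cc <- a / lambertW0(a * exp(1) / g)
log(cc / g) + 1 - a / cc                    # optimality condition, ~0 up to machine precision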
3 Jeffreys frequency centroid

We consider a set $\tilde{\mathcal{H}}$ of $n$ frequency histograms: $\tilde{\mathcal{H}}=\{\tilde h_1,\ldots,\tilde h_n\}$. If we relax $x$ to the positive orthant $\mathbb{R}_+^d$ instead of the probability simplex, we get the optimal positive Jeffreys centroid $c$, with $c^i=\frac{a^i}{W\left(\frac{a^i}{g^i}e\right)}$ (Theorem 1). Normalizing this positive Jeffreys centroid to get $\tilde c'=\frac{c}{w_c}$ does not yield the Jeffreys frequency centroid $\tilde c$, which requires dedicated iterative optimization algorithms [12, 13]. In this section, we consider approximations of the Jeffreys frequency centroid. Veldhuis [12] approximated the Jeffreys frequency centroid $\tilde c$ by $\tilde c''=\frac{\tilde a+\tilde g}{2}$, where $\tilde a$ and $\tilde g$ denote the normalized weighted arithmetic and geometric means, respectively. The normalized geometric weighted mean $\tilde g=(\tilde g^1,\ldots,\tilde g^d)$ is defined by $\tilde g^i=\frac{\prod_{j=1}^n(\tilde h_j^i)^{\pi_j}}{\sum_{k=1}^d\prod_{j=1}^n(\tilde h_j^k)^{\pi_j}}$, $i\in\{1,\ldots,d\}$. Since $\sum_{i=1}^d\sum_{j=1}^n\pi_j\tilde h_j^i=\sum_{j=1}^n\pi_j\sum_{i=1}^d\tilde h_j^i=\sum_{j=1}^n\pi_j=1$, the normalized arithmetic weighted mean has coordinates $\tilde a^i=\sum_{j=1}^n\pi_j\tilde h_j^i$. We consider approximating the Jeffreys frequency centroid by normalizing the Jeffreys positive centroid $c$: $\tilde c'=\frac{c}{w_c}$. We start with a simple lemma:

Lemma 1
The cumulative sum $w_c$ of the bin values of the Jeffreys positive centroid $c$ of a set of frequency histograms is less than or equal to one: $0<w_c\le 1$.

Proof
Consider the frequency histograms $\tilde{\mathcal{H}}$ as positive histograms. It follows from Theorem 1 that the Jeffreys positive centroid $c$ is such that $w_c=\sum_{i=1}^d c^i=\sum_{i=1}^d\frac{a^i}{W\left(\frac{a^i}{g^i}e\right)}$. Now, the arithmetic-geometric mean inequality states that $a^i\ge g^i$, where $a^i$ and $g^i$ denote the coordinates of the arithmetic and geometric positive means. Therefore $W\left(\frac{a^i}{g^i}e\right)\ge W(e)=1$, and hence $c^i\le a^i$. Thus $w_c=\sum_{i=1}^d c^i\le\sum_{i=1}^d a^i=1$.

We consider approximating the Jeffreys frequency centroid on the probability simplex $\Delta_d$ by using the normalization of the Jeffreys positive centroid: $\tilde c'^i=\frac{a^i}{w\,W\left(\frac{a^i}{g^i}e\right)}$, with $w=\sum_{i=1}^d\frac{a^i}{W\left(\frac{a^i}{g^i}e\right)}$. To study the quality of this approximation, we use the following lemma:

Lemma 2
For any positive histogram $x$ and frequency histogram $\tilde h$, we have $J(x,\tilde h)=J(\tilde x,\tilde h)+(w_x-1)\left(\mathrm{KL}(\tilde x:\tilde h)+\log w_x\right)$, where $w_x$ denotes the normalization factor ($w_x=\sum_{i=1}^d x^i$).

Proof
It follows from the definition of the Jeffreys divergence and the fact that $x^i=w_x\tilde x^i$ that $J(x,\tilde h)=\sum_{i=1}^d(w_x\tilde x^i-\tilde h^i)\log\frac{w_x\tilde x^i}{\tilde h^i}$. Expanding and rewriting the right-hand side yields $J(x,\tilde h)=\sum_{i=1}^d\left(w_x\tilde x^i\log\frac{\tilde x^i}{\tilde h^i}+w_x\tilde x^i\log w_x+\tilde h^i\log\frac{\tilde h^i}{\tilde x^i}-\tilde h^i\log w_x\right)=(w_x-1)\log w_x+J(\tilde x,\tilde h)+(w_x-1)\sum_{i=1}^d\tilde x^i\log\frac{\tilde x^i}{\tilde h^i}=J(\tilde x,\tilde h)+(w_x-1)\left(\mathrm{KL}(\tilde x:\tilde h)+\log w_x\right)$, since $\sum_{i=1}^d\tilde h^i=\sum_{i=1}^d\tilde x^i=1$.

The lemma can be extended to a set of weighted frequency histograms $\tilde{\mathcal{H}}$: $J(x,\tilde{\mathcal{H}})=J(\tilde x,\tilde{\mathcal{H}})+(w_x-1)\left(\mathrm{KL}(\tilde x:\tilde{\mathcal{H}})+\log w_x\right)$, where $J(x,\tilde{\mathcal{H}})=\sum_{j=1}^n\pi_j J(x,\tilde h_j)$ and $\mathrm{KL}(\tilde x:\tilde{\mathcal{H}})=\sum_{j=1}^n\pi_j\,\mathrm{KL}(\tilde x:\tilde h_j)$ (with $\sum_{j=1}^n\pi_j=1$). We state the second theorem concerning our guaranteed approximation:

Theorem 2
Let $\tilde c$ denote the Jeffreys frequency centroid and $\tilde c'=\frac{c}{w_c}$ the normalized Jeffreys positive centroid. Then the approximation factor $\alpha_{\tilde c'}=\frac{J(\tilde c',\tilde{\mathcal{H}})}{J(\tilde c,\tilde{\mathcal{H}})}$ is such that $1\le\alpha_{\tilde c'}\le 1+\frac{1-w_c}{w_c}\,\frac{\mathrm{KL}(c,\tilde{\mathcal{H}})}{J(c,\tilde{\mathcal{H}})}\le\frac{1}{w_c}$ (with $w_c\le 1$).

Proof
We have $J(c,\tilde{\mathcal{H}})\le J(\tilde c,\tilde{\mathcal{H}})\le J(\tilde c',\tilde{\mathcal{H}})$. Using Lemma 2, since $J(\tilde c',\tilde{\mathcal{H}})=J(c,\tilde{\mathcal{H}})+(1-w_c)\left(\mathrm{KL}(\tilde c':\tilde{\mathcal{H}})+\log w_c\right)$ and $J(c,\tilde{\mathcal{H}})\le J(\tilde c,\tilde{\mathcal{H}})$, it follows that $1\le\alpha_{\tilde c'}\le 1+\frac{(1-w_c)\left(\mathrm{KL}(\tilde c':\tilde{\mathcal{H}})+\log w_c\right)}{J(\tilde c,\tilde{\mathcal{H}})}$. We also have $\mathrm{KL}(\tilde c':\tilde{\mathcal{H}})=\frac{1}{w_c}\mathrm{KL}(c,\tilde{\mathcal{H}})-\log w_c$ (by expanding the KL expression and using the fact that $w_c=\sum_i c^i$). Therefore $\alpha_{\tilde c'}\le 1+\frac{(1-w_c)\,\mathrm{KL}(c,\tilde{\mathcal{H}})}{w_c\,J(\tilde c,\tilde{\mathcal{H}})}$. Since $J(\tilde c,\tilde{\mathcal{H}})\ge J(c,\tilde{\mathcal{H}})$ and $\mathrm{KL}(c,\tilde{\mathcal{H}})\le J(c,\tilde{\mathcal{H}})$, we finally obtain $\alpha_{\tilde c'}\le\frac{1}{w_c}$. When $w_c=1$ the bound is tight.

Experimental results described in the next section show that this normalized Jeffreys positive centroid $\tilde c'$ almost coincides with the Jeffreys frequency centroid. It has been shown in [12, 13] that minimizing Eq. 2 over the probability simplex $\Delta_d$ amounts to minimizing the following equivalent problem:

$$\tilde c=\arg\min_{\tilde x\in\Delta_d}\ \mathrm{KL}(\tilde a:\tilde x)+\mathrm{KL}(\tilde x:\tilde g). \qquad (4)$$

Nevertheless, instead of using the two nested loops of Veldhuis' Newton scheme [12], we can design a single-loop optimization algorithm. We consider the Lagrangian function obtained by enforcing the normalization constraint $\sum_i\tilde c^i=1$, similarly to [12]. For each coordinate, setting the derivative with respect to $\tilde c^i$ to zero, we get $\log\frac{\tilde c^i}{\tilde g^i}+1-\frac{\tilde a^i}{\tilde c^i}+\lambda=0$, which solves as $\tilde c^i=\frac{\tilde a^i}{W\left(\frac{\tilde a^i e^{\lambda+1}}{\tilde g^i}\right)}$. By multiplying these $d$ constraints by $\tilde c^i$ respectively and summing up, we deduce that $\lambda=-\mathrm{KL}(\tilde c:\tilde g)\le 0$. Since the $\tilde c^i$'s should be less than one, we bound $\lambda$ as follows: $\tilde c^i=\frac{\tilde a^i}{W\left(\frac{\tilde a^i e^{\lambda+1}}{\tilde g^i}\right)}\le 1$, which solves for equality when $\lambda=\log(e^{\tilde a^i}\tilde g^i)-1$. Thus we seek $\lambda\in[\max_i\log(e^{\tilde a^i}\tilde g^i)-1,\,0]$. Since $s=\sum_i\tilde c^i=1$, we have the following cumulative-sum equation depending on the unknown parameter $\lambda$: $s(\lambda)=\sum_i\tilde c^i(\lambda)=\sum_{i=1}^d\frac{\tilde a^i}{W\left(\frac{\tilde a^i e^{\lambda+1}}{\tilde g^i}\right)}$. This is a monotonically decreasing function with $s(0)\le 1$. We can thus perform a simple bisection search to approximate the optimal value of $\lambda$, and therefore deduce an arbitrarily fine approximation of the Jeffreys frequency centroid.

Table 1: Experimental performance ratios and statistics (average, minimum, maximum) for the 30000+ images of the Caltech-256 database, reporting $\alpha_c$ (optimal positive centroid), $\alpha_{\tilde c'}$ (normalized approximation), the normalizing coefficient $w_c\le 1$, and $\alpha_{\tilde c''}$ (Veldhuis' approximation). Observe that $\alpha_c=\frac{J(\mathcal{H},c)}{J(\mathcal{H},\tilde c)}\le 1$ and that $\tilde c'$ is almost optimal. Veldhuis' simple half normalized arithmetic-geometric approximation performs on average with a 6.56% error but can be far from the optimal in the worst case (over 35%).

Instead of performing the dichotomic search on $\lambda$, which yields an approximation scheme with linear convergence, we rather consider the fixed-point equation:

$$\lambda^*=-\mathrm{KL}(\tilde c(\lambda^*):\tilde g), \qquad (5)$$
$$\tilde c^i(\lambda^*)=\frac{\tilde a^i}{W\left(\frac{\tilde a^i e^{\lambda^*+1}}{\tilde g^i}\right)},\quad\forall i\in\{1,\ldots,d\}. \qquad (6)$$

Notice that Eq. 5 is a fixed-point equation $x=g(x)$ with $g(x)=-\mathrm{KL}(\tilde c(x):\tilde g)$ in $\lambda$, from which the Jeffreys frequency centroid is recovered: $\tilde c=\tilde c(\lambda^*)$. Therefore, we consider a fixed-point iteration scheme: Initially, let $\tilde c_0=\tilde a$ (the arithmetic mean) and $\lambda_0=-\mathrm{KL}(\tilde c_0:\tilde g)$. Then we iteratively update $\tilde c$ and $\lambda$ for $l=1,2,\ldots$ as follows:

$$\tilde c_l^i=\frac{\tilde a^i}{W\left(\frac{\tilde a^i e^{\lambda_{l-1}+1}}{\tilde g^i}\right)}, \qquad (7)$$
$$\lambda_l=-\mathrm{KL}(\tilde c_l:\tilde g), \qquad (8)$$

and reiterate until convergence, up to machine precision, is reached. We experimentally observed faster convergence than the dichotomic search (see next section). In general, the existence, uniqueness and convergence rate of fixed-point iteration schemes $x=g(x)$ are well studied (see [19], the Banach contraction principle, Brouwer's fixed-point theorem, etc.).
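A condensed R sketch of this fixed-point scheme is given below (our own shortened version of the FixedPointSKL routine of the Appendix; it assumes the lambertW0 helper sketched after Theorem 1 and frequency histograms stored as the rows of set with weights weight summing to one):

# Fixed-point iteration (Eqs. 5-8) for the Jeffreys frequency centroid.
# Assumes: lambertW0() as sketched earlier; rows of `set` are frequency histograms.
kl <- function(p, q) sum(p * log(p / q))           # KL divergence of normalized histograms

fixedPointJeffreysFrequencyCentroid <- function(set, weight, maxiter = 100) {
  a <- colSums(weight * set)                       # normalized arithmetic weighted mean
  g <- apply(set, 2, function(col) prod(col^weight))
  g <- g / sum(g)                                  # normalized geometric weighted mean
  ct <- a                                          # initialization: c_0 = a
  for (l in 1:maxiter) {
    lambda <- -kl(ct, g)                           # Eq. (8)
    nct <- a / lambertW0(a * exp(lambda + 1) / g)  # Eq. (7), coordinate-wise
    if (max(abs(nct - ct)) < .Machine$double.eps) break
    ct <- nct
  }
  ct
}

The loop alternates Eqs. 7 and 8 until the centroid is stable up to machine precision.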
4 Experimental results

Figure 1 displays the positive and frequency histogram centroids of two renowned image histograms.

Figure 1: Two grey-valued images, (a) Barbara and (b) Lena, with their frequency intensity histograms (c) in red and blue, respectively. (d) The exact positive (blue) and approximated frequency (red) Jeffreys centroids of those two histograms.

In order to carry out quantitative experiments, we used a multi-precision floating-point package to handle calculations and control arbitrary precisions. An R code is also provided in the Appendix. We chose the Caltech-256 database [17], consisting of 30607 images labeled into 256 categories, to perform experiments: We consider the set of intensity histograms $\mathcal{H}$ (RGB color pixels are converted to grey intensity pixels $I$ by a weighted combination of the R, G and B channels). For each of the 256 categories, we consider the set of histograms falling inside this category and compute the exact Jeffreys positive centroid $c$, its normalized Jeffreys approximation $\tilde c'$, and the optimal frequency centroid $\tilde c$. We also consider the average of the arithmetic and geometric normalized means, $\tilde c''=\frac{\tilde a+\tilde g}{2}$. We evaluate the average, minimum and maximum of the ratio $\alpha_x=\frac{J(\mathcal{H},x)}{J(\mathcal{H},\tilde c)}$ for $x\in\{c,\tilde c',\tilde c''\}$. The results are reported in Table 1. Furthermore, to study the best/worst/average performance of the normalized Jeffreys positive centroid $\tilde c'$, we ran 10 trials as follows: We draw two random binary histograms ($d=2$), calculate a fine-precision approximation of $\tilde c$ using numerical optimization, and calculate the approximation obtained by using the normalized closed-form centroid $\tilde c'$. We gather statistics on the ratio $\alpha=\frac{J(\tilde c')}{J(\tilde c)}\ge 1$.
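A minimal sketch of this synthetic $d=2$ experiment (our own illustrative code; since a binary frequency centroid has the form $(t,1-t)$, base R's optimize() over $t$ serves as the fine-precision reference, and lambertW0 is the helper sketched after Theorem 1):

# Compare the normalized closed-form centroid c~' against a numerically optimized
# Jeffreys frequency centroid for two random binary (d = 2) frequency histograms.
jeffreys <- function(p, q) sum((p - q) * log(p / q))
jObj <- function(set, weight, x) sum(weight * apply(set, 1, jeffreys, q = x))

p1 <- runif(1); p2 <- runif(1)
set <- rbind(c(p1, 1 - p1), c(p2, 1 - p2)); weight <- c(0.5, 0.5)

# Normalized closed-form approximation c~' (Theorem 1 + normalization).
a <- colSums(weight * set)
g <- apply(set, 2, function(col) prod(col^weight))
cpos <- a / lambertW0(a * exp(1) / g)        # Jeffreys positive centroid
capprox <- cpos / sum(cpos)                  # c~'

# Fine-precision frequency centroid: optimize over t with c~ = (t, 1 - t).
tstar <- optimize(function(t) jObj(set, weight, c(t, 1 - t)),
                  interval = c(1e-12, 1 - 1e-12), tol = 1e-14)$minimum
copt <- c(tstar, 1 - tstar)

alpha <- jObj(set, weight, capprox) / jObj(set, weight, copt)   # >= 1, close to 1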
We find experimentally that $\bar\alpha$, $\alpha_{\max}$ and $\alpha_{\min}$ are all very close to one. Although $\tilde c'$ almost matches $\tilde c$ in those two real-world and synthetic experiments, it remains open to express analytically and exactly its worst-case performance.

We compare experimentally the convergence rate of the dichotomic search with the fixed-point iterative scheme: In the R implementation reported in the Appendix, we set the precision to .Machine$double.xmin (about $2.2\times 10^{-308}$). The double-precision floating-point format has a 52-bit mantissa: This coincides with the number of iterations in the bisection search method. We observe experimentally that the fixed-point iteration scheme requires on average 5 to 7 iterations.

5 Conclusion

We summarize the two main contributions of this paper: (1) we proved that the Jeffreys positive centroid admits a closed-form formula expressed using the Lambert $W$ function, and (2) we proved that normalizing this Jeffreys positive centroid yields a tight guaranteed approximation to the Jeffreys frequency centroid. We noticed experimentally that the closed-form normalized Jeffreys positive centroid almost coincides with the Jeffreys frequency centroid, and can therefore be used in Jeffreys $k$-means clustering. Notice that since the $k$-means assignment/relocation algorithm monotonically converges even if, instead of computing the exact cluster centroids, we update them with provably better centroids (i.e., by applying one bisection iteration of the Jeffreys frequency centroid computation or one iteration of the fixed-point algorithm), we end up with a converging variational Jeffreys frequency $k$-means that requires to implement a stopping criterion. Jeffreys divergence is not the only way to symmetrize the Kullback-Leibler divergence. Other KL symmetrizations include the Jensen-Shannon divergence [6], the Chernoff divergence [7], and a smooth family of symmetric divergences including the Jensen-Shannon and Jeffreys divergences [18].

References

[1] F. Nielsen, "Jeffreys Centroids: A Closed-Form Expression for Positive Histograms and a Guaranteed Tight Approximation for Frequency Histograms," IEEE Signal Processing Letters, vol. 20, no. 7, pp. 657-660, 2013.

[2] B. Bigi, "Using Kullback-Leibler distance for text categorization," in Proceedings of the 25th European Conference on IR Research (ECIR), Springer-Verlag, 2003, pp. 305-319.
[3] G. Csurka, C. Bray, C. Dance, and L. Fan, "Visual categorization with bags of keypoints," Workshop on Statistical Learning in Computer Vision (ECCV), pp. 1-22, 2004.

[4] V. Chandrasekhar, G. Takacs, D. M. Chen, S. S. Tsai, Y. A. Reznik, R. Grzeszczuk, and B. Girod, "Compressed histogram of gradients: A low-bitrate descriptor," International Journal of Computer Vision, vol. 96, no. 3, pp. 384-399, 2012.

[5] H. Jeffreys, "An invariant form for the prior probability in estimation problems," Proceedings of the Royal Society of London, vol. 186, no. 1007, pp. 453-461, 1946.

[6] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, vol. 37, pp. 145-151, 1991.

[7] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations," Annals of Mathematical Statistics, vol. 23, pp. 493-507, 1952.

[8] H. Liu and R. Setiono, "Chi2: Feature selection and discretization of numeric attributes," in Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (TAI), 1995, pp. 388-391.

[9] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705-1749, 2005.

[10] M. Mignotte, "Segmentation by fusion of histogram-based k-means clusters in different color spaces," IEEE Transactions on Image Processing (TIP), vol. 17, no. 5, pp. 780-787, 2008.

[11] F. Nielsen and S. Boltz, "The Burbea-Rao and Bhattacharyya centroids," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5455-5466, 2011.

[12] R. N. J. Veldhuis, "The centroid of the symmetrical Kullback-Leibler distance," IEEE Signal Processing Letters, vol. 9, no. 3, pp. 96-99, 2002.

[13] F. Nielsen and R. Nock, "Sided and symmetrized Bregman centroids," IEEE Transactions on Information Theory, vol. 55, no. 6, pp. 2048-2059, June 2009.

[14] R. Nock, P. Luosto, and J. Kivinen, "Mixed Bregman clustering with approximation guarantees," in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Springer-Verlag, 2008, pp. 154-169.

[15] D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry, "Real values of the W-function," ACM Transactions on Mathematical Software, vol. 21, no. 2, pp. 161-171, 1995.

[16] W. Szpankowski, "On asymptotics of certain recurrences arising in universal coding," Problems of Information Transmission, vol. 34, no. 2, pp. 142-146, 1998.

[17] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Institute of Technology, Tech. Rep. 7694, 2007.

[18] F. Nielsen, "A family of statistical symmetric divergences based on Jensen's inequality," CoRR abs/1009.4004, 2010.

[19] A. Granas and J. Dugundji, Fixed Point Theory, Springer Monographs in Mathematics, 2003.
Appendix: R source code

options(digits = 22)
PRECISION = .Machine$double.xmin
setwd("C:/testR")
lenah = read.table("Lena.histogram")
barbarah = read.table("Barbara.histogram")

# Lambert W function (principal branch W0) by Halley's root-finding iterations.
LambertW0 <- function(z) {
  S = 0.0
  for (n in 1:100) {
    Se = S * exp(S)
    S1e = Se + exp(S)
    if (PRECISION > abs((z - Se) / S1e)) { break }
    S = S - (Se - z) / (S1e - (S + 2) * (Se - z) / (2 * S + 2))
  }
  S
}

# Jeffreys divergence (Eq. 1) between two positive histograms.
JeffreysDivergence <- function(p, q) {
  d = length(p)
  res = 0
  for (i in 1:d) { res = res + (p[i] - q[i]) * log(p[i] / q[i]) }
  res
}

normalizeHistogram <- function(h) {
  d = length(h)
  result = rep(0, d)
  wnorm = cumsum(h)[d]
  for (i in 1:d) { result[i] = h[i] / wnorm }
  result
}

# Coordinate-wise weighted arithmetic mean of the histograms (rows of `set`).
ArithmeticCenter <- function(set, weight) {
  dd = dim(set); n = dd[1]; d = dd[2]
  result = rep(0, d)
  for (j in 1:d) { for (i in 1:n) { result[j] = result[j] + weight[i] * set[i, j] } }
  result
}

# Coordinate-wise weighted geometric mean of the histograms.
GeometricCenter <- function(set, weight) {
  dd = dim(set); n = dd[1]; d = dd[2]
  result = rep(1, d)
  for (j in 1:d) { for (i in 1:n) { result[j] = result[j] * (set[i, j]^(weight[i])) } }
  result
}

cumulativeSum <- function(h) {
  d = length(h)
  cumsum(h)[d]
}

# Closed-form Jeffreys positive centroid (Theorem 1).
JeffreysPositiveCentroid <- function(set, weight) {
  dd = dim(set); d = dd[2]
  result = rep(0, d)
  a = ArithmeticCenter(set, weight)
  g = GeometricCenter(set, weight)
  for (i in 1:d) { result[i] = a[i] / LambertW0(exp(1) * a[i] / g[i]) }
  result
}

# Normalized positive centroid c~' (guaranteed approximation of Theorem 2).
JeffreysFrequencyCentroidApproximation <- function(set, weight) {
  result = JeffreysPositiveCentroid(set, weight)
  w = cumulativeSum(result)
  result = result / w
  result
}

cumulativeCenterSum <- function(na, ng, lambda) {
  d = length(na)
  cumul = 0
  for (i in 1:d) { cumul = cumul + (na[i] / LambertW0(exp(lambda + 1) * na[i] / ng[i])) }
  cumul
}

JeffreysFrequencyCenter <- function(na, ng, lambda) {
  d = length(na)
  result = rep(0, d)
  for (i in 1:d) { result[i] = na[i] / LambertW0(exp(lambda + 1) * na[i] / ng[i]) }
  result
}

# Jeffreys frequency centroid by bisection search on lambda (Section 3).
JeffreysFrequencyCentroid <- function(set, weight) {
  na = normalizeHistogram(ArithmeticCenter(set, weight))
  ng = normalizeHistogram(GeometricCenter(set, weight))
  d = length(na)
  bound = rep(0, d)
  for (i in 1:d) { bound[i] = na[i] + log(ng[i]) }
  lmin = max(bound) - 1
  lmax = 0
  while (lmax - lmin > 1.0e-10) {
    lambda = (lmax + lmin) / 2
    cs = cumulativeCenterSum(na, ng, lambda)
    if (cs > 1) { lmin = lambda } else { lmax = lambda }
  }
  JeffreysFrequencyCenter(na, ng, lambda)
}

# Demo: centroids of the Barbara/Lena grey intensity histograms (Figure 1).
demoJeffreysCentroid <- function() {
  n <- 2
  d <- 256
  weight = c(0.5, 0.5)
  set <- array(0, dim = c(n, d))
  set[1, ] = lenah[, 1]
  set[2, ] = barbarah[, 1]
  approxP = JeffreysPositiveCentroid(set, weight)
  approxF = JeffreysFrequencyCentroidApproximation(set, weight)
  approxFG = JeffreysFrequencyCentroid(set, weight)
  pdf(file = "SKLJeffreys-BarbaraLena.pdf", paper = "A4")
  typedraw = "l"
  plot(set[1, ], type = typedraw, col = "black", xlab = "grey intensity value", ylab = "Percentage")
  lines(set[2, ], type = typedraw, col = "black")
  lines(approxP, type = typedraw, col = "blue")
  lines(approxF, type = typedraw, col = "green")
  lines(approxFG, type = typedraw, col = "red")
  title(main = "Jeffreys frequency centroids", sub = "Barbara/Lena grey histograms",
        col.main = "black", font.main = 3)
  dev.off()
  graphics.off()
}
demoJeffreysCentroid()

randomUnitHistogram <- function(d) {
  result = runif(d, 0, 1)
  result = result / (cumsum(result)[d])
  result
}

# Demo on many randomly sampled frequency histograms.
n <- 10
d <- 25
set <- array(0, dim = c(n, d))
weight <- runif(n, 0, 1)
cumulw <- cumsum(weight)[n]
for (i in 1:n) {
  set[i, ] = randomUnitHistogram(d)
  weight[i] = weight[i] / cumulw
}
A = ArithmeticCenter(set, weight)
G = GeometricCenter(set, weight)
nA = normalizeHistogram(A)
nG = normalizeHistogram(G)
nApproxAG = 0.5 * (nA + nG)
optimalP = JeffreysPositiveCentroid(set, weight)
approxF = JeffreysFrequencyCentroidApproximation(set, weight)
AFapprox = 1 / cumulativeSum(optimalP)
approxFG = JeffreysFrequencyCentroid(set, weight)
pdf(file = "SKLJeffreys-ManyHistograms.pdf", paper = "A4")
typedraw = "l"
plot(set[1, ], type = typedraw, col = "black", xlab = "grey intensity value", ylab = "Percentage")
for (i in 2:n) { lines(set[i, ], type = typedraw, col = "black") }
lines(approxFG, type = typedraw, col = "blue")
lines(approxF, type = typedraw, col = "green")
lines(optimalP, type = typedraw, col = "red")
title(main = "Jeffreys frequency centroids", sub = "many randomly sampled frequency histograms",
      col.main = "black", font.main = 3)
dev.off()

# Bisection search for the optimal Lagrange multiplier lambda.
Lambda <- function(set, weight) {
  na = normalizeHistogram(ArithmeticCenter(set, weight))
  ng = normalizeHistogram(GeometricCenter(set, weight))
  d = length(na)
  bound = rep(0, d)
  for (i in 1:d) { bound[i] = na[i] + log(ng[i]) }
  lmin = max(bound) - 1
  lmax = 0
  while (lmax - lmin > 1.0e-10) {
    lambda = (lmax + lmin) / 2
    cs = cumulativeCenterSum(na, ng, lambda)
    if (cs > 1) { lmin = lambda } else { lmax = lambda }
  }
  lambda
}

JeffreysFrequencyCentroid <- function(set, weight) {
  na = normalizeHistogram(ArithmeticCenter(set, weight))
  ng = normalizeHistogram(GeometricCenter(set, weight))
  lambda = Lambda(set, weight)
  JeffreysFrequencyCenter(na, ng, lambda)
}

# Extended Kullback-Leibler divergence between positive histograms.
KullbackLeibler <- function(p, q) {
  d = length(p)
  res = 0
  for (i in 1:d) { res = res + p[i] * log(p[i] / q[i]) + q[i] - p[i] }
  res
}

# Check the fixed-point characterization lambda = -KL(c~ : g~) on the bisection solution.
lambda = Lambda(set, weight)
ct = JeffreysFrequencyCentroid(set, weight)
na = normalizeHistogram(ArithmeticCenter(set, weight))
ng = normalizeHistogram(GeometricCenter(set, weight))
lambdap = -KullbackLeibler(ct, ng)
lambda
abs(lambdap - lambda)

OneIteration <- function() {
  lambda = -KullbackLeibler(c, ng)
  c = JeffreysFrequencyCenter(na, ng, lambda)
}

NbIteration <- function(kk) {
  for (i in 1:kk) { OneIteration() }
  lambda
}

Lambda <- function(set, weight) {
  na = normalizeHistogram(ArithmeticCenter(set, weight))
  ng = normalizeHistogram(GeometricCenter(set, weight))
  c = na
}

Same <- function(p, q) {
  res = TRUE
  d = length(p)
  for (i in 1:d) { if (p[i] != q[i]) res = FALSE }
  res
}

# Fixed-point iteration scheme of Section 3.3 (Eqs. 5-8).
FixedPointSKL <- function(set, weight) {
  nbiter <<- 0
  na = normalizeHistogram(ArithmeticCenter(set, weight))
  ng = normalizeHistogram(GeometricCenter(set, weight))
  center = na
  repeat {
    lambda = -KullbackLeibler(center, ng)
    nbiter <<- nbiter + 1
    nc = JeffreysFrequencyCenter(na, ng, lambda)
    if (Same(center, nc) == TRUE | (nbiter > 100)) { break } else { center = nc }
  }
  center
}

lambda = FixedPointSKL(set, weight)
fpc = JeffreysFrequencyCenter(na, ng, lambda)
cc = JeffreysFrequencyCentroid(set, weight)
cumulativeSum(cc)
cumulativeSum(fpc)

TestJeffreysFrequencyMethods <- function() {
  n <- 10
  d <- 25
  set <- array(0, dim = c(n, d))
  weight <- runif(n, 0, 1)
  cumulw <- cumsum(weight)[n]
  for (i in 1:n) {
    set[i, ] = randomUnitHistogram(d)
    weight[i] = weight[i] / cumulw
  }
  J1 = JeffreysFrequencyCentroid(set, weight)